One way to gain more control over scan coverage is to define specific paths on the asset that should or should not be crawled and potentially tested during the scan. Under Application Scan settings, you can specify included and avoided paths/URLs.
Use included paths/URLs to specify root-relative paths that should be tested during the scan.
While Application Scan can crawl pages that are accessible from the main page, as well as those that are hidden, we can’t guarantee that we will find and test every potential URL. By specifying included paths, you set paths that the scanner must both crawl and run tests against.
Ex. including path /secret-admin-panel/ for scan profile example.com means the scan will test the http(s)://example.com/secret-admin-panel/ URL, and crawl all additional content that’s linked from that page*.
Use avoided paths/URLs to specify root-relative paths that should not be crawled or tested during the scan.
If you’re aware of any pages or other content that Application Scan shouldn’t tamper with in any manner, or that would take too much scanning time (ex. blog posts, product listings), these can easily be avoided.
Ex. avoiding path /blog for scan profile example.com means the scan won’t cover any URL starting with http(s)://example.com/blog
Use included and avoided paths/URLs in combination to test a specific part of the web application.
Ex. disallowing path /, while allowing path /blog/ for scan profile example.com means the scan will only cover URLs starting with http(s)://example.com/blog/, and won’t even test the root page (http(s)://example.com/) of the website.
Ex. disallowing path /items/, while allowing path /items/1001 could be a useful combination if you have many product pages with the same structure under /items/ and, rather than having our crawler spend time mapping out all of them, you want to test just one.
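The two combinations above can be sketched in a few lines of Python. This is only an illustration, not the scanner's actual logic: the function name is_in_scope is hypothetical, and the rule that an included path takes precedence over an avoided one is an assumption inferred from the examples above.

```python
def is_in_scope(path, included, avoided):
    """Return True if a root-relative path would be covered under this sketch."""
    # Rules without an asterisk are treated as prefixes of the path.
    if any(path.startswith(rule) for rule in included):
        return True  # assumption: an included rule wins over an avoided rule
    if any(path.startswith(rule) for rule in avoided):
        return False
    return True  # paths matched by no rule stay eligible for crawling


# Avoid everything, include only the blog.
print(is_in_scope("/blog/post-1", included=["/blog/"], avoided=["/"]))  # True
print(is_in_scope("/", included=["/blog/"], avoided=["/"]))             # False

# Avoid the product listing, include one representative page.
print(is_in_scope("/items/1001", included=["/items/1001"], avoided=["/items/"]))  # True
print(is_in_scope("/items/1002", included=["/items/1001"], avoided=["/items/"]))  # False
```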
Included and avoided paths/URLs are applied to all origins the scan covers. Ex. for the scan profile example.com, if /admin is specified as an included path, the site has a subdomain blog.example.com, and both sites are accessible via HTTP and HTTPS, the following URLs will be tested:
http://example.com/admin
https://example.com/admin
http://blog.example.com/admin
https://blog.example.com/admin
To prevent this for a subdomain, the subdomain itself should be avoided.
* Note that the limits of Application Scan still apply. If the scope of the scan is too large, the scan may not cover the specified URL.
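How a single root-relative rule fans out across origins can be sketched as below. The function expand_rule and its parameters are hypothetical names chosen for illustration, and the host list is the example.com case described above. The scanner discovers origins on its own, so this only shows the resulting combinations.

```python
from itertools import product


def expand_rule(path, hosts, avoided_hosts=(), schemes=("http", "https")):
    """List the URLs that one root-relative rule applies to."""
    # Hosts that were explicitly avoided drop out before the rule is applied.
    active_hosts = [h for h in hosts if h not in avoided_hosts]
    return [f"{scheme}://{host}{path}" for host, scheme in product(active_hosts, schemes)]


hosts = ["example.com", "blog.example.com"]
for url in expand_rule("/admin", hosts):
    print(url)

# With the subdomain avoided, only the main site remains.
print(expand_rule("/admin", hosts, avoided_hosts=["blog.example.com"]))
```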
Setting up included/avoided paths/URLs
1. Click on your scan profile, then select Application Scan Settings. Use “Which paths/URLs must we include?” to add one or more included URLs. For a large range of URLs, consider using our Forced Browsing functionality instead.
2. Use “Which paths/URLs must we avoid?” for paths you don't want us to touch.
Both included and avoided paths support asterisk wildcards in case you want to apply a rule to a group of pages.
Ex. the website provides product details under paths /product/5/details, /product/6/details, etc. To set all of these as included/avoided paths/URLs, you can use /product/*/details
Note that a trailing asterisk is implied on every path by default: using /admin as an included/avoided path is equivalent to using /admin*. Using an asterisk explicitly removes this behaviour, so /product/*/details is not the same as /product/*/details*.
There is no limit on the number of wildcards in a path pattern. As an example, /blog/*/guestblog/*/details would be supported as well.
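The wildcard rules above can be sketched in Python as follows. This is a minimal illustration, not the scanner's implementation: the helper names are hypothetical, and it assumes that an asterisk matches any run of characters, including slashes, which this article does not state explicitly.

```python
import re


def rule_to_regex(rule):
    """Translate a path rule into a compiled regular expression."""
    # A rule without an explicit asterisk gets the implicit trailing one.
    if "*" not in rule:
        rule += "*"
    # Escape the literal pieces; each asterisk becomes "any characters"
    # (assumption: the wildcard may also span slashes).
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))


def matches(rule, path):
    """Return True if the whole root-relative path matches the rule."""
    return rule_to_regex(rule).fullmatch(path) is not None


print(matches("/admin", "/admin/users"))                             # True
print(matches("/product/*/details", "/product/5/details"))           # True
print(matches("/product/*/details", "/product/5/details/reviews"))   # False
print(matches("/product/*/details*", "/product/5/details/reviews"))  # True
```

The third and fourth calls show why /product/*/details and /product/*/details* differ: once an asterisk appears anywhere in the rule, nothing is implied at the end.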
1. If I want to add a Custom User Behaviour / Recorded Login, would it be affected by the included/avoided paths setting?
Yes, all the Application Scan Settings including the list of included/avoided paths will be passed to the scanner at the very beginning of the scan. This means when replaying the recorded scenarios, the settings provided by you will be respected.
2. Can I avoid scanning dynamic URLs?
No, it is currently not possible to exclude dynamic URLs from scanning.
3. Will avoiding /language/select also avoid /en/language/select?
No. If you wish to avoid /en/language/select, you need to include the beginning of the path as well (the path must start with a leading slash). Here, the path /en/language/select would need to be avoided.
4. Can I avoid a subdomain together with its path, e.g. shop-prod.example.com/products?
No. In this case you would need to either avoid the entire /products path, which then applies to every subdomain, or add the subdomain shop-prod to the list of disallowed subdomains.
If none of these solutions are suitable for you and you would prefer avoiding the absolute URL instead, reach out to us via firstname.lastname@example.org and we’d be happy to help and adjust this setting for you from our end.
5. Can I use a regular expression to include/avoid paths?
Regex is currently not supported.
6. Is there any difference between disallowing /product and /product/ ?
Yes. As the specified path is matched from the start of the root-relative URL, disallowing /product will disallow any path starting with that text, ex. /product/, /products, /productimages/, whereas disallowing /product/ will still allow us to scan /products and /productimages/ if such paths exist on the website.
7. If I disallow '/' and only allow '/products/', will that allow everything that starts with '/products/'? Or do I have to add an additional path if I want the scanner to go to e.g. /products/notebooks/example-notebook?
Allowing the path /products/ should be enough. If, however, you still see these paths missing in your Crawled URLs report after running a new scan, you can try adding /products/notebooks/ to the list of allowed URLs.
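The effect of the trailing slash can be checked with a simple prefix comparison. This is a sketch only: the helper name disallowed_by and the sample paths are made up for illustration.

```python
def disallowed_by(rule, paths):
    """Return the paths that start with the given rule (no wildcards involved)."""
    return [p for p in paths if p.startswith(rule)]


sample_paths = ["/product/", "/product/42", "/products", "/productimages/"]

print(disallowed_by("/product", sample_paths))
# ['/product/', '/product/42', '/products', '/productimages/']
print(disallowed_by("/product/", sample_paths))
# ['/product/', '/product/42']
```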
8. Is there a difference between allowing /products/notebooks and /products/*?
That depends on what you want to achieve.
If you want us to crawl /products/notebooks specifically, allowing exactly the path /products/notebooks will be fine.
If, however, you want to allow anything under /products, you can use a pattern (/products/*).
As an example, the pattern that allows /products/notebooks/A4 as well as /products/sketchbooks/A4 would be /products/*/A4.