Manual for Integrity & Scrutiny - Settings

Blacklists and whitelists - Do not check / Only follow / Do not follow

In a nutshell, checking means just asking the server for the status of that page without actually visiting the page. Following means visiting that page and scraping all the links off it.

Checking a link means sending a request and receiving a status code (200, 404, whatever). Integrity and Scrutiny will check all of the links found on your starting page. If you've selected 'Check this page only' then it stops there.
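
As an illustration (a minimal Python sketch, not the app's actual code), 'checking' a link amounts to no more than this: make one request and record the status code that comes back.

    import urllib.error
    import urllib.request

    def check(url):
        """'Check' a link: make one request and return only the HTTP status code."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return response.status          # e.g. 200
        except urllib.error.HTTPError as error:
            return error.code                   # e.g. 404
        except urllib.error.URLError:
            return None                         # no response at all (bad host, timeout...)

    print(check("https://peacockmedia.co.uk/"))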

Otherwise, the app will take each of the links it's found on your first page and 'follow' them. That means requesting and loading the content of the page, then going through that content to find the links on that page. It adds all the links it finds to its list, then goes through those checking them and, if appropriate, following them in turn.

Note that it won't 'follow' external links, because it would then be crawling someone else's site - it just needs to 'check' external links.
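
Putting those pieces together, the crawl described above behaves roughly like the following sketch (simplified, and mine rather than anything the app actually uses): every link gets checked for a status code, but only internal pages are fetched and scanned for further links.

    import urllib.error
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url):
        site = urlparse(start_url).netloc
        queue = [start_url]
        seen = set()
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            internal = urlparse(url).netloc == site
            # External links are only 'checked' (HEAD); internal ones are 'followed' too (GET and parse)
            request = urllib.request.Request(url, method="GET" if internal else "HEAD")
            try:
                with urllib.request.urlopen(request, timeout=10) as response:
                    print(response.status, url)
                    if internal and "html" in (response.getheader("Content-Type") or ""):
                        collector = LinkCollector()
                        collector.feed(response.read().decode("utf-8", errors="replace"))
                        # Resolve relative links and queue them to be checked (and perhaps followed) in turn
                        found = (urljoin(url, href) for href in collector.links)
                        queue.extend(u for u in found if urlparse(u).scheme in ("http", "https"))
            except urllib.error.HTTPError as error:
                print(error.code, url)
            except urllib.error.URLError:
                print("no response", url)

    crawl("https://peacockmedia.co.uk/")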

You can ask Integrity or Scrutiny not to check certain links, to follow only certain links, or not to follow certain links. You do this by typing part of a url into the relevant box. For example, if you only want to crawl the section of your site below /engineering, you would type '/engineering' (without quotes) into the 'Only follow urls containing...' box. You don't need to know about pattern matching such as regex or wildcards - just type part of the url.
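
The matching really is that simple: a rule applies if the text you typed appears anywhere in the url. In Python terms the rules behave roughly like this (my own sketch; the names aren't the app's):

    def should_check(url, do_not_check):
        """'Do not check urls containing...': skip the link entirely if any term appears in it."""
        return not any(term in url for term in do_not_check)

    def should_follow(url, only_follow, do_not_follow):
        """'Only follow' / 'Do not follow urls containing...': decide whether to fetch and scan the page."""
        if only_follow and not any(term in url for term in only_follow):
            return False
        return not any(term in url for term in do_not_follow)

    # Crawl only the /engineering section of the site
    print(should_follow("https://www.mysite.co.uk/engineering/widgets.html",
                        only_follow=["/engineering"], do_not_follow=[]))    # True
    print(should_follow("https://www.mysite.co.uk/sales/brochure.html",
                        only_follow=["/engineering"], do_not_follow=[]))    # False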

Number of threads

Using more threads may make the crawl faster, but it will use more of your computer's resources and your internet bandwidth.

Using fewer will allow you to carry on using your computer with minimum disruption while the crawl is going on.

The default is seven and the minimum is one. The maximum is 30 - I've found that using more than that has little effect.
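
In effect the setting is a cap on how many requests are in flight at once, along the lines of this sketch (mine, not the app's code):

    import urllib.error
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def check(url):
        """Check one link; more threads means more of these running at the same time."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return url, response.status
        except urllib.error.HTTPError as error:
            return url, error.code
        except urllib.error.URLError:
            return url, None

    urls = ["https://peacockmedia.co.uk/", "https://www.example.com/"]

    # 'Number of threads' is effectively the max_workers value here
    with ThreadPoolExecutor(max_workers=7) as pool:
        for url, status in pool.map(check, urls):
            print(status, url)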

Archive pages while crawling

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. With the archive mode switched on, it simply saves that html as a file in a location that you specify at the end of the crawl.

That's fine if you need to go back and refer to them or use them as a backup, but the app doesn't alter those files in any way (eg by making the links relative), so they're not particularly user-friendly if you want to view them.
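
In other words, archiving amounts to writing each page's html out exactly as it was received, with nothing rewritten - roughly this (a sketch, with a made-up filename scheme):

    import os
    import urllib.request
    from urllib.parse import urlparse

    def archive_page(url, archive_folder):
        """Save a page's raw html, byte for byte, under a filename derived from its url."""
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read()              # exactly as served; links are not rewritten
        name = urlparse(url).path.strip("/").replace("/", "_") or "index"
        if not name.endswith(".html"):
            name += ".html"
        path = os.path.join(archive_folder, name)
        with open(path, "wb") as archive_file:
            archive_file.write(html)
        return path

    print(archive_page("https://peacockmedia.co.uk/", "/tmp"))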

Ignore querystrings

The querystring is information within the url of a page. It follows a '?' - for example www.mysite.co.uk/index.html?thisis=thequerystring. If you don't use querystrings on your site, then it won't matter whether you set this option. If your page is the same with or without the querystring (for example, if it contains a session id) then check 'ignore querystrings'. If the querystring determines which page appears (for example, if it contains the page id) then you shouldn't ignore querystrings, otherwise Integrity or Scrutiny won't crawl your site properly.
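
With the option checked, two urls that differ only in their querystring are treated as the same page; roughly:

    from urllib.parse import urlsplit, urlunsplit

    def without_querystring(url):
        """Strip everything after the '?' so variants of the same page compare equal."""
        scheme, netloc, path, query, fragment = urlsplit(url)
        return urlunsplit((scheme, netloc, path, "", ""))

    a = "https://www.mysite.co.uk/index.html?sessionid=abc123"
    b = "https://www.mysite.co.uk/index.html?sessionid=xyz789"
    print(without_querystring(a) == without_querystring(b))    # True: treated as one page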

Pages have unique titles

Choosing this option is a quicker and more accurate way to crawl your site, but it only works if each of your pages has a different title.

After checking each internal link, the app then has to fetch the contents of the page, read through it and pull out the links from that page. That's how it crawls the site. It'll get a link like "index.html" lots of times (on every page, perhaps), so before fetching the contents it has to decide whether it's done that page already. It compares the new link with the list of those it's already done.

Integrity used to use the url to determine this. However, it's often the case that the same page is referred to by a number of different urls - eg peacockmedia.co.uk and peacockmedia.co.uk/index.html are the same page, but a web crawler can't know that. Some content management systems can refer to the same page by quite a few different urls. That means that the app could do lots more work than it needed to, and over-report the number of links and pages.
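
With 'Pages have unique titles' switched on, the 'have I done this page already?' test is made on the page's title rather than its url, so all of those different urls collapse into one page. A rough sketch of the idea (mine, not the app's code):

    from html.parser import HTMLParser

    class TitleFinder(HTMLParser):
        """Pull the contents of the <title> tag out of a page's html."""
        def __init__(self):
            super().__init__()
            self.title = ""
            self._in_title = False
        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self._in_title = True
        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False
        def handle_data(self, data):
            if self._in_title:
                self.title += data

    def already_visited(html, visited_titles):
        """Decide from the title, not the url, whether this page has been crawled before."""
        finder = TitleFinder()
        finder.feed(html)
        if finder.title in visited_titles:
            return True
        visited_titles.add(finder.title)
        return False

    visited = set()
    page = "<html><head><title>Home</title></head><body>...</body></html>"
    print(already_visited(page, visited))   # False: first time we've seen 'Home'
    print(already_visited(page, visited))   # True: same title, so same page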

Global Preferences

Location of Validator (Scrutiny feature)

By default, this screen uses the W3C's HTML validation service, which is donation-funded.

I believe you can download, install and run the validator on your Mac for free. I can't support you with this, but if you are successful, you can enter the url of your instance of the validator in the appropriate box in Preferences.

User agent string

You can change the user-agent string to make the app appear to the server to be a browser (known as 'spoofing').

Go to Preferences and paste your chosen user-agent string into the box.
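
What the setting does, in effect, is send your chosen string as the User-Agent header with every request. For example (a Python sketch using a Safari-style string; substitute whatever string you want the server to see):

    import urllib.request

    # A Safari-style user-agent string; paste whichever string you want to appear as
    SPOOFED_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                     "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15")

    request = urllib.request.Request("https://peacockmedia.co.uk/",
                                     headers={"User-Agent": SPOOFED_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        print(response.status)    # the server now sees the request as coming from Safari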

There is an incredibly comprehensive list of browser user-agent strings on this page: http://www.zytrax.com/tech/web/browser_ids.htm

If you would like to find the user-agent string of the browser you're using now, just hit this link:
What's my user-agent string?

Ignore leading / trailing slashes and mismatched quotes around urls

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. If the code is hand-written, it may not be perfectly formed and may contain certain problems, but the page will still appear to work properly in a web browser.

If checked, this option allows the app to be more forgiving and overlook these problems, as a web browser does. If unchecked, the app will be less forgiving and will flag up these problems so that you can fix them.
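
As an illustration of the kind of tidying-up this means (my own examples, not a definitive list of what the app forgives), stray or mismatched quotes and slashes are trimmed off a link before it is treated as broken:

    def tidy_href(raw_href):
        """Overlook common hand-coding slips rather than reporting them as problems."""
        href = raw_href.strip()
        # Mismatched or stray quotes around the url, e.g. href="about.html'
        href = href.strip("\"'")
        # A stray trailing slash after a page name, e.g. contact.html/
        if href.endswith(".html/") or href.endswith(".htm/"):
            href = href[:-1]
        return href

    print(tidy_href("about.html'"))       # about.html
    print(tidy_href("/contact.html/"))    # /contact.html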

Check for robots.txt and robots meta tag (Scrutiny feature)

The robots.txt file and the robots meta tag allow you to indicate to web robots, such as the Google robot and Scrutiny, that you wish them to ignore certain pages. This preference is off by default and can be switched on in Scrutiny's Preferences. All links are followed and checked regardless of this setting, but if a page is marked as 'noindex' in the robots meta tag or disallowed in the robots.txt file, it will not be included in the sitemap, SEO or validation checks. robots.txt must have a lowercase filename, be placed in the root directory of your website and be constructed as described at http://www.robotstxt.org/robotstxt.html
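
For example, Python's standard library can read a robots.txt file in much the same way. In this sketch (mine, not Scrutiny's code), a disallowed page would still be link-checked, but left out of the sitemap and the SEO and validation checks:

    from urllib import robotparser

    robots = robotparser.RobotFileParser()
    robots.set_url("https://www.mysite.co.uk/robots.txt")   # lowercase name, in the site's root
    robots.read()

    url = "https://www.mysite.co.uk/private/report.html"
    # The link is still checked and followed either way; this only decides whether the page
    # is included in the sitemap and the SEO / validation checks.
    include_page = robots.can_fetch("*", url)
    print(include_page)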

