Integrity and Scrutiny software support - FAQs

 

Before emailing your question, please take a quick look at the FAQs below to see whether it has already been answered.

Also see Integrity's home page or Scrutiny's home page for full version history and other information.

Failing that, please email me at shiela@peacockmedia.co.uk. Don't forget to tell me about your Mac, your version of OS X and the url that you're starting from.

Can I put in a username and password to crawl pages that require authentication?

This can have disastrous results because some website systems have a web interface with controls (including 'delete' buttons) that look to web crawlers like links.

I have now included this feature in Scrutiny, with the necessary advice, warnings and disclaimers.

What does the "page titles are unique" option do?

Choosing this option is a quicker and more accurate way to crawl your site, but it only works if each of your pages has a different title.

After checking each internal link, the app has to then fetch the contents of the page, read through it and pull out the links from that page. That's how it crawls the site. It'll get a link like "index.html" lots of times (on every page perhaps) so before fetching the contents, it has to decide whether it's done that page already. It compares the new link with the list of those it's already done.

Integrity used to use the url to determine this. However, it's often the case that the same page is referred to by a number of different urls - e.g. peacockmedia.co.uk and peacockmedia.co.uk/index.html are the same page, but a web crawler can't know that. Some content management systems can refer to the same page by quite a few different urls. That means that the app could do lots more work than it needed to, and over-report the number of links and pages. With 'page titles are unique' switched on, the app can recognise a page it has already done by its title, whichever url was used to reach it.
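The over-counting problem can be illustrated with a short Python sketch. This is not Integrity's actual code. It assumes the option works by comparing page titles rather than urls, and the `pages` data and the `count_unique` helper are made up for the example.

```python
# Hypothetical crawl results: (url, page title) pairs. The first two
# urls point at the same page, as in the example above.
pages = [
    ("peacockmedia.co.uk", "PeacockMedia"),
    ("peacockmedia.co.uk/index.html", "PeacockMedia"),
    ("peacockmedia.co.uk/support.html", "Support"),
]

def count_unique(pages, by_title):
    """Count the pages a crawler would treat as distinct."""
    seen = set()
    for url, title in pages:
        # Assumption: the 'already done' test uses either the title or the url.
        seen.add(title if by_title else url)
    return len(seen)

print(count_unique(pages, by_title=False))  # home page counted twice
print(count_unique(pages, by_title=True))   # home page counted once
```

If two genuinely different pages shared a title, the second would be skipped, which is why the option only suits sites where every title is different.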

Should I set "ignore querystrings"?

The querystring is information within the url of a page. It follows a '?' - for example www.mysite.co.uk/index.html?thisis=thequerystring. If you don't use querystrings on your site, then it won't matter whether you set this option. If your page is the same with or without the querystring (for example, if it contains a session id) then check 'ignore querystrings'. If the querystring determines which page appears (for example, if it contains the page id) then you shouldn't ignore querystrings, because Integrity or Scrutiny won't crawl your site properly.
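To see what ignoring the querystring amounts to, here is a minimal Python sketch using the standard library. The `strip_querystring` helper and the example urls are illustrative only and are not taken from the apps.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_querystring(url):
    """Return the url without its querystring (the part after '?')."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

# Two visits to the same page with different session ids.
a = strip_querystring("http://www.mysite.co.uk/index.html?sessionid=abc123")
b = strip_querystring("http://www.mysite.co.uk/index.html?sessionid=xyz789")
print(a)
print(a == b)  # both urls now count as the same page

# The danger: pages chosen by a page id also collapse into one.
c = strip_querystring("http://www.mysite.co.uk/index.html?pageid=1")
d = strip_querystring("http://www.mysite.co.uk/index.html?pageid=2")
print(c == d)
```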

What does altering the number of threads do?

Using more threads may crawl your site faster, but it will use more of your computer's resources and your internet bandwidth.

Using fewer will allow you to use your computer while the crawl is going on with the minimum disruption.

The default is seven, minimum is one and maximum is 30. I've found that using more than this has little effect.
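As a rough picture of what the thread setting controls, the Python sketch below checks a list of urls with a limited number of simultaneous workers. It is an illustration under assumptions, not how the apps are written: `check_all` and `fake_check` are hypothetical names, and `fake_check` stands in for a real network request.

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(urls, check, threads=7):
    """Check every url using at most `threads` requests at once."""
    threads = max(1, min(threads, 30))  # keep within the 1 to 30 range
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the urls.
        return dict(zip(urls, pool.map(check, urls)))

def fake_check(url):
    """Stand-in for a real request; always reports success."""
    return 200

urls = ["http://www.mysite.co.uk/page%d.html" % n for n in range(5)]
print(check_all(urls, fake_check, threads=3))
```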

I need to make Integrity or Scrutiny appear to be a 'real' browser

You can change the user-agent string to make it appear to the server to be a browser (known as 'spoofing').

Go to Preferences and paste your chosen user-agent string into the box.

There is an incredibly comprehensive list of browser user-agent strings on this page: http://www.zytrax.com/tech/web/browser_ids.htm

If you would like to find the user-agent string of the browser you're using now, just hit this link:
What's my user-agent string?
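For anyone curious about what spoofing looks like at the request level, here is a small Python sketch. The user-agent string shown is a made-up placeholder, not a real browser's string, and `build_request` is a hypothetical helper. Nothing is sent over the network unless you uncomment the last line.

```python
import urllib.request

# Placeholder only: substitute a real string copied from your browser.
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X) ExampleBrowser/1.0"

def build_request(url, user_agent=USER_AGENT):
    """Build a request that announces itself with the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

request = build_request("http://www.mysite.co.uk/index.html")
print(request.get_header("User-agent"))

# To actually send it:
# response = urllib.request.urlopen(request)
```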

What's the difference between 'checking' and 'following'?

In a nutshell, checking means just asking the server for the status of that page without actually visiting the page. Following means visiting that page and scraping all the links off it.

Checking a link means sending a request and receiving a status code (200, 404, whatever). Integrity or Scrutiny will check all of the links it finds on your starting page. If you've checked 'Check this page only' then it stops there.

But otherwise, it'll take each of those links it's found on your first page and 'follow' them. That means requesting and loading the content of the page, then going through the content finding the links on that page. It adds all the links it finds to its list and then goes through those checking them, and if appropriate, following them in turn. Note that it won't 'follow' external links, because it would then be crawling someone else's site - it just needs to 'check' external links.
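The check-then-follow loop described above can be sketched in Python as follows. This is a simplified illustration, not the apps' real code: `check` and `fetch` are hypothetical callables supplied by the caller, and the small `site` dictionary stands in for a real web server so that the example runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start_url, check, fetch):
    """Check every link found, but follow only good internal links."""
    host = urlsplit(start_url).netloc
    statuses, followed, to_follow = {}, set(), [start_url]
    while to_follow:
        page = to_follow.pop()
        if page in followed:
            continue
        followed.add(page)
        parser = LinkExtractor()
        parser.feed(fetch(page))                  # 'following' the page
        for href in parser.links:
            link = urljoin(page, href)
            if link not in statuses:
                statuses[link] = check(link)      # 'checking' the link
            if urlsplit(link).netloc == host and statuses[link] == 200:
                to_follow.append(link)            # internal: follow it later
    return statuses

# A made-up two-page site with one external and one broken link.
site = {
    "http://www.mysite.co.uk/": '<a href="about.html">About</a> <a href="http://example.com/">Elsewhere</a>',
    "http://www.mysite.co.uk/about.html": '<a href="/">Home</a> <a href="missing.html">Oops</a>',
}
fetched = []

def fetch(url):
    fetched.append(url)
    return site[url]

def check(url):
    return 200 if url in site or url.startswith("http://example.com") else 404

statuses = crawl("http://www.mysite.co.uk/", check, fetch)
for link, status in statuses.items():
    print(status, link)
```

The external link gets a status but never appears in `fetched`, which is the difference between checking and following.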

You can ask Integrity or Scrutiny to not check certain links, to only follow or not to follow certain links. You do this by typing part of a url into the relevant box. For example, if you want to only check the section of your site below /engineering you would type '/engineering' (without quotes) into the 'Only follow urls containing...' box. (You will also need to start your crawl at a page containing that term).

You don't need to know about pattern matching such as regex or wildcards, just type a part of the url.
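In other words, the matching is a plain 'does the url contain this text' test. A minimal Python sketch of that idea follows. The function and parameter names are hypothetical and simply mirror the boxes described above.

```python
def should_follow(url, only_follow="", do_not_follow=""):
    """Plain substring matching: no regex or wildcards involved."""
    if only_follow and only_follow not in url:
        return False
    if do_not_follow and do_not_follow in url:
        return False
    return True

print(should_follow("http://www.mysite.co.uk/engineering/pumps.html",
                    only_follow="/engineering"))
print(should_follow("http://www.mysite.co.uk/sales/index.html",
                    only_follow="/engineering"))
```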

What do the red and orange colours mean in the list?

To check a link, Integrity sends a request and receives a status code back from your server (200, 404, whatever).

The 'status' column tells you the code that the server returns to Integrity when it checks each link. Codes in the 200s mean that the link is good, codes in the 300s mean there's something not quite right (usually a redirection) but the link still works, codes in the 400s mean that the link is bad and the page can't be accessed, and codes in the 500s mean some kind of error with the server. So the higher the number, the worse the error, and Integrity colours these (by default) white, orange and red.

Integrity helpfully gives you a description of the status as well as the code number, but there's a full list of all the possible status codes here: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
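The grouping of status codes can be written out as a short Python sketch. The `classify` function and its labels are my own shorthand for illustration, and the standard descriptions come from Python's built-in `http.HTTPStatus`, not from Integrity.

```python
from http import HTTPStatus

def classify(code):
    """Group a status code into the four broad ranges."""
    if 200 <= code < 300:
        return "good"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "bad link"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard short description.
    print(code, HTTPStatus(code).phrase, "-", classify(code))
```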

What is the difference between the 'by link' and flat views?

The 'by link' view is a list of links - remember that each link can occur more than once (for example your 'home' link will probably appear on every one of your pages) - so in the 'by link' view, each link will be listed once, and you can open it up to see a list of pages that the link appears on (and if it's broken you'll probably have to fix it on each of those pages).

The 'flat view' is provided for a number of reasons. In this view, the 'by link' list is expanded so that each occurrence of each link has a row in the table. If you're exporting your list to an Excel spreadsheet then this view will be much more suitable.
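The relationship between the two views can be shown with a few lines of Python. The `by_link` data and the `flatten` helper are invented for the example and say nothing about how the apps store their lists internally.

```python
# Hypothetical 'by link' data: each link maps to the pages it appears on.
by_link = {
    "http://www.mysite.co.uk/index.html": [
        "http://www.mysite.co.uk/about.html",
        "http://www.mysite.co.uk/contact.html",
    ],
    "http://www.mysite.co.uk/old.html": [
        "http://www.mysite.co.uk/about.html",
    ],
}

def flatten(by_link):
    """Expand the 'by link' list into one (link, page) row per occurrence."""
    return [(link, page) for link, pages in by_link.items() for page in pages]

for link, page in flatten(by_link):
    print(link, "appears on", page)
```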

You may find that when you're fixing your links you might prefer one view over the other.

When I export to csv the file is corrupt in Excel

Because the 'by link' view shows a list of pages in one column when exported (and this list will show in one cell when opened in Excel) it may break Excel's 256-character limit.

Switch to the 'flat view' before exporting, which expands the 'by link' view so that each occurrence of each link is shown on a separate row. This should solve the problem. If it produces a very large file you may prefer to switch to 'bad links only' before exporting.
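If you want to post-process an export yourself, the sketch below shows the same idea in Python: one row per occurrence, with an optional 'bad links only' filter. It assumes a bad link is one with a status of 400 or above, and the column names and `rows` data are hypothetical, not the apps' real export format.

```python
import csv
import io

# Hypothetical flat-view rows: (link, status code, page the link appears on).
rows = [
    ("http://www.mysite.co.uk/index.html", 200, "http://www.mysite.co.uk/about.html"),
    ("http://www.mysite.co.uk/old.html", 404, "http://www.mysite.co.uk/index.html"),
    ("http://www.mysite.co.uk/old.html", 404, "http://www.mysite.co.uk/about.html"),
]

def export_csv(rows, bad_only=False):
    """Write one row per link occurrence, so no cell holds a long list."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["link", "status", "appears on"])
    for link, status, page in rows:
        if bad_only and status < 400:  # assumption: 400 and above is 'bad'
            continue
        writer.writerow([link, status, page])
    return out.getvalue()

print(export_csv(rows, bad_only=True))
```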

The web server doesn't like fast page requests and stops responding

This isn't uncommon, and there are a couple of things you can do. First of all, the 'threads' slider sets the number of requests that Scrutiny/Integrity can make at once. If you move this to the extreme left, then Scrutiny/Integrity will send one request at a time, and process the result before sending the next. This alone may work. If not, then there's a box beside that slider which allows you to set a delay (in seconds). You can set this to what you like, but a fraction of a second may be enough.
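The effect of one thread plus a delay is simply 'one request, wait, next request'. A minimal Python sketch of that pattern follows. `check_slowly` is a hypothetical name, and `check` is whatever function performs the real request (a stand-in is used here).

```python
import time

def check_slowly(urls, check, delay=0.5):
    """Send one request at a time, pausing `delay` seconds between them."""
    results = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # give the server a rest before the next request
        results[url] = check(url)
    return results

urls = ["http://www.mysite.co.uk/a.html", "http://www.mysite.co.uk/b.html"]
print(check_slowly(urls, lambda url: 200, delay=0.1))
```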

It crashes

I'm happy to investigate, but I'll need to know about your Mac, your version of OS X and the url that you're starting from.

If your site is a larger site (for example, tens of thousands of pages) then the memory use and demand on the processor will really (and necessarily) increase as the lists of pages crawled and links checked get longer.

If the site is large enough then the app will eventually run out of memory and obviously can't continue.

A couple of problems I've diagnosed have involved messageboards on a site. To Integrity and Scrutiny, a well-used messageboard can look like tens of thousands of unique pages and it will try to list and check all of those pages.

You may want to check all pages within that board in order to check the links within postings. But if not (and I'd suggest not), then you can improve matters by ignoring querystrings in the settings. All discussion board pages will be treated as being the same page. Beware though that if any of your content pages rely on a querystring to display their content, then the site may not be crawled properly. In that case you might instead use the 'do not follow' box to exclude discussion board pages.

What does 'Location of Validator' mean?

By default, this screen uses W3C's HTML validation service. This is a donation-funded service.

It's possible to download, install and run the validator on your Mac for free. I can't support you with this, but if you are successful, you can enter the url of your instance of the validator in the appropriate box in Preferences.

As an alternative to installing the validator yourself, I can recommend the free 'Validator S.A.C' from Chuck Houpt. It's easy to download and run. You will need to start the validator as a web service but this is relatively easy too and the instructions are on the same page as the download. You will then need to enter http://localhost/w3c-validator as the location of the validator.

Scrutiny's HTML Validation screen times out

Scrutiny currently starts validation as soon as the first page is crawled, but it respects W3C's wish for automated requests to be no less than a second apart. This is why validation may still be running after your crawl has finished.

The public service is not always available and has limits, reporting 'The Request Timed Out' after a certain number of checks. Consider running your own instance of the validator (see the previous question).

What does the archive feature do?

When Integrity crawls the site, it has to pull in the html code for each page in order to find the links. With the archive mode switched on, it simply saves that html as a file in a location that you specify at the end of the crawl.

If you need to go back and refer to the files or use them as a backup, that's fine, but the app doesn't alter them in any way (e.g. making the links relative), so they're not particularly user-friendly if you want to view them.

What does 'Check for robots.txt and robots meta tag' do? (Scrutiny feature)

The robots.txt file and robots meta tag allow you to indicate to web robots such as the Google robot and Scrutiny that you wish them to ignore certain pages. The preference in Scrutiny is off by default and can be switched on in Preferences. All links are followed and checked regardless of this setting, but if a page is marked as 'noindex' in the robots meta tag or disallowed in the robots.txt file, it will not be included in the sitemap, SEO or validation checks. robots.txt must have a lowercase filename, be placed in the root directory of your website and be constructed as described at http://www.robotstxt.org/robotstxt.html
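If you'd like to test what a robots.txt file disallows, Python's standard library includes a parser. The sketch below uses a made-up robots.txt and made-up urls. The robot name "Scrutiny" passed to `can_fetch` is an assumption for illustration, and this is not a description of how Scrutiny itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real one lives at the root of the site,
# for instance http://www.mysite.co.uk/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ("http://www.mysite.co.uk/index.html",
            "http://www.mysite.co.uk/private/notes.html"):
    allowed = parser.can_fetch("Scrutiny", url)
    print(url, "allowed" if allowed else "disallowed")
```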

Can I give you some money?

Yes - Integrity works for free with no restrictions, but donations are very much appreciated and enable me to spend more time developing, especially if you use it a lot or are a business user. Donate here

Scrutiny is free while in Beta - if you like it then please buy a licence key when it is out of Beta.