What Web Crawlers See

Web crawlers are also called automatic indexers, bots, web spiders or web robots.

How does a web crawler work?

Many people seem to be under the impression that a web crawler or web spider crawls along the Internet, establishes itself on the web servers it finds and reads the contents of the hard disk to find what it wants. That is what a virus would do: establish itself on a machine and then execute various actions. Most web servers, ideally all, wouldn't allow such a thing to happen, because it breaches every security concept. So, how does a web crawler work?

Very simple: the executable code never leaves the machine on which it runs. It sends requests to web servers and is served web pages, or other resources, in response. If it finds links in a page, it sends requests for the pages to which these links point. That's it. The page served to the web crawler is exactly the same as the one served to your browser. It doesn't matter whether the data on the page is static or was loaded into the page from a database on the server; the web crawler sees it.
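The request-and-follow loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine does it: the names LinkCollector, fetch_over_http and crawl, and the max_pages limit, are my own assumptions. A real crawler would also honour robots.txt and pace its requests, which this sketch leaves out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href values of the <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Ask the web server for a page with a plain HTTP GET, as a browser does."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=10):
    """Request a page, collect its links, then request the linked pages."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # url -> HTML exactly as the server sent it
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # page could not be served: skip it
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links against the current page, drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Note that nothing here runs on the web server. The crawler only sends requests and reads the responses, so calling crawl("https://example.com/") returns the same HTML a browser would have been sent for those pages.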

Does the web crawler see what I see in my browser?

The web crawler sees what the server serves it. The question is whether the web crawler can make sense of everything served to it. Your browser executes JavaScript on your machine and renders the result for you. Can a web crawler do this? Only if it was programmed to do so, and the conventional wisdom is that nothing generated by JavaScript is visible to a web crawler. The same goes for any Flash content: your browser has a plug-in that renders the Flash content on your machine. Do web crawlers see Flash content? Some may, some may not. I'd play it safe and not rely on web crawlers to see any Flash content. It is generally accepted that web crawlers don't see images and videos or hear sound files. So, if you want the crawler to make some sense of your images, they must have the "alt" attribute with a text value briefly describing the image, like "ACME company logo".

So what does the web crawler see?

It sees plain old text, whether delivered from a database or as static contents of a web page. This includes the alt attributes of images and the meta-data elements in the header. To see a web page much as a crawler would see it, download the Lynx browser, which is a text-only browser, and use it to look at your website.
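If you'd rather not install Lynx, a rough approximation of the crawler's view can be sketched with Python's standard html.parser module. The class CrawlerView, the function visible_text and the SAMPLE page are made up for this illustration. The sketch assumes a crawler that keeps text, image alt attributes and named meta elements, and that never executes scripts; real crawlers differ in the details.

```python
from html.parser import HTMLParser


class CrawlerView(HTMLParser):
    """Keeps page text, image alt text and named meta elements; skips scripts."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img" and attrs.get("alt"):
            self.parts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.parts.append(f'{attrs["name"]}: {attrs["content"]}')

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def visible_text(html):
    """Return the pieces of text this simplified crawler would pick up."""
    parser = CrawlerView()
    parser.feed(html)
    parser.close()
    return parser.parts


# A made-up page: the script would write text in a browser, but is never run here.
SAMPLE = """<html><head>
<title>ACME</title>
<meta name="description" content="Anvils and rockets">
</head><body>
<img src="logo.png" alt="ACME company logo">
<p>Welcome to ACME.</p>
<script>document.write("Today's offer");</script>
</body></html>"""

if __name__ == "__main__":
    for part in visible_text(SAMPLE):
        print(part)
```

For the sample page this prints the title, the description, the alt text of the logo and the paragraph. The text that the script would have written in a browser never shows up, because nothing executes the script.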

What do I take from this?

You want your site to feature on search engines, so you do some Search Engine Optimisation. What you take from this is that whatever you want the search engine to know about your pages must be present as text in the page as it is served to the web crawler. You don't have to be afraid that text loaded from a database won't be visible to the web crawler.

And finally...

Let us know if you have anything to add or any remarks. There is much I haven't said about web crawlers, but the idea was to let you know what they see and take into account, and what they don't.
