Search 101 - how search engines work - part 1

Posted by Mark Rogers on Jan 30, 2008


Having some background on how search engines work is very useful when you're trying to optimize your site. We hope we have a bit of perspective on this, having spent the best part of a decade implementing search engines and web crawlers.

How do words on a web page end up searchable? This happens in three phases:

The two important components here are the web crawler (which we'll cover in this post) and the index (which works just like an index in a book, and which we'll cover in the next post).

Web crawler

The web crawler has a very simple job to do: it visits each page on the web and does two things:

Once it's finished with a page it moves to the next page on the list to visit and repeats the process.
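As a rough illustration, the visit-and-repeat loop above can be sketched in a few lines of Python. This is only a sketch under assumptions: we assume the two jobs are recording the page's words and queueing its links (as the sections below suggest), and `FAKE_WEB`, `crawl` and `fetch` are made-up names standing in for real pages and a real downloader.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on that page).
FAKE_WEB = {
    "/home": ("welcome to our site", ["/about", "/products"]),
    "/about": ("about our company", ["/home"]),
    "/products": ("our products", ["/about", "/missing"]),
}

def crawl(start_url, fetch):
    """Visit every reachable page once; return (index, visit order)."""
    to_visit = deque([start_url])   # the list of pages still to visit
    seen = {start_url}              # stops us queueing the same page twice
    index = {}                      # word -> set of URLs containing it
    order = []                      # pages in the order they were visited
    while to_visit:
        url = to_visit.popleft()
        page = fetch(url)
        if page is None:            # page could not be fetched, so skip it
            continue
        text, links = page
        order.append(url)
        # Assumed job 1: record the words on the page.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Assumed job 2: add unseen links to the list of pages to visit.
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return index, order

index, order = crawl("/home", FAKE_WEB.get)
```

A real crawler would download pages over HTTP and parse the HTML itself, which is where the problems described below come in.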

It all sounds simple, so what can go wrong?

Note: In reality, it's a bit more complex, since large search engines like Google speed things up by running multiple web crawlers and storing multiple copies of the index, but the basic process is the same.

Extracting text and links: HTML

When a search engine scans a page for text, it's looking for words in between your HTML tags. It also extracts links from tags like <a href="somelink.htm">.

As an example, consider the tag above. If you miss out the closing quote from the link name you end up with broken HTML, but some browsers (e.g. Internet Explorer) have error handling code that detects this and guesses that the closing angle bracket ends the link. Most search engines are less forgiving and include the angle bracket and all the text that follows it as part of the link. This means somelink.htm won't get indexed.
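To make the missing-quote problem concrete, here is a small Python sketch comparing a strict and a forgiving way of reading a link. Both rules are our own simplifications for illustration (the names `STRICT_HREF` and `FORGIVING_HREF` are made up) and are not taken from any real browser or search engine. In this sketch the strict rule reads on until it finds the next quote character.

```python
import re

# Strict rule: a quoted link value runs until the next double quote.
STRICT_HREF = re.compile(r'href="([^"]*)"')
# Forgiving rule: the value also ends at a closing angle bracket.
FORGIVING_HREF = re.compile(r'href="([^">]*)')

# The same two links, with and without the first closing quote.
good = '<a href="somelink.htm">Some link</a> <a href="other.htm">Other</a>'
broken = '<a href="somelink.htm>Some link</a> <a href="other.htm">Other</a>'

strict_links = STRICT_HREF.findall(broken)
forgiving_links = FORGIVING_HREF.findall(broken)

print(strict_links)     # the angle bracket and following text end up in the link
print(forgiving_links)  # both links are recovered
```

With the broken HTML, the strict rule returns a single mangled "link" that swallows the angle bracket and the text after it, and the second link is lost as well. The forgiving rule still finds both somelink.htm and other.htm.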

Extracting text: JavaScript

Content created by JavaScript is mostly invisible to search engines. In particular, they can't "run" the JavaScript to produce HTML. The best they can do is try extracting links and words from strings embedded in the JavaScript.
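The "strings embedded in the JavaScript" approach might look something like the Python sketch below. It is a guess at the general idea, not how any particular search engine does it: the script is never run, quoted strings are simply pulled out, and a made-up rule (`LOOKS_LIKE_LINK`, which only recognises .htm and .html names) decides which strings are treated as links.

```python
import re

# Matches double- or single-quoted string literals (no escape handling).
STRING_LITERAL = re.compile(r'"([^"]*)"|\'([^\']*)\'')
# Hypothetical rule of thumb: a string is a link if it ends in .htm or .html.
LOOKS_LIKE_LINK = re.compile(r'\S+\.html?')

script = """
var menu = ["products.htm", "contact.htm"];
function go() { window.location = 'order.htm'; }
document.title = "Welcome to our shop";
"""

def strings_in_script(source):
    """Return the contents of every quoted string, in order of appearance."""
    return [dq or sq for dq, sq in STRING_LITERAL.findall(source)]

strings = strings_in_script(script)
# Strings that look like page names are treated as links...
links = [s for s in strings if LOOKS_LIKE_LINK.fullmatch(s)]
# ...and everything else is split into words for the index.
words = [w for s in strings if not LOOKS_LIKE_LINK.fullmatch(s) for w in s.split()]
```

Anything the script would build at run time, such as a link assembled from several pieces, is missed by this kind of guesswork.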

Note: There are lots of reasons for this but security is one of them. How many recent browser security holes were down to JavaScript problems? 80%, 90%, more?

Extracting text: Images

Search engines can't see any text in your images, which presents a problem if all your text is in images. Some search engines do index ALT text and the image file name (notably Google Image Search).
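As a sketch of what is left for a search engine to work with, the Python below collects the file name and ALT text of each image using the standard library's html.parser module. The class name `ImageTextExtractor` and the sample HTML are made up for illustration, and we are assuming a crawler would do something broadly similar.

```python
from html.parser import HTMLParser
import posixpath

class ImageTextExtractor(HTMLParser):
    """Collects (file name, ALT text) for every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        alt = attributes.get("alt") or ""   # empty when ALT text is missing
        self.found.append((posixpath.basename(src), alt))

html = (
    '<p><img src="/images/red-widget.png" alt="Red widget, front view">'
    '<img src="/images/banner.gif"></p>'
)

extractor = ImageTextExtractor()
extractor.feed(html)
print(extractor.found)
```

The first image gives the search engine a descriptive file name and ALT text to index. The second offers only "banner.gif", however much text the picture itself contains.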

Extracting text: Flash

Flash is problematic, although some (but not all) search engines can extract text and links from Flash movies. Don't depend on this though - the search engines use a tool provided by Adobe which does a good job most of the time, but it can't read all Flash movies and crashes in some cases. You won't see the crash of course, but the text of movies triggering the crash will never appear in any search indexes.

Preventing indexing problems

So what can you do to stop this happening? Finding all these problems manually is very difficult, but by amazing coincidence we produce a tool called SortSite that checks every page on a web site for these sorts of problems.


First posted Jan 2008