Search 101 - how search engines work - part 1
Having some background on how search engines work is very useful when you're trying to optimize your site. We hope we have some perspective on this, having spent the best part of a decade implementing search engines and web crawlers.
How do words on a web page end up searchable? This happens in three phases:
- Crawling (or spidering) the web, finding pages people want to search
- Indexing words on web pages
- Searching the index (i.e. the bit that happens when you type a search into Google)
The two important components here are the web crawler (which we'll cover in this post) and the index (which works just like the index in a book, and which we'll cover in the next post).
The web crawler has a very simple job to do: it visits each page on the web and does two things:
- Adds words on the page to the index
- Adds links on the page to the list of pages to visit
Once it's finished with a page it moves to the next page on the list to visit and repeats the process.
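The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the "web" is a hypothetical in-memory dictionary (TOY_WEB), the page names are made up, and the `seen` set that stops the crawler revisiting pages is our own assumption rather than something covered in this post.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (text on the page, links on the page).
TOY_WEB = {
    "a.htm": ("search engines crawl pages", ["b.htm", "c.htm"]),
    "b.htm": ("crawlers follow links", ["a.htm"]),
    "c.htm": ("pages contain words", []),
}

def crawl(start_url, web):
    """Visit pages reachable from start_url and build a word -> URLs index."""
    index = {}                      # word -> set of URLs containing that word
    to_visit = deque([start_url])   # the list of pages still to visit
    seen = {start_url}              # assumption: never queue the same page twice

    while to_visit:
        url = to_visit.popleft()
        if url not in web:
            continue                # link points at a page that doesn't exist
        text, links = web[url]

        # Step 1: add the words on the page to the index.
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)

        # Step 2: add the links on the page to the list of pages to visit.
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)

    return index

index = crawl("a.htm", TOY_WEB)
print(sorted(index["pages"]))
```

Searching is then just a lookup: `index["pages"]` gives the set of pages containing the word "pages".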
It all sounds simple, so what can go wrong?
Note: In reality, it's a bit more complex, since large search engines like Google speed things up by running multiple web crawlers and storing multiple copies of the index, but the basic process is the same.
Extracting text and links: HTML
When a search engine scans a page for text, it's looking for the words between your HTML tags. It also extracts links from tags like <a href="somelink.htm">.
As an example, consider the tag above. If you miss out the closing quote after the link name you end up with broken HTML, but some browsers (e.g. Internet Explorer) have error handling code that detects this and guesses that the closing angle bracket ends the link. Most search engines are less forgiving and include the angle bracket and all the text that follows it as part of the link. This means somelink.htm won't get indexed.
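To see why a missing closing quote is so damaging, here's a minimal sketch of an unforgiving link extractor. It's an assumption made for illustration, not the code of any particular search engine: it simply treats everything between href=" and the next double quote as the link. The sample HTML and the other.htm page are made up.

```python
import re

# Deliberately strict rule (illustrative assumption): a link is whatever sits
# between href=" and the next double quote, with no error recovery.
HREF_PATTERN = re.compile(r'href="([^"]*)"')

def extract_links(html):
    """Return every href value found by the strict pattern."""
    return HREF_PATTERN.findall(html)

good_html = '<a href="somelink.htm">Some link</a> and <a href="other.htm">Other</a>'
# Same markup, but the closing quote after somelink.htm is missing.
broken_html = '<a href="somelink.htm>Some link</a> and <a href="other.htm">Other</a>'

print(extract_links(good_html))
print(extract_links(broken_html))
```

With the quote missing, the extracted "link" runs on past the angle bracket into the text that follows, so somelink.htm never makes it onto the list of pages to visit. In this sketch the second link is swallowed as well.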
Extracting text: Images
Search engines can't see any text inside your images, which presents a problem if all your text is in images. Some search engines do index ALT text and the image file name, though (notably Google Image Search).
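As a rough sketch of what a crawler can and can't see, the snippet below uses Python's standard html.parser module to collect the words between tags plus any ALT text on images. The class name, function name and sample markup are our own, and real search engines will differ in the details.

```python
from html.parser import HTMLParser

class TextAndAltExtractor(HTMLParser):
    """Collects the words between tags and the ALT text of <img> tags."""

    def __init__(self):
        super().__init__()
        self.words = []     # words found between HTML tags
        self.alt_text = []  # ALT attribute values found on images

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_text.append(alt)

    def handle_data(self, data):
        self.words.extend(data.split())

def extract(html):
    """Return (words, alt_text) for a fragment of HTML."""
    parser = TextAndAltExtractor()
    parser.feed(html)
    parser.close()
    return parser.words, parser.alt_text

# A heading rendered as an image with no ALT text: nothing to index.
print(extract('<img src="headline.png">'))
# The same image with ALT text gives the crawler something to work with.
print(extract('<img src="headline.png" alt="Spring sale">'))
```

The first page gives the crawler no words at all, however much text is visible in the image itself.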
Extracting text: Flash
Flash is problematic, although some (but not all) search engines can extract text and links from Flash movies. Don't depend on this though - the search engines use a tool provided by Adobe which does a good job most of the time, but it can't read all Flash movies and crashes in some cases. You won't see the crash of course, but the text of movies triggering the crash will never appear in any search indexes.
Preventing indexing problems
So what can you do to stop this happening? Finding all these problems manually is very difficult, but by amazing coincidence we produce a tool called SortSite that checks every page on a web site for these sorts of problems.