First of all, the internet is big. How big? Click here to find out the number of web pages.
Considering the size of the world wide web, search engines do a pretty good job of figuring out what you’re looking for, based on a couple of words you type into a search bar.
They do such a good job that most people don’t even look past the first five or ten results.
How do they accomplish this?
Let’s start by imagining a search for the “best toothbrush.”
Step one: indexing
A search engine doesn’t look through actual web pages. Rather, it looks through a massive index, stored on a server, that has been assembled, and is constantly being updated, by “web crawlers”: programs that request web pages and store every word found in them.
Once you have the index that tells you which words are on every page, the computer can search it to find which pages have both the words “best” and “toothbrush” and show you the results.
They might end up looking something like this:
All of these pages contain the words “best” and “toothbrush,” but which one do I really want, and how could a computer tell the difference?
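In code, the index is just a map from each word to the set of pages that contain it. Here is a minimal sketch in Python; the page texts are made up for illustration:

```python
# A minimal inverted index: for each word, the set of pages containing it.
pages = {
    "page1": "the best electric toothbrush of the year",
    "page2": "best hiking trails near you",
    "page3": "how to choose a toothbrush",
}

index = {}
for page, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(page)

# An AND query intersects the page sets for each query word.
results = index["best"] & index["toothbrush"]
print(results)  # → {'page1'}
```

Real crawlers deal with billions of pages, but the core idea is the same: look words up in the index rather than scanning the pages themselves.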
The location trick
Not only does a computer index every word on a webpage, but it also indexes the location of every word on every page.
This allows you to look for an exact phrase.
Let’s say I hear a song and can only remember this snippet of the lyrics: “got such dark eyes.”
If I put those words in quotes, a search engine can search its index for any page that has the word “got” at location n, “such” at location n+1, “dark” at location n+2, and “eyes” at location n+3, and show me only those results.
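That phrase search can be sketched with a positional index, which stores (page, location) pairs for each word. The lyrics pages here are invented:

```python
# A positional index: each word maps to (page, position) pairs,
# which lets us check for words at consecutive locations.
pages = {
    "lyrics1": "she has got such dark eyes",
    "lyrics2": "dark eyes got such fans",
}

index = {}
for page, text in pages.items():
    for pos, word in enumerate(text.split()):
        index.setdefault(word, []).append((page, pos))

def phrase_search(phrase):
    words = phrase.split()
    hits = set()
    for page, pos in index.get(words[0], []):
        # Each following word must sit at the next position on the same page.
        if all((page, pos + i) in index.get(w, [])
               for i, w in enumerate(words[1:], start=1)):
            hits.add(page)
    return hits

print(phrase_search("got such dark eyes"))  # → {'lyrics1'}
```

Both pages contain all four words, but only the first contains them as a consecutive phrase.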
But search engines also use the location trick to determine relevance.
In my toothbrush example, the closer the word “best” is to the word “toothbrush,” the more relevant the computer decides the web page is.
This test helps to weed out results like this:
But it probably would not weed out the page claiming a “toothbrush works best to get gum out of carpet,” since the word “best” is only one word away from “toothbrush.”
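One rough way to sketch this proximity test is to measure the smallest gap between the two query words on a page. As the text notes, this test has limits: with these invented page texts, the gum-removal page scores exactly as well as a genuine toothbrush page.

```python
# Score proximity as the minimum distance between two words' positions.
# A smaller distance suggests a more relevant page.
def min_distance(text, w1, w2):
    words = text.split()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return min(abs(a - b) for a in pos1 for b in pos2)

page_a = "the best electric toothbrush money can buy"
page_b = "a toothbrush works best to get gum out of carpet"

print(min_distance(page_a, "best", "toothbrush"))  # → 2
print(min_distance(page_b, "best", "toothbrush"))  # → 2
```

Both pages score a distance of 2, so proximity alone cannot tell them apart — which is exactly why the tricks below are needed.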
But there’s another trick search engines use. It’s called:
The metaword trick
To understand the metaword trick, you have to know that a web page can be broken down into various parts: head, title, body, etc., and these parts are marked off using HTML tags.
The title of a web page might be constructed like this:
<title>Title of the document</title>
The metaword trick rests on assumptions such as this: if the words “best” and “toothbrush” are in the title of a web page, it’s probably more relevant to a search for best toothbrush than results where those words are merely found in the body.
To accomplish this aspect of search, a search engine must also index the location of each of the tags.
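A toy version of the metaword trick might index title and body words separately and weight title matches more heavily. The pages here and the 3-to-1 weighting are invented for illustration; they are not Google’s or AltaVista’s actual numbers:

```python
import re

# Two made-up pages, each with a <title> and a <body> section.
pages = {
    "page1": "<title>Gum removal tips</title><body>the best toothbrush for carpet gum</body>",
    "page2": "<title>The best toothbrush of 2015</title><body>our full review</body>",
}

def score(html, query_words):
    title = re.search(r"<title>(.*?)</title>", html).group(1).lower().split()
    body = re.search(r"<body>(.*?)</body>", html).group(1).lower().split()
    s = 0
    for w in query_words:
        s += 3 * title.count(w)  # a title hit counts more...
        s += 1 * body.count(w)   # ...than a body hit
    return s

ranked = sorted(pages, key=lambda p: score(pages[p], ["best", "toothbrush"]),
                reverse=True)
print(ranked)  # → ['page2', 'page1']
```

The page with “best toothbrush” in its title wins, even though the other page also contains both words.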
The metaword trick catapulted AltaVista to the top of the internet search game in 1995.
And then came Google and its PageRank.
The Hyperlink Trick
PageRank is based on the popularity of a website, and popularity is mainly determined by the incoming hyperlinks a web page has. That is, how many other websites link to your page. The more incoming links you have, the higher your PageRank.
But shouldn’t an incoming link from a popular website be worth more than a link from a website nobody has ever heard of? Yes.
The Authority Trick
All websites start out with an authority score of 1, but if a page has incoming links, its authority is calculated by adding up the authority scores of the pages that point to it. (Actually, a web page passes along its authority score divided by the number of outgoing links it contains.)
Here’s a diagram from John MacCormick’s book Nine Algorithms That Changed the Future.
Because 100 pages with a PageRank of 1 linked to Alice Water’s home page, she passed on a PageRank of 100 to Bert’s scrambled eggs recipe, while John MacCormick’s home page had only two incoming links and passed on a PageRank of 2 to Ernie’s recipe. Therefore, Bert’s recipe will appear first in the results of searches for scrambled eggs.
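One round of this authority calculation can be sketched like so, with each page splitting its score evenly among its outgoing links. The link graph and starting scores are simplified from the book’s diagram:

```python
# Each page divides its authority among its outgoing links.
# Here, both home pages have exactly one outgoing link.
links = {
    "alice": ["bert"],        # Alice's home page links to Bert's recipe
    "mccormick": ["ernie"],   # MacCormick's home page links to Ernie's recipe
}
# Alice's score of 100 comes from her 100 incoming links of score 1 each.
authority = {"alice": 100, "mccormick": 2, "bert": 0, "ernie": 0}

new_authority = {page: 0 for page in authority}
for page, outgoing in links.items():
    share = authority[page] / len(outgoing)  # split evenly across links
    for target in outgoing:
        new_authority[target] += share

print(new_authority["bert"], new_authority["ernie"])  # → 100.0 2.0
```

Bert’s recipe ends up with far more authority than Ernie’s, purely because of who links where.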
But there is a potential problem here. What if we have a cycle of hyperlinks, like this:
If C, D, and E all start with a rank of 1, then page A, which all three link to, will have a rank of 3, which it will pass along to page B, and B will pass it on to page E. But now A’s score is out of date, since E’s rank has changed. And if we adjust A, we’ll have to adjust B, and then E, and around and around it goes.
The Random Surfer Trick
The solution to this problem is an algorithm that simulates a random surfer, who starts on a random page and randomly follows a chain of hyperlinks, jumping to a new random page 15% of the time.
The more hyperlinks a page has pointing to it, and the more popular the pages that link to it (and the fewer outgoing links those pages have), the more often the random surfer will visit, and the higher its PageRank will be.
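The random surfer can be simulated deterministically by repeatedly redistributing rank: 85% of a page’s rank flows along its links, and 15% is spread evenly across all pages, which is what breaks the endless cycle above. Here is a sketch using the cycle from the text:

```python
# PageRank via the random-surfer model: follow a random link with
# probability 0.85, jump to a random page with probability 0.15.
def pagerank(links, damping=0.85, iterations=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share...
        new_rank = {p: (1 - damping) / n for p in pages}
        # ...plus a share of rank from each page linking to it.
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# The cycle from the text: C, D, and E link to A; A links to B; B links to E.
links = {"A": ["B"], "B": ["E"], "C": ["A"], "D": ["A"], "E": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # → A
```

Instead of chasing updates around the cycle forever, the scores simply settle after enough iterations, with A (the page everyone links to) ranked highest.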
Though there are many other factors that influence Google searches, it is this idea that catapulted Google to the top of the internet search game.
But to run a successful search engine you need more than a good algorithm. Click here to look inside one of Google’s data centers.
And here are 9 questions you need to answer.