The Google index started with 26 million pages in 1998, shot to a billion pages in 2000, and has now hit a new milestone: 1 trillion (as in 1,000,000,000,000) unique URLs. Google says it does not actually index every one of those trillion pages, since indexing is expensive and many of them are near-duplicates of one another or auto-generated content. The way Google indexes has also evolved: it now re-crawls blogs and other rapidly changing websites as often as every 15 minutes. Michael Arrington of TechCrunch hints at something big coming next week that may challenge Google's position of having the most comprehensive index of any search engine.
Google explains the processing behind this milestone: "To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google's index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it'd be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections."
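To make the link-graph idea concrete, here is a minimal PageRank sketch using power iteration over a toy graph of four hypothetical pages. It only illustrates the kind of computation described above; Google's actual pipeline is a continuous, distributed system operating at web scale, and none of the names below come from it.

```python
# Minimal PageRank sketch via power iteration on a toy link graph.
# Purely illustrative -- not Google's production pipeline, which
# re-processes the full web-link graph continuously.

DAMPING = 0.85      # standard damping factor from the original PageRank paper
ITERATIONS = 50     # a fixed iteration count is plenty for a toy graph

def pagerank(links):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution

    for _ in range(ITERATIONS):
        new_rank = {p: (1.0 - DAMPING) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                share = DAMPING * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:
                # Each outlink receives an equal share of this page's rank.
                share = DAMPING * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    # Hypothetical four-page web; keys and values are page identifiers.
    toy_graph = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    for page, score in sorted(pagerank(toy_graph).items()):
        print(f"{page}: {score:.4f}")
```

The batch version Google describes from 1998 corresponds to running an iteration loop like this once over a fixed snapshot of the graph; the current approach amounts to repeating that computation several times a day over a graph a trillion nodes large.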