Google Caffeine
12th August 2009 Posted in Search
Google recently announced developer tests of its new search architecture, codenamed Google Caffeine. This is a new set of algorithms underlying search that are optimised for speed. Google regularly update their search algorithm, so why is this different? Read on.
Regular Updates
Google makes constant updates to their algorithm in their search for relevancy in the SERPs (search engine results pages). Google’s approach is that people will continue to use their search engine if it continues to provide the most useful results. These changes are made by different people, each of whom sees only part of the whole picture, and are intended to give extra weight to pages with good semantics, pages others have found useful, pages for large brands, company homepages and so on.
This makes SEO a bit of a dark art. Changes are unannounced and irregularly scheduled, weightings are kept secret, and it’s never clear exactly what the requirements are. There are experts around who specialise in reverse engineering the Google algorithm - running tests to see exactly which changes do and don’t move them up the rankings, in both the short and the long term. This is the main source of SEO knowledge and it keeps shifting.
Google Caffeine Update
But this one is different. This is not an algorithm optimised for relevancy; it is optimised for efficiency. For most people, the main area of difference will be speed. Google’s underlying architecture (Google File System - GFS) is subject to several sets of constraints (see below). This update is the first step in a large attempt to improve speed and reliability across all the Google platforms: search, YouTube, Gmail, etc.
Updating the search algorithm in this way can’t be done without changes in the overall search results. Already we see small but noticeable changes in the rankings of certain websites when running tests for the normal search versus the new Caffeine search.
But we are seeing improvements in speed. Significant ones. The entire page loads more quickly when returning search results. See the two screenshots below for comparison.
[Screenshots: regular search results page; Caffeine search results page]
The other thing we notice is that we get more results for the Caffeine search. It appears to be indexing blog posts, forums and Twitter more efficiently. These are difficult to index because in general you get a lot of duplicate content and a lot of noise. Google are traditionally rather conservative about what is included, to try to keep the results clean. Now it looks like the new algorithm is a little more sophisticated at finding the relevant content amongst these and making it available in the search results.
Google File System
Google’s file system (GFS) was developed in the early days of Google as a way to store terabytes of information (mostly crawler and indexing logs) that can be hunted through by their algorithms. Each cell (normally one data centre) contains a single master that holds the main table and a series of chunkservers that hold the data. Now that Google have grown so much that they are looking at storing petabytes of information instead, this one-master setup has become a bit of a bottleneck.
The other issue is that the system uses 64MB chunk sizes. That is, to extract a piece of data it has to load a whole chunk into memory and search through what is a fairly large block. This is fine for search indices, because such index files can be several gigabytes in size. However, for the newer applications Google are using more and more (Gmail for instance), these chunks are too large to be efficient.
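To make the master-and-chunks idea concrete, here is a minimal sketch of how a byte offset in a file could be turned into a chunk number and then looked up in the master's table. The file name, the table contents and the `storage-node-*` names are all made up for illustration; only the 64MB chunk size comes from the description above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the chunk size quoted above

# Hypothetical master table: (file name, chunk index) -> machine storing that chunk.
master_table = {
    ("/logs/crawl-0001", 0): "storage-node-a",
    ("/logs/crawl-0001", 1): "storage-node-b",
    ("/logs/crawl-0001", 2): "storage-node-c",
}


def locate(file_name, byte_offset):
    """Return (chunk index, offset inside that chunk, machine holding the chunk)."""
    chunk_index, offset_in_chunk = divmod(byte_offset, CHUNK_SIZE)
    node = master_table[(file_name, chunk_index)]
    return chunk_index, offset_in_chunk, node


# Byte 100MB of the file falls 36MB into the second chunk (index 1).
print(locate("/logs/crawl-0001", 100 * 1024 * 1024))
```

Every lookup goes through the one `master_table`, which is the single point that the one-master design has to scale.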
GFS is only part of the story. To actually process this data you need systems sitting on top of the file system to structure, store and query it. Google’s traditional tool for this is MapReduce. It takes a required operation, maps it out to several servers that each process and search their own bits of the data, then reduces the results back together to give you the output required. Alongside it Google built BigTable, a simple set of key-value pairs that, by being very structured in its keys, greatly reduces how much of the database needs to be searched, and that distributes the processing effectively across the hundreds of thousands of machines in Google’s server farms.
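The map-then-reduce pattern can be sketched in a few lines. This is a toy, single-machine word count, not Google's implementation: the function names and the sample data are assumptions made for illustration, and the "servers" are just loop iterations.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one piece of the data."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_phase(pairs):
    """Combine all emitted pairs into one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def map_reduce(documents):
    # In a real cluster each document would be mapped on a different server;
    # here each loop iteration stands in for one server.
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    return reduce_phase(mapped)


shards = ["google caffeine search", "caffeine is fast", "search is fast"]
print(map_reduce(shards))
```

The map step never needs to see more than its own piece of the data, which is what lets the work be spread across many machines.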
These are both written with GFS in mind - BigTable specifically to try to get around some of the limitations of GFS and its poor fit for non-search applications.
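As a rough illustration of why well-structured keys cut down how much data has to be searched, the sketch below keeps rows sorted by key and uses a binary search to jump straight to one site's rows. The key layout (reversed domain plus path) and the row contents are hypothetical, and this is a single in-memory list rather than anything distributed.

```python
import bisect

# Hypothetical rows, keyed so that pages from the same site sort next to each other.
rows = sorted([
    ("com.example/about", "About page"),
    ("com.example/index", "Home page"),
    ("com.google/search", "Search page"),
    ("org.wikipedia/gfs", "GFS article"),
])
keys = [key for key, _ in rows]


def scan_prefix(prefix):
    """Return every row whose key starts with prefix, without a full scan."""
    start = bisect.bisect_left(keys, prefix)
    # "\uffff" sorts after any character used in these keys, so it marks the end of the range.
    end = bisect.bisect_left(keys, prefix + "\uffff")
    return rows[start:end]


print(scan_prefix("com.example/"))
```

Only the rows inside the matching key range are touched; the rest of the table is never read.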
But there’s a new GFS on the horizon. For two years now Google have been working on a replacement for GFS. We don’t expect to see it in operation soon - it’s an enormous job to change over the entire system one piece at a time - but it works using multiple masters per cell (removing the bottleneck and reducing the latency that is a killer for applications like YouTube) and 1MB chunk sizes (meaning you don’t need to load so much data into memory when you’re likely to be using lots of little pieces of information, as with Gmail).
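A quick back-of-the-envelope calculation shows why the smaller chunk size matters for small items. It takes the simplification used above, that whole chunks are loaded to read an item, and assumes a hypothetical 20KB email that does not straddle a chunk boundary.

```python
MB = 1024 * 1024


def bytes_loaded(item_size, chunk_size):
    """Bytes pulled in to read one item, assuming whole chunks are loaded
    and the item starts at the beginning of a chunk."""
    chunks_needed = -(-item_size // chunk_size)  # ceiling division
    return chunks_needed * chunk_size


email = 20 * 1024  # hypothetical 20KB email
for label, chunk_size in (("64MB", 64 * MB), ("1MB", 1 * MB)):
    loaded = bytes_loaded(email, chunk_size)
    print(f"{label} chunks: load {loaded // 1024}KB to read {email // 1024}KB")
```

Under these assumptions the 1MB layout pulls in 64 times less data for the same small read.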
There’s no official word yet about Caffeine, but it has been in the works for months and it seems likely that it will have been written with this new GFS in mind. Hopefully as a search application it will be able to take advantage of the best features of the replacement file system and we should start seeing better searches that can index more pages more efficiently whilst returning them faster than ever before.
To learn more about Caffeine read the blog post here.
To try it for yourself, visit:
http://www2.sandbox.google.com/search?hl=en&q=XX&aq=f&oq=&aqi=g10&gl=uk
and replace the XX with your search term (this is for UK searches).
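If you want to generate these test URLs for a list of search terms, a small helper along these lines would do it. The parameter values are copied from the URL above; the function name is made up, and treating `gl` as the country setting is an assumption based on the UK note.

```python
from urllib.parse import urlencode

SANDBOX_URL = "http://www2.sandbox.google.com/search"


def caffeine_url(term, country="uk"):
    """Build the sandbox search URL for a term, escaping spaces and symbols."""
    params = {"hl": "en", "q": term, "aq": "f", "oq": "", "aqi": "g10", "gl": country}
    return SANDBOX_URL + "?" + urlencode(params)


print(caffeine_url("google caffeine"))
```

Using `urlencode` rather than pasting the term in by hand means multi-word or punctuated searches are escaped correctly.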
To learn more about GFS and its replacement, read an interview with Sean Quinlan from Google here.
