Thursday, May 6, 2010

Make Search Engines Use Your Keywords with the Canonical Tag

Search engines used to index keywords listed in a web page's meta keywords tag, like this:
<meta name="keywords" content="Schools Out Forever Maximum Ride" />

However, so many sites overloaded that tag with spam that search engines started ignoring it entirely. The challenge for site owners became where else to put keywords that search engines would still see. People noticed that Google looked for keywords not only in page text, but also in domain names and URLs. So the trick became how to get keywords into your URLs.

Usually a URL maps one-to-one to a file name on the web server (database-driven sites may instead use IDs in query strings). Webmasters could put keywords in file and directory names, but that gets tedious because anything between / characters is generally also a physical sub-directory, and it doesn't work at all for database-driven sites. Relying on physical file and directory names would scatter files across so many sub-directories that the server would become impossible to maintain.

The good thing is there's no law that says a URL has to exactly equal a physical file name. So one solution is to set up your web server to rewrite each incoming URL into the real file name.

For example, all these URLs render the same content:
http://www.amazon.com/dp/0446618896
http://www.amazon.com/Schools-Out-Forever-Maximum-Ride/dp/0446618896
http://www.amazon.com/asdf-asdf-asdf-asdf-asdf/dp/0446618896

But how do you tell Google what your preferred URL is, since it could find any of those URLs? That's where the canonical tag comes in.

If you look at the source code for the pages at any of those URLs and find the canonical tag, you'll see that they all use the same value, no matter what the actual URL was:
<link rel="canonical" href="http://www.amazon.com/Schools-Out-Forever-Maximum-Ride/dp/0446618896" />

So Google should generally link to http://www.amazon.com/Schools-Out-Forever-Maximum-Ride/dp/0446618896 from its index, no matter what URL its spider really found the page at.
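
If you'd rather check a page's canonical tag programmatically than read the source by hand, here is a minimal Python sketch using only the standard library. The names CanonicalFinder and find_canonical are made up for this illustration, and the sample markup is just the tag quoted above wrapped in a bare page, not Amazon's real source.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, so split it first.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# Hypothetical sample page containing the tag shown above.
sample_page = (
    "<html><head>"
    '<link rel="canonical" '
    'href="http://www.amazon.com/Schools-Out-Forever-Maximum-Ride/dp/0446618896" />'
    "</head><body></body></html>"
)
print(find_canonical(sample_page))
```

Running the same function over the HTML fetched from each URL variant is a quick way to confirm they all declare the same preferred address.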

The trick Amazon uses to make all those pages render the same thing is probably web server URL rewriting: ignore anything between "http://www.amazon.com/" and "/dp/0446618896" and simply serve whatever content is at location 0446618896 (or in their case, whatever's in the database with that ID). URL rewriting is an arcane topic, but it should be familiar to system administrators who manage web servers.
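
To make the idea concrete, here is a rough Python sketch of that kind of rewrite rule. It is only a guess at the general technique, not Amazon's actual configuration: the pattern PRODUCT_PATH and the function resolve_product_id are hypothetical names, and a real site would more likely express this in its web server's rewrite rules or its application router.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape: an optional keyword segment, then /dp/<product ID>.
# Whatever appears in the keyword segment is ignored entirely.
PRODUCT_PATH = re.compile(r"^/(?:[^/]+/)?dp/(?P<product_id>[0-9A-Za-z]+)/?$")


def resolve_product_id(url):
    """Return the product ID a URL points at, or None if it isn't a product URL."""
    path = urlsplit(url).path
    match = PRODUCT_PATH.match(path)
    return match.group("product_id") if match else None


urls = [
    "http://www.amazon.com/dp/0446618896",
    "http://www.amazon.com/Schools-Out-Forever-Maximum-Ride/dp/0446618896",
    "http://www.amazon.com/asdf-asdf-asdf-asdf-asdf/dp/0446618896",
]
for url in urls:
    # All three resolve to the same ID, so they would serve the same content.
    print(url, "->", resolve_product_id(url))
```

Once the ID is extracted, the server looks up the content by that ID alone, which is why the keyword portion can say anything.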

Since Amazon can then include any keywords in their URLs, the other thing they probably do is ensure consistency in how they link each product. So no matter where they have their links (sitemaps, site search, product listing pages, etc.), they always use a single preferred canonical URL.
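
One way to get that consistency is to build every link through a single helper, so no page assembles product URLs on its own. The Python sketch below assumes the URL pattern seen in the examples above; slugify and canonical_product_url are hypothetical names for illustration, and the slug rule (keep letters and digits, join with hyphens) is an assumption rather than Amazon's real one.

```python
import re


def slugify(title):
    """Turn a title into hyphen-separated keywords, dropping punctuation."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "-".join(words)


def canonical_product_url(title, product_id, base="http://www.amazon.com"):
    """Build the one preferred URL for a product from its title and ID."""
    slug = slugify(title)
    if not slug:
        # No usable keywords: fall back to the bare ID form.
        return f"{base}/dp/{product_id}"
    return f"{base}/{slug}/dp/{product_id}"


print(canonical_product_url("Schools Out Forever Maximum Ride", "0446618896"))
```

The same helper can supply both the href of the canonical tag and every internal link, so the two never drift apart.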