This is a proposal to indicate a preferred host name (e.g. domain with or without “www”) for search engine robots by adding a “Canonical-host” entry to the robots.txt file.
Valid host values are as per RFC 2396 and RFC 2732, i.e. “hostname | IPv4address | [IPv6address]”.
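To make the grammar above concrete, here is a minimal sketch of how a robot might validate a candidate “Canonical-host” value. The regular expressions are deliberately simplified approximations of the RFC 2396/2732 productions, not full implementations, and the function name is only illustrative:

```python
import re

# Simplified approximations of "hostname | IPv4address | [IPv6address]"
# from RFC 2396 / RFC 2732 -- illustrative, not a complete RFC grammar.
_HOSTNAME = re.compile(
    r"^(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)*"   # domain labels
    r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.?$"           # top label
)
_IPV4 = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
_IPV6_BRACKETED = re.compile(r"^\[[0-9A-Fa-f:.]+\]$")      # very loose

def is_valid_canonical_host(value: str) -> bool:
    """Return True if value looks like a legal Canonical-host value."""
    return bool(
        _HOSTNAME.match(value)
        or _IPV4.match(value)
        or _IPV6_BRACKETED.match(value)
    )
```

A conforming robot would presumably ignore a Canonical-host line whose value fails this kind of check.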
Making a web site accessible both with and without a “www” host name has always been common practice, and it remains the default configuration offered by most ISPs under managed hosting plans. While potentially useful from a usability standpoint (both www.example.com and its shorter form, example.com, work when typed into a browser’s address field), this causes several problems as soon as the different URLs pointing to the same host are published on the web, regardless of the site maintainer’s preference for one specific form.
Known issues include:
- Discrepancy between search engine result URLs and URL preference of site maintainers
- Inconsistencies within search engine results (some pages on a site listed with “www”, others without)
- The same site is listed as having a different “page rank” with and without “www”
- Failed matches between search engine results and categorization schemes
- Difficulty, for a search engine, in accurately determining whether different URLs pointing to the same IP address (e.g. HTTP 1.1 virtual hosts) are actually meant to point to the same web site or not (after all, the content itself can change between crawls)
- No clarity about the number of actual sites indexed by search engines (are different uncanonicalized URLs pointing to the same web site counted multiple times?)
Solutions to this are limited in part because:
- Inconsistent incoming links (e.g. with and without “www”) are not under the control of a site’s maintainer
- While HTTP redirects could be used to express a preference (e.g. by permanently redirecting accesses to example.com to www.example.com, or vice versa), not all managed hosting providers give the customer access to such configuration options
Resorting to robots.txt to solve this problem comes naturally for several reasons:
- Robots.txt provides a method “for encoding instructions to visiting robots”
- Robots.txt is popular among robots
- Robots.txt is always accessible by a site’s maintainer
- Robots.txt is already site-centered (one robots.txt per site)
- Martijn Koster’s “A Method for Web Robots Control” RFC allows for extensions to the robots.txt format (“extension = token : *space value [comment] CRLF”)
While this discussion centers on the presence or lack of the “www” host name, which is a very practical and frequent issue, the aim is to propose a flexible solution that can be applied to other situations as well.
In consideration of the above, the proposal is made to define an extension token named “Canonical-host”, allowing the maintainer of a web site to indicate a preferred host name value to be used by robots to access and index the site.
- Robots should interpret and follow this preference in the same way as they would process a permanent HTTP redirect (status 301)
- Search engines and web categorization systems (“directories”) should consider the preference as a request to update their host name records, if required
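The following sketch illustrates both halves of this definition: a robots.txt file carrying the proposed “Canonical-host” entry, and how a robot might apply it by rewriting crawled URLs exactly as it would after following a permanent (301) redirect. The parsing is deliberately minimal and the helper names are hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

# Example robots.txt using the proposed extension token.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Canonical-host: www.example.com
"""

def parse_canonical_host(robots_txt: str):
    """Return the Canonical-host value from a robots.txt body, or None."""
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "canonical-host":
            # Strip an optional trailing comment, per the robots.txt format.
            return value.split("#")[0].strip() or None
    return None

def canonicalize(url: str, robots_txt: str) -> str:
    """Rewrite url onto the preferred host, as with a 301 redirect."""
    host = parse_canonical_host(robots_txt)
    if not host:
        return url
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, host, parts.path,
                       parts.query, parts.fragment))
```

For example, `canonicalize("http://example.com/page?id=1", ROBOTS_TXT)` yields `http://www.example.com/page?id=1`, which is the URL a search engine would then index.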
Post Scriptum: Robots.txt vs. rel="canonical"
In 2009 the major search engines announced support for the rel="canonical" attribute.
Although it works from a per-page rather than per-site perspective, the new implementation addresses many of the needs covered by this proposal. At the same time, though, it requires adding a tag to each page, and it cannot be applied where the content administrator has no control over the HTML headers, e.g. with many CMS systems or with web services, nor to non-HTML content (audio, video, images, etc.).
As of 2012, both Yandex and Google support a “Host” directive that is substantially the same as the one I was proposing under “Canonical-host”.
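For comparison, a robots.txt file using that directive might look like the following (the syntax shown matches Yandex’s documented form; the domain is just a placeholder):

```
User-agent: *
Disallow: /private/
Host: www.example.com
```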