Three Canonicalization Problems Fixed with .htaccess

Posted By May 4th, 2010

Given the detrimental effect duplicate content can have on search engine rankings, it is surprising how common canonicalization issues are, even on major websites. Fortunately, many of these problems can be solved by using redirects to enforce a single URL for each piece of unique content. In this article, we look at three common canonicalization issues and how they can be fixed using .htaccess and 301 redirects.

For any of this to work you’ll need an Apache server with mod_rewrite enabled. To get started, add the following to the top of your .htaccess file. If this results in a 500 server error, your host probably doesn’t support mod_rewrite or doesn’t allow these directives in .htaccess, and you will need to ask them to change the server configuration.

Options +FollowSymLinks
RewriteEngine On
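
If you would rather have the rules skipped than risk a 500 error on a server without mod_rewrite, one option is to wrap them in an IfModule block. Treat this as a sketch rather than a recommendation: when the module is missing, the redirects simply will not happen, which can hide the problem.

```apache
# Only process the rewrite rules if mod_rewrite is loaded
<IfModule mod_rewrite.c>
    Options +FollowSymLinks
    RewriteEngine On
    # ... rewrite rules from the sections below go here ...
</IfModule>
```

Note that a 500 error can still occur with this wrapper if the host does not permit the Options directive in .htaccess files.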

Domain Canonicalization

While many web users view www.example.com and example.com as functionally equivalent addresses, search engines cannot make such an assumption since, technically, the URLs are different and a webserver could return different content for each. In reality, hosting packages often automatically create a ‘www’ subdomain to serve the same content as the main domain, leading to many pages appearing twice in search results – both with and without the ‘www’ prefix.

The simple fix for this is to pick one domain to use as the canonical hostname and redirect any requests for other domains to that one:

RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
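
If you prefer the bare domain as the canonical hostname, the same approach works in the opposite direction. The following is a minimal sketch, assuming example.com (without the ‘www’ prefix) is the hostname you want to keep; use one variant or the other, never both.

```apache
# Alternative: treat example.com (no www) as the canonical hostname
# and redirect requests for any other hostname to it
RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]
```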

HTTP/HTTPS Canonicalization

Websites handling sensitive information (e.g. ecommerce) will often need to make use of secure connections via the https protocol. Typically, this is only required for certain parts of the site, such as those accessed via a login, but links to the https versions of normal pages often creep into search results.

The following example ensures that any request for a path beginning with ‘/checkout/’ is redirected to https if it is not already using the secure protocol, while attempts to access any other part of the site via https are redirected to the non-https equivalent.

# redirect non-https requests for /checkout to https
RewriteCond %{HTTPS} off
RewriteRule ^checkout/ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
# redirect all other https requests to http
RewriteCond %{HTTPS} on
RewriteCond $1 !^checkout/
RewriteRule ^(.*) http://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
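
If more than one section of the site needs to be secure, the path pattern can be extended with alternation. The sketch below assumes a second, hypothetical secure section at ‘/account/’ and uses a negated RewriteRule pattern in place of the extra RewriteCond; it is meant as a replacement for the rules above, not an addition to them.

```apache
# redirect non-https requests for any secure section to https
# (/account/ is a hypothetical second section, for illustration only)
RewriteCond %{HTTPS} off
RewriteRule ^(checkout|account)/ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
# redirect all other https requests to http
RewriteCond %{HTTPS} on
RewriteRule !^(checkout|account)/ http://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```

Bear in mind that with either version, images, stylesheets and scripts requested over https from outside the secure paths will also be redirected to http, so secure pages may need their assets served from within those paths.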

Directory Index Canonicalization

Apache allows us to specify, via the DirectoryIndex directive, a list of files to be served when a directory is requested (i.e. a URL ending with a slash, such as http://www.example.com/). Typically, index.html or index.php is used. However, direct access to these files is not prevented, and careless creation of links can easily lead to http://www.example.com/ and http://www.example.com/index.php appearing in search results separately.

The following rule ensures that any requests ending in /index.php will be redirected to the parent directory.

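# match the original request line rather than the rewritten URL
# (avoids an infinite loop when Apache internally maps / to index.php)
# the (.*/)? group ensures files such as myindex.php are left alone
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*/)?index\.php\ HTTP/
RewriteRule ^(.*/)?index\.php$ http://%{HTTP_HOST}/$1 [R=301,L]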
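
The same idea can be extended to other index filenames. The following sketch assumes the site may also expose index.html or index.htm as directory indexes; adjust the alternation to match whatever index files your own server is configured to serve.

```apache
# redirect direct requests for index.php, index.html or index.htm
# to the parent directory, matching the original request line only
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*/)?index\.(php|html?)\ HTTP/
RewriteRule ^(.*/)?index\.(php|html?)$ http://%{HTTP_HOST}/$1 [R=301,L]
```

Because the condition requires the filename to be followed immediately by the protocol, requests that carry a query string (such as /index.php?id=1) are not redirected.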

5 Responses to “Three Canonicalization Problems Fixed with .htaccess”

  1. Andrea Moro says:

    I’m not an expert on .htaccess files, but it looks like the second chunk of snippets doesn’t match the description provided.

  2. David Streater says:

    How does the canonicalization tag fit into this? And is it ok to use one?

  3. Andrew Mabbott says:

    Good question, David.

    My preference would always be to enforce a single URL via redirects wherever possible as this may provide a slightly better experience for users than the canonical tag and it also helps to ensure that all inbound links point to the same URL.

    I think the canonical tag is more useful in a situation where multiple URLs are required to effect a slight variation in content (such as ?sort=asc vs. ?sort=desc) but only one version of the page needs to be indexed by search engines.

    Of course, the canonical tag could be considered for any of the above scenarios if a redirect is not possible (mod_rewrite not enabled, non-Apache server etc.)

  4. andrew says:

    Hi Andrea,
    Thanks for the feedback. Could you elaborate a little more on this? Is it the first scenario (htaccess code to enforce a single domain) which has a problem?

  5. Paul says:

    Hi all, I found this really informative.

    I really like the way Apache works. However, I have a friend’s website that’s on both http and https. Should I be 301 redirecting the https to http? And if so, does anyone know the correct approach for doing this on IIS?
