
How to resolve http vs https duplicate content issues

Because they are served over different ports, the http:// and https:// versions of your content are treated as two different websites by search engines. To prevent the duplicate content issues that can impede your SEO, here are some techniques for ensuring only one version of your content is cached.

SSL certificates are of course commonplace, but if you don’t take any measures to prevent both versions from being cached, then you may end up seeing something like this in the SERPs:

[Screenshot: both http and https versions of the same pages appearing in Google’s search listings]

This dummy screenshot illustrates how Google can crawl, cache and display a mixture of duplicated secure and non-secure content. To prevent these duplications, there are a few different ways to nominate a single, consistent format.

Option 1. The Canonical Link Element

In the first instance, I’d recommend utilising the Canonical Link Element. This handy link element is simply added within the <head> section of your site’s pages and looks a little like this:

<link rel="canonical" href="http://www.domain.com/star-wars/toys/cut-outs/luke-skywalker.html"/>

Supported and recommended by Google, Yahoo and Bing (sort of), the Canonical Link Element allows webmasters to declare the preferred or ‘canonical’ location for their content.

So if we refer back to our screenshot example from earlier, adding a Canonical Link Element that points to the http:// URL on our https:// content will ensure that only the http version is used within the search listings. To learn more about the Canonical Link Element, Matt Cutts explains it here, back in 2009 when it was introduced.
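If your pages are generated with PHP, you can build a canonical tag that always nominates the http:// version from the current request. This is just a minimal sketch, assuming $_SERVER['HTTP_HOST'] and $_SERVER['REQUEST_URI'] reflect your public URLs (adjust if you sit behind a proxy or use rewritten URLs):

<?php
// Minimal sketch: always nominate the http:// version of the
// current page as canonical, whatever scheme it was requested over.
$path = strtok($_SERVER['REQUEST_URI'], '?'); // strip any query string
$canonical = 'http://' . $_SERVER['HTTP_HOST'] . $path;
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '"/>' . "\n";
?>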

The Canonical Link Element & WordPress

If your website is built using WordPress, then it’s worth noting that Yoast’s fabulous SEO for WordPress plugin provides everything you need to deploy the Canonical Link Element:

[Screenshot: the canonical settings within Yoast’s SEO for WordPress plugin]

Via the ‘Permalinks’ Settings tab you can choose between http and https if you have the website running under both.
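If you need finer-grained control, Yoast’s plugin also exposes a wpseo_canonical filter for the URL it outputs. As a hedged sketch (verify the filter against your installed version of the plugin), you could force the http:// scheme from your theme’s functions.php:

<?php
// Sketch: rewrite Yoast's canonical URL to always use http://
add_filter( 'wpseo_canonical', function ( $canonical ) {
    return preg_replace( '#^https://#i', 'http://', $canonical );
} );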

The Canonical Link Element within Magento

Since Magento 1.4, the Canonical Link Element feature has been built in as standard. It was added quite quickly by Magento’s developers to tackle existing duplicate content issues caused by products being categorised and accessed via many different URLs.

[Screenshot: Magento’s canonical link meta tag settings]

It’s worth noting, however, that the canonical tag will only appear on category and product pages by default. Despite this, checking both boxes will prevent the key canonicalisation issues faced with secure Magento websites.
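If you’d rather confirm these settings programmatically, here’s a hedged sketch using Magento 1.x’s store-config helper. The config paths below match the 1.4+ ‘Search Engine Optimizations’ options, but treat them as assumptions to verify against your version:

<?php
// Hedged sketch for Magento 1.x: read the canonical tag settings.
// Config paths are assumptions based on the 1.4+ admin options.
require_once 'app/Mage.php';
Mage::app();

$categoryCanonical = Mage::getStoreConfigFlag('catalog/seo/category_canonical_tag');
$productCanonical  = Mage::getStoreConfigFlag('catalog/seo/product_canonical_tag');

echo 'Category canonical tag: ' . ($categoryCanonical ? 'enabled' : 'disabled') . "\n";
echo 'Product canonical tag: ' . ($productCanonical ? 'enabled' : 'disabled') . "\n";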

Option 2. Serve a different Robots.txt for https

Whilst the Canonical Link Element is the best solution, it isn’t always practical to implement. If your site is bespoke, very large or built on a custom CMS, then implementing a site-wide canonical tag might be too complicated a task.

As I mentioned earlier, the http and https versions of your website are treated as two separate sites by the search engines. It is therefore possible to disallow the bots from crawling the https version of your website while leaving the http version fully crawlable. To do this, we can use .htaccess to serve two different robots.txt files: one for the secure https site, and one for the regular non-secure http site. Save your regular robots.txt file for the non-secure site as robots.txt, but save your disallowed secure robots.txt file as robots_ssl.txt, as in the screenshot below:

[Screenshot: robots.txt and robots_ssl.txt side by side in the site root]
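The contents of robots_ssl.txt simply disallow all crawling, for example:

# robots_ssl.txt – served in place of robots.txt on the https site only
User-agent: *
Disallow: /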

So now you should have two different robots.txt files in the root of your site: the regular ‘allowed’ version for your non-secure http site, and a second robots file saved as robots_ssl.txt that will be served for the secure site. Next, add the following directives to your site’s .htaccess file:

RewriteEngine on
# If the request arrived on port 443 (the https site)...
RewriteCond %{SERVER_PORT} ^443$
# ...serve robots_ssl.txt whenever robots.txt is requested
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

This .htaccess rewrite will serve ‘robots_ssl.txt’ as ‘robots.txt’ whenever the user/bot accesses the file via port 443 (the secure https version of the website). Once in place, try loading the robots.txt file via both http and https. You should hopefully see something like this:

[Screenshot: the secure vs non-secure robots.txt responses]
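You can also check the rewrite from the command line with curl, substituting your own domain for www.domain.com:

curl http://www.domain.com/robots.txt
curl https://www.domain.com/robots.txt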

It’s worth noting, however, that blocking the bots from your https website won’t remove any already-cached https listings; it will merely prevent them from being crawled again. This makes it a good preventative solution, but if you’re looking to remove existing duplicates from the listings then consider the Meta Robots Tag in Option 3 below:

Option 3. The Meta Robots Tag

If neither the Canonical Link Element nor the robots.txt solution is viable, then there is one final method you can deploy: the Meta Robots Tag. Much like the .htaccess code in Option 2, PHP can be used to detect whether the content is being served via https. Add the following code within the <head> section of your pages:

<?php
// Output a Meta Robots tag only when the page is served over https
if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
    echo '<meta name="robots" content="noindex,follow" />' . "\n";
}
?>

With this in place, the Meta Robots Tag will be output on each page of the secure website only. The ‘noindex,follow’ value tells search engines that they may crawl the page and follow its links, but must not index or cache it.
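To confirm the tag only appears on the secure version, fetch a page over each protocol and inspect the source (again, substituting your own domain):

curl -s http://www.domain.com/ | grep -i 'name="robots"'
curl -s https://www.domain.com/ | grep -i 'name="robots"'

Only the second, https request should return the noindex tag.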

Conclusion

[Image: a slide from an old Matt Cutts presentation about the Canonical Link Element]

Whilst mild duplicate content issues can arise with site-wide SSL certificates, they can be easily prevented and overcome by using the Canonical Link Element effectively.

Whether you’re using an established framework like Magento or WordPress, or working with a bespoke build, the Canonical Link Element is the most effective and Google-recommended solution.

However, the other two options above can be equally effective, although they may take longer to take effect if you’re fixing duplicate content issues rather than preventing them. I’d love to hear anyone’s questions, comments or contributions in the comments below.

  • Max

The problem with using canonical tags on https pages referencing http is that you will no longer get the secure green padlock symbol in the browser, as you are calling non-secure elements – just as has happened on this very page.

    • http://www.creare.co.uk/ James Bavington

Hi Max, thanks for highlighting the bug with our SSL certificate not validating. We’ve now corrected that; our contact form was causing a site-wide issue. I don’t believe, however, that the canonical tag can cause an issue like this. The reference within the tag isn’t actually called, it’s just linked – so the page doesn’t actually depend on it.

      If you view this page securely at https:// and view the source, you’ll see we have the http:// in the canonical tag and our Padlock is (after being fixed a moment ago) green and secure.

      James.

    • Tom Atkinson

      Canonical tag is the best and only solution. Redirects aren’t good because they are inflexible. Robots isn’t good because it blocks Googlebot and kills any link juice you had going to the non-canonical version. Canonical consolidates your rankings nicely.

  • http://www.seojeek.com/ Alex Vallejo

    If I switched from http to https, do I need to update all my backlinks? Or does this canonical tag handle that? Or would doing a 301 redirect on my .htaccess handle that?

    • http://www.creare.co.uk/ James Bavington

      Hi Alex, you should be fine, but if you don’t need the https – it wouldn’t hurt at all to redirect it.

  • http://greencoffeebeanmaxreviews.co.uk/ Ali Raza

I have a question that no one has been able to answer for me so far; perhaps you can help. I have https on my site and now I am unable to use the FeedBurner signup form. Is there any way to exclude the feed page from https?

  • seseskiz

OK, how do we test to see if this works? Is there a tool that can scan and test whether a site has http or https problems?

  • Harve

    We’ve just discovered that google is showing both http and https results. I’ve got a couple of questions:

    1. Should we use the canonical link element to point to http or https? My understanding is that there is some small SEO benefit to be gained by using https, however there are also drawbacks such as slower page loading time etc (already a bit of an issue for us). The majority of the website doesn’t need to be secure, but people log in to our app via the same domain and it is very important that is secure.

    2. If we opt for http how can we keep the public facing website as http, but the secure login as https? We are using Yoast – if we force http I assume it will do that across the entire site, including the login?

  • http://www.wordsuccor.com/ Emily Johns

    Great info! Thanks so much for putting this together AND making it easy to understand. :)

  • Fred Smith

In Option 2, you state that the robots.txt file should allow HTTP and disallow HTTPS. Why would I want to disallow bots on HTTPS and not on HTTP, when I thought having an HTTPS site is better?