Magento SEO – Category Filters and Duplicate Content Issues

Update: We’ve also rolled this functionality into our Creare SEO Magento extension, which is available for free on Magento Connect and is also featured in the book “Magento Search Engine Optimisation” – win a free copy.

A common problem on many websites, Magento included, is the duplication of parameter-based pages. Within Magento this most commonly occurs on layered navigation category pages – when an attribute filter is active.

Google (and other search engines) tend to treat parameter-wielding URLs as separate pages – mainly due to the vast number of websites that still use parameters to serve all of their standard web pages (domain.com/?pageid=753 look familiar?).

As the content of our category pages differs only very slightly when these filters are active, we don’t really want Google to cache them as separate pages – especially if we have a lot of text content in our category description. Duplication of that description could lead to duplicate content penalties being imposed on the category page, or on the website in general.

There are a few ways to combat this problem – some work really well, some don’t. I’m a personal fan of using every trick in the book to get the desired result – and I’d urge you to do the same – mainly because a few of these techniques are inconsistent at persuading Google to de-index (or not index in the first place) these “duplicate pages”.

The main tricks tend to be:

  • Canonical Tag
  • Robots.txt file
  • Google Webmaster Tools URL Parameters
  • Meta Robots NOINDEX

Here’s a breakdown of the positives and negatives of each of these techniques.

Canonical Tag

The canonical tag is already built into later versions of Magento (I believe 1.4 onwards?) and you can find the settings to turn it on within System > Configuration > Catalog > Search Engine Optimisation.

[Screenshot: canonical tag settings in the Magento admin]

What this feature will do is place a canonical tag within your <head> element. Canonical tags basically tell Google where the “master” version of a page is – and to ignore the current page if it’s different to its canonical URL.

How does this affect our category filter pages? Well, on all of our category filter pages (category URLs with ?cat=3 appended, for example) this canonical tag should be present – telling Google that this filter page is really just a copy of the master category, and not to penalise the page or the site because of it.
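
For illustration, on a filtered URL such as domain.com/shoes.html?cat=3 (URLs hypothetical), the tag points back at the master category and looks something like this:

<link rel="canonical" href="http://domain.com/shoes.html" />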

The main issue I have found with the canonical tag is that it doesn’t work consistently – especially for categories. Product pages, I have found, work extremely well with the canonical tag (usually persuading Google to cache the product URL as domain.com/product-url.html rather than the longer category-path version), but even with the same functionality enabled on category pages I still receive duplicate listings in the SERPs – possibly because of those parameters.

Robots.txt file

The robots.txt file has pretty much been around since search engines were invented and is one of the most useful text files you’ll find on any website.

The main purpose of the robots.txt file is to let search engines know which pages they can access and which areas they can’t.

Within a robots.txt file you can tell search engines to ignore our parameter-based pages simply by adding the following line:

Disallow: /*?*

This will disallow any URL with a question mark anywhere in it. Obviously this is useful for our category filters problem – but if you want to use it you really need to make sure that none of the vital pages you want cached by the search engines rely on parameters.
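
As a minimal sketch, a robots.txt using this technique – with a hypothetical Allow exception for a parameter page you do want crawled – might look like:

User-agent: *
# Hypothetical exception: a vital parameter-based page we still want crawled
Allow: /some-page.html?view=full
# Block every other URL containing a question mark
Disallow: /*?*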

The main problem I have found with robots.txt files is that they only disallow search engines from VISITING the page again. They don’t ask the search engine to de-index your ‘already cached’ duplicate pages / remove them from its database – at least not immediately. So really you should use the robots.txt technique as soon as you launch your website – otherwise you’ll still end up with duplicate pages in the SERPs for some time.

Google Webmaster Tools URL Parameters

Built into Google WMT is a feature called URL Parameters that supposedly allows you to specify what your parameters are doing to a page. It looks like it’s engineered toward helping site owners alleviate a few of the problems caused by parameters that are simply adapting the content of the page to help with usability.

You can find this tool within your WMT dashboard > Crawl > URL Parameters.

There’s a warning within Google WMT notifying users that configuring the wrong parameters could result in many pages being removed from its search results – meaning that, unlike robots.txt, this technique may actually help to remove those duplicate pages.

I’d recommend (if you wish to go down this route) simply adding the primary attribute codes that you have set up in your layered navigation. For instance – the following URL parameters have worked for our website:

[Screenshot: URL Parameters configured in Google Webmaster Tools]

As you can see – there are a lot of URLs featuring those parameters – and our website doesn’t even boast an extensive product catalogue!
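
For reference, the sort of configuration that suits layered navigation attributes looks something like the following (the attribute codes here are hypothetical – use whichever you’ve defined in your own store):

Parameter: cat           Effect: Narrows   Crawl: No URLs
Parameter: color         Effect: Narrows   Crawl: No URLs
Parameter: manufacturer  Effect: Narrows   Crawl: No URLs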

Meta Robots NOINDEX

Every Magento developer will be aware that Magento comes with its own Meta Robots tag – when a site is live it normally looks like this:

<meta name="robots" content="INDEX,FOLLOW" />

When a website is in development this is normally set to NOINDEX,NOFOLLOW – basically preventing Google from indexing the website should it happen across it.
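
So the tag output on a development site would read:

<meta name="robots" content="NOINDEX,NOFOLLOW" />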

I’m a big fan of the robots meta tag – however, I must confess that I do not believe it to be 100% bulletproof – all too often I’ve seen Magento development domains being cached by Google, simply due to the robots.txt file being missing.

That said, the NOINDEX value has certainly been used to great effect in the SEO world for telling Google to de-index pages – and it works inside a Magento installation too.

Normally, to implement the NOINDEX tag you would do it within the Design tab of a category, page or product.

For instance, if I wanted to NOINDEX a category I’d go ahead and add the following code into the Custom Design tab (Custom Layout Update XML):

<reference name="head">
   <action method="setRobots">
       <value>noindex,follow</value>
   </action>
</reference>

This will replace the site-wide meta robots tag just for this category – HOWEVER, is this really what we want? I don’t think so – we still want our category to be indexed; we only want this tag to be replaced when our category filters are active.

The above XML adjustment is useful, though, for de-indexing/blocking specific products and pages that you want to remain active but non-indexable by search engines. Again, blocking via robots.txt is always recommended – but if the page has already been cached, implement this code as well.

Adding Meta Robots NOINDEX for Parameter-wielding Categories

So how can we implement this NOINDEX value only when a category filter is active? Well, the answer is by using an Observer and a little bit of PHP.

This implementation will form part of our Magento SEO extension, due later this year, but for now here are the important parts of the code.

To create this as an extension you simply need to create your module declaration file, a config.xml and an Observer.php, all within the appropriate folders (for a more detailed explanation of this please see this post).
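
As a minimal sketch – assuming a hypothetical module named Mycompany_Noindex – the declaration file at app/etc/modules/Mycompany_Noindex.xml would look like this:

<?xml version="1.0"?>
<config>
    <modules>
        <Mycompany_Noindex>
            <active>true</active>
            <codePool>local</codePool>
        </Mycompany_Noindex>
    </modules>
</config>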

Within the config.xml we need to register our observer:

<frontend>
    ......
    <events>
        <controller_action_layout_generate_xml_before>
            <observers>
                <noindex>
                    <type>singleton</type>
                    <class>noindex/observer</class><!-- replace with your module's alias -->
                    <method>changeRobots</method>
                </noindex>
            </observers>
        </controller_action_layout_generate_xml_before>
    </events>
    ......
</frontend>
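
Note that for the noindex/observer class alias above to resolve, the module’s models also need registering in config.xml – a sketch, again assuming the hypothetical Mycompany_Noindex module:

<global>
    <models>
        <noindex>
            <class>Mycompany_Noindex_Model</class>
        </noindex>
    </models>
</global>

With that in place, Magento will map noindex/observer to the class Mycompany_Noindex_Model_Observer in our Observer.php.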

In our Observer.php we need to add the following code:

public function changeRobots($observer)
{
	$action = $observer->getEvent()->getAction();
	// Only target category view pages
	if ($action->getFullActionName() == 'catalog_category_view') {
		$uri = $action->getRequest()->getRequestUri();
		// A ? in the URI means a filter (or other parameter) is active
		if (strpos($uri, '?') !== false) {
			$layout = $observer->getEvent()->getLayout();
			// Inject a layout update that swaps the head block's robots value
			$layout->getUpdate()->addUpdate('<reference name="head"><action method="setRobots"><value>NOINDEX,FOLLOW</value></action></reference>');
			// Regenerate the layout XML so the update is picked up
			$layout->generateXml();
		}
	}
	return $this;
}

What the above does is create an observer that checks whether we are on a category view page. If we are, it then checks whether there is a ? in the URL. If there is, we inject our XML layout changes directly into the page – swapping the meta robots value to NOINDEX,FOLLOW.
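
As an alternative sketch (not the approach our extension takes): if you hook the later controller_action_layout_generate_blocks_after event instead, the head block already exists, so you can set the value on it directly rather than injecting layout XML:

public function changeRobots($observer)
{
	$action = $observer->getEvent()->getAction();
	if ($action->getFullActionName() == 'catalog_category_view'
		&& strpos($action->getRequest()->getRequestUri(), '?') !== false) {
		// Blocks have been generated by this point, so the head block is available
		$head = $observer->getEvent()->getLayout()->getBlock('head');
		if ($head) {
			$head->setRobots('NOINDEX,FOLLOW');
		}
	}
	return $this;
}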

Conclusion & Findings

It’s hard to pin down a particular technique to a particular result – especially when search engine algorithms change all the time, as does website content. The best example I can give you is a short graph showing our index status over a period of months from when we launched our new website.

Please bear in mind that creare.co.uk is not the largest website in the world – only around 500 pages. Have a look and see how many pages Google believed we had – all due to duplicate pages via parameters!

[Graph: Google index status for creare.co.uk over several months]

You can definitely see the dramatic effect our changes (all of the techniques above) have had on our page index count with Google – we now tend to have only one version of our main pages in the SERPs (though to be honest there are still a few waiting to be removed) rather than hundreds of copies.

Some important points to take away if you’re planning on implementing any of the above:

  1. Be very careful – try not to get your main web pages de-indexed!
  2. Use more than one technique at the same time – never rely on a single method to solve all issues
  3. Always use the canonical tag and robots.txt file, and make sure you do these first
  4. Use the meta NOINDEX method if you’re already suffering from duplicate listings in Google
  5. If you need to use the NOINDEX method – make sure you temporarily stop your robots.txt from blocking the same pages, otherwise Google will never revisit them and see your new NOINDEX tag (see the sketch after this list)
  6. Give the URL Parameters in Google WMT a go – and let me know if you see any notable changes!
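
On point 5, a minimal sketch of that temporary robots.txt change:

# Temporarily disabled so Google can re-crawl the filter pages and see the NOINDEX tag
# Disallow: /*?*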

Other than that, I think that about covers it. Thanks for visiting – if you have any questions please leave them below and I’ll do my best to answer them. If you enjoyed this post on Magento duplicate content issues you may like James’ post on HTTP vs HTTPS duplicate content troubles.

  • Hugo

    Hi, thanks for the excellent information on NOINDEX for parameter-wielding categories! I have implemented this site-wide for my Magento store, so that any URL with a ? in it is NOINDEXed (I removed the if statement with the catalog_category_view check). This works well up until the point I enable my FPC (Full Page Cache) – at that moment I get a PHP error telling me the headers are already sent. Is there a solution to this?

    Thanks again on the info!

    • Robert Kent

      Hi Hugo, Are you using Magento EE FPC or an extension? If it’s an extension please let me know which and I’ll have a look at it to see if there is a conflict of any kind.

      Caching could be an issue with this depending on the severity – as the observer is really just adjusting the XML stored in configuration, if FPC bypasses the usual method then we may need to accommodate that.

      Saying that, the “Headers already sent” error normally occurs when attempting to redirect a page after data has already been output to the browser. Have you got an error log I could have a look at to see the trigger for this error?

      • Hugo

        Hi Robert,

        I am using Magento CE, so the FPC I am using is an extension. It is the LestiFPC extension found here: http://gordonlesti.com/lestifpc/

        I think the situation you describe, about FPC bypassing the usual method, is indeed what happens. This is the part of the error log I think you need, right? If not let me know!

        2013-11-01T12:00:26+00:00 DEBUG (7): HEADERS ALREADY SENT: [0] /home/gitaar1b/public_html/somedomain.com/app/code/core/Mage/Core/Controller/Response/Http.php:52
        [1] /home/gitaar1b/public_html/somedomain.com/lib/Zend/Controller/Response/Abstract.php:766
        [2] /home/gitaar1b/public_html/somedomain.com/app/code/core/Mage/Core/Controller/Response/Http.php:83
        [3] /home/gitaar1b/public_html/somedomain.com/app/code/core/Mage/Core/Controller/Varien/Front.php:188
        [4] /home/gitaar1b/public_html/somedomain.com/app/code/core/Mage/Core/Model/App.php:354
        [5] /home/gitaar1b/public_html/somedomain.com/app/Mage.php:683
        [6] /home/gitaar1b/public_html/somedomain.com/index.php:87

        Thanks again!

  • Annoyed

    robots.txt? – The absolute Last Resort!
    Google have been advising Against using this sort of approach for Years (and years!).
    The use of robots.txt includes the following issues;
    1) The risk of screwing up the site crawl
    2) Can impact PR Flow through your site (and thus rankings!)
    3) Errors could result in desired pages not showing in the Index (and messing with rankings/conversions)

    Instead, in order of preference;
    1) Fix at Root (alter Magento code to handle it properly and deploy Canonical Link Elements (stripping variant parameters and values, as well as ordering params to be in same/set order!))
    2) Use Google Webmaster Tools and the Parameter Handling tool (designed Specifically for this sort of problem!)
    3) Use NoIndex/Follow (technically you only need to insert “NoIndex” – Follow is the default behaviour)

    …and the most important option;
    4) Bitch at the crud developers that Still do Not get the grasp of Canonical, despite it being beaten around the field for more than 4 years!
    There is No excuse for these heavily developed systems (or new systems) to launch with these well established problems.
    Things like pagination and root/ + root/?p=1 should Not happen at all.
    ONLY if enough people complain will the Devs take the hint and start handling things properly.

    • Robert Kent

      Hi ‘Annoyed’,
      Some very good points there and I can understand your frustration with the core code. Hopefully Magento 2 will be implementing some of these fixes into the framework. Until then (or even after) we must do our best to fix what needs fixing for the benefit of the community.

      The Magento developers’ main aim is to get a framework up and running that performs as an ecommerce store, and they have done that extremely well. I suppose it’s the job of the community to provide feedback, as we have done – especially in relation to SEO – and for Magento to take note of these issues and address them wherever possible in future releases.

      I do agree however that these major issues have been around for a very long time now.

  • Amar

    Hi

    Great article. I’m having a problem with one of my sites where nearly 70,000 pages have been indexed. The site had a revamp last year but was initially built by a different development team.

    I’ve implemented a robots.txt file for https as well as http. Google has indexed the https version of the site, so I’m hoping the robots.txt file will fix this issue. Also, the majority of the duplicate content is from catalog search, which I have disallowed in the robots.txt file, and I have requested a dozen or so URLs with catalogsearch in them to be removed.

    How long can I expect changes to show in the index?

    The site in question is http://www.velaclothing.co.uk

    Any kind of advice would be much appreciated

    Regards

    • Robert Kent

      Hi Amar,

      I think your first priority should be to sort out your Meta Titles and Meta Descriptions – looking at the index in Google (screenshot). Our CreareSEO extension should help you with these.

      The robots.txt may take weeks to get the duplicate pages removed – unless you follow the noindex method and remove the restriction from your robots.txt. Again, the CreareSEO extension will help you with this on filtered category pages.

      For catalog search pages, though, you might be able to manually add the following to your layout XML. Inside the catalogsearch_result_index handle (in your theme’s local.xml, for example) add:

      <reference name="head">
          <action method="setRobots">
              <value>noindex,follow</value>
          </action>
      </reference>

      This is untested but I think it should work ok – again you may want to remove the robots.txt restriction so that Google can re-crawl these pages and discover the noindex tag.

      • Amar

        Hi Robert

        Thanks for the advice, much appreciated – I have applied this to the catalogsearch pages and removed the entry from robots.txt.

        I have also applied the tags to the advanced search and popular search terms pages, and these appear to be working.
        I have checked the search and filter pages and this appears to be working too when viewing the page source. How long until I can see some progress on the duplicate content?

        Also, I have installed the CreareSEO extension and all seems good – I will get onto sorting the Meta Titles and Meta Descriptions next.

        Would you recommend requesting removal of the /catalogsearch/ directory or just the URL in WMT?

        Is there a guide on implementing the CreareSEO extension to its best?

        Many Thanks

        • Robert Kent

          Hi Amar,

          It shouldn’t matter either way – I’d recommend letting it run for a week and seeing if your WMT errors / indexed page count go down (this should start the next time those pages are crawled).

          We have a video on our CreareSEO page for configuring the extension (http://www.creare.co.uk/creare-seo-magento-extension) but for general SEO tips and tricks for Magento sites I can recommend my book – http://www.amazon.co.uk/Magento-Search-Engine-Optimization-Robert/dp/1783288574/ :)

          • Amar

            OK great, well I’m playing the waiting game now. Checking in WMT it appears as though there’s been a small increase in the number of pages indexed since applying the Creare SEO extension… maybe I need to wait a week or so to see any progress. Also, when I type Vela Clothing into Google it brings up the https version of my homepage without a description (I have set a separate robots.txt for https to disallow crawling) – shouldn’t my http version be appearing instead by now? The site was last crawled on the 9th according to indexed pages in WMT; the noindex was applied a few days prior to this. I will be looking into getting your book for further reading…

          • Amar

            Google is showing 113,000 pages as being indexed for site:www.velaclothing.co.uk

            What else could possibly be causing this??

            Thank you for your help

          • Robert Kent

            Hi Amar, It’s still showing as 78,800 for me on google.co.uk. As for the HTTPS issue – I’d recommend 301 redirecting your homepage to the non-https version using something like this (at the bottom of your .htaccess):

            #redirect just homepage to http
            RewriteCond %{HTTPS} on
            RewriteRule ^$ http://www.velaclothing.co.uk [R=301,L]

            Getting your homepage indexed on the correct URL will set you off in the right direction for getting the rest of your website indexed correctly.

            A waiting game would be best – they tend to update the index status graph every week in WMT, so you may need to wait 1-3 weeks for a recognisable shift.

          • Amar

            Hi Rob

            Hope you’re well. Finally the duplicate content appears to be receding. Down to 700 indexed pages from 70,000 plus…

            Will definitely be recommending your extension to others.

            I’m having a hitch on one of our client sites with upgrading the extension – Magento Connect just appears to be hanging and not showing updates. I think it may have something to do with a rule in the .htaccess file… any suggestions, as I don’t seem to remember the rule I excluded?

            Great work

          • Robert Kent

            Hi Amar, glad to see it’s working for you – yeah, unfortunately it just takes a little time to get those indexed pages cleared out. Magento Connect does sometimes have an issue with an “index.php” rewrite rule (the rule that forces the removal of index.php from URLs) – the downloader requires index.php to be able to fetch the Magento Connect files. It normally looks like this:

            RewriteCond %{THE_REQUEST} ^.*/index.php
            RewriteRule ^(.*)index.php$ http://www.yourdomain.com/$1 [R=301,L]

  • allen

    How do I remove the layered navigation category’s ?cat=234 parameter?

  • shirtsofholland

    Did you turn this into an extension? And was checking for the ? the best solution?

  • Fabstrands

    Hello Rob Kent,
    Have you worked out the extension described in “Adding Meta Robots NOINDEX for Parameter-wielding Categories”?
    We have faced this problem, and we are using the amshopby extension.
    The category filtered result pages now have a “shopby” segment which we set in the admin.
    We want to set NOINDEX on each of these result pages.
    I’d appreciate your kind reply.
    Thank you!

  • David Gossage

    I would like to disable the noindex tag just for paginated pages, as this will help the crawling of these pages and add relevance to the internal links. Is this possible with the Creare plugin?

    If the rel=next/prev is present and the canonical tags are set up correctly (which they are not) then noindexing these pages should not be necessary.

  • shirtsofholland

    Is looking for a ? the smartest thing to do? Maybe it’s better to count the number of active filters:

    $_head = $this->getLayout()->getBlock('head');
    $_filters = $this->getLayer()->getState()->getFilters();
    if ($_head && $_filters) {
        $_head->setRobots('NOINDEX,FOLLOW');
    }

  • Robert Kent

    Hi Cameron,

    I see the issue – your ajax filter appends the selection after a # in your URL. In this case it might be possible to swap the ‘?’ in the code mentioned above for a ‘#’. Then, when the search engines visit this page via their indexed link, they should be presented with the noindex meta robots tag.

    The meta robots tag is usually the best method to strictly inform search engines of the need to remove these pages (unless you manually perform a URL removal through WMT).

    If you do the above you may want to comment out the robots.txt line first, so that search engines revisit the affected pages and see the new meta tag.

    Let me know if the above works out for you or if you have any problems with it.

  • Cameron

    Hi Robert,

    Thank you for your reply! I completely understand. I attempted to make the meta robots tag change as per your instructions but it’s not doing anything. I changed ‘?’ to ‘#’. The files are located at code/local/seo/etc/config.xml and code/local/seo/Model/observer.php, and I changed the class in config.xml to seo/observer.

    I don’t know much about writing code, but I can read it and everything makes sense to me, so I don’t understand why it isn’t working. I’ve flushed the Magento cache and cache storage as well as my browser cache. Any ideas?

    Attached are screenshots of my files.

  • Robert Kent

    Hi Cameron,

    Yes the config.xml needs a bit of work – I’ve emailed you a request to get in there and finish it off for you. Either that or I can send you a .zip of a complete module file.

  • Cristina

    Hi, I am having trouble getting this to work as well and am desperate for a solution! I have implemented all of your suggestions; I just can’t get the code sorted out for NOINDEX on filtered pages. Could you please send me the zip module as well? Thank you!