Magento, Google Merchant Centre & Robots.txt

Recently we’ve been seeing a sharp increase in the number of Google Merchant Centre (Google Shopping) feeds being disallowed due to the following error: "Product pages cannot be crawled because of robots.txt restriction".

Google Merchant Centre Error

The advice given by Google was to place some code at the end of our robots.txt file (we are using our Magento boilerplate robots.txt) that essentially negates the effect of everything above it.

The problem is that we have modified our robots.txt for a reason, and there are areas that we wish to block search engine access to – therefore we don’t really want to add this code.

Unfortunately, Google wouldn’t tell us which lines in our robots.txt file might be causing the restriction, so Tom Nolan (PPC), Ashley Mason (SEO) and I went to work on figuring out what could be causing the issue.

After a few days of testing different combinations of rules in our robots.txt file, we came up with the following diagnosis (with hindsight we should have spotted this straight away):

The Magento product URLs within our feed contained our session ID, e.g.:


http://www.mymagento.com/my-product.html?SID=adsfasdfaudifydf338asdfasdf

Our robots.txt contained the following line:


Disallow: /*?SID=

Therefore we were submitting product URLs to Google Merchant Centre that we had explicitly blocked in our robots.txt file. Google were right.
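To see why that rule catches the feed URL, here is a minimal PHP sketch of robots.txt wildcard matching as Google documents it: * matches any run of characters, a trailing $ anchors the end, and rules are compared from the start of the path with the query string included. The function names robotsPatternMatches and pathAndQuery are our own illustrations, not part of Magento or any Google tool, and the sketch ignores Allow rules and rule precedence.

```php
<?php
// Illustrative only: tests a single Disallow pattern against a path.
// Assumes Google-style wildcards ('*' and a trailing '$').
function robotsPatternMatches(string $pattern, string $pathAndQuery): bool
{
    $anchoredEnd = substr($pattern, -1) === '$';
    if ($anchoredEnd) {
        $pattern = substr($pattern, 0, -1);
    }
    // Escape regex metacharacters, then turn the escaped '*' back into '.*'.
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    $regex = '#^' . $regex . ($anchoredEnd ? '$' : '') . '#';
    return preg_match($regex, $pathAndQuery) === 1;
}

// Reduce a full URL to the part robots.txt rules are compared against.
function pathAndQuery(string $url): string
{
    $parts = parse_url($url);
    $path = $parts['path'] ?? '/';
    return isset($parts['query']) ? $path . '?' . $parts['query'] : $path;
}

$rule = '/*?SID=';
$withSid = 'http://www.mymagento.com/my-product.html?SID=adsfasdfaudifydf338asdfasdf';
$withoutSid = 'http://www.mymagento.com/my-product.html';

var_dump(robotsPatternMatches($rule, pathAndQuery($withSid)));    // bool(true)
var_dump(robotsPatternMatches($rule, pathAndQuery($withoutSid))); // bool(false)
```

The same product page is blocked with the SID appended and crawlable without it, which is exactly the difference between the URLs in our feed and the URLs we wanted Google to see.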

Rather than removing this line from our robots.txt file (as we still want to disallow access to URLs containing session parameters) we decided to work on our custom feed implementation in order to remove the SID string from the end.

To do this we simply substituted:


$product->getProductUrl();

for


$product->getProductUrl(false);

The above code returns our product URL without our session ID. Just a simple tweak and our Merchant Centre feeds are now once again being accepted.

The SID line is common in most Magento robots.txt implementations as it helps protect our website against duplicate cached versions of our pages. Depending on your own Merchant Centre feed implementation you may or may not be submitting URLs with the SID appended. The combination of these two elements is what causes the problem, and the simple fix above will sort it out.
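If your feed is built by code where you cannot easily change the getProductUrl() call (a third-party export, for example), a fallback is to strip the SID parameter from each URL just before it is written to the feed. The following is only a sketch under that assumption: stripSessionId is a hypothetical helper of our own, not a Magento function, and it re-encodes any remaining query parameters.

```php
<?php
// Hypothetical helper: remove the SID query parameter from a URL string,
// keeping any other parameters and the fragment.
function stripSessionId(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // no query string, so nothing to strip
    }

    parse_str($parts['query'], $params);
    unset($params['SID']);

    // Everything before the first '?' is scheme, host and path.
    $base = explode('?', $url, 2)[0];
    $query = http_build_query($params);
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $base . ($query !== '' ? '?' . $query : '') . $fragment;
}

echo stripSessionId('http://www.mymagento.com/my-product.html?SID=adsfasdfaudifydf338asdfasdf'), "\n";
// http://www.mymagento.com/my-product.html
```

Where you do control the feed code, calling getProductUrl(false) as described above remains the cleaner fix, since the session ID is never added in the first place.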

The image below shows a Merchant Centre Feed before and after the fix:

Merchant Centre Fix