Duplicate Content Problems and Solutions

Google recently issued an official opinion regarding duplicate content in an article titled Demystifying the "duplicate content penalty." They emphasized that some techniques will not cause a penalty but may instead affect a website's performance in the Search Engine Results Pages (SERPs).

According to a post by Google made on Monday, December 18, 2006 at 2:28 PM:

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.

In a different post, Google representatives also stated that "Duplicate content doesn't cause your site to be placed in the supplemental index. Duplication may indirectly influence this however."

Indeed, there might not be a duplicate content penalty, but that does not mean that duplicate content cannot hurt your site, or that a manual reviewer will not knock it down in the rankings because of presumed duplication.

Google's clarification also has created speculation, misunderstanding and confusion for two main reasons:

  1. People are starting to believe that it is now OK to reproduce content using doorway pages and thin affiliate sites without getting any penalties. I would even venture to say that scrapers are feeling more comfortable with their illicit activities of copyright violation, when in reality this is still against the Terms of Service of Search Engines.
  2. Other issues that directly affect duplicate content were not addressed at all, furthering the myth, use and abuse of certain techniques. For example, there are still pending questions about Google's position regarding duplicate content on global sites targeting several markets that share the same language, such as the USA, UK, Canada and Australia.

In this article I will explore some common ways that people incur duplicate content and will provide alternatives, arguably the best available, to deal with those issues.

Duplicate URLs, Page Titles, Meta Descriptions and Meta Keywords

Perhaps one of the biggest problems that novice, and some experienced, webmasters face in terms of SEO is largely created by dynamic websites such as Content Management Systems (CMS), blogs and eCommerce sites.

For instance, let’s say you have an eCommerce site that has 20 products split across 2 pages. The first URL will likely be something like this:
www.yoursite.com/duplicate-content-product.html

The second URL will only append a number, page number or other characters at the end of the URL to make it unique, as in this example:
www.yoursite.com/duplicate-content-product-p2.html

Even though the URLs are technically different, from the Search Engine's perspective they are almost identical, hence increasing the likelihood of duplicate URLs.

Bear in mind that even though eCommerce sites use dynamic URLs, they can be changed to more SEO-friendly static URLs by using mod_rewrite rules.
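
As a rough sketch, a couple of rewrite rules in .htaccess could map the static-looking URLs from the example above to the underlying dynamic script (the script name product.php and the page parameter are only assumptions for illustration):

RewriteEngine On
# Serve the static-looking product URLs from the assumed dynamic script
RewriteRule ^duplicate-content-product\.html$ /product.php [L]
RewriteRule ^duplicate-content-product-p([0-9]+)\.html$ /product.php?page=$1 [L]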

Additionally, the Meta Description for the first URL will most likely be the same as the second one if no special module is implemented to separate this information.

At this point not only do we have duplicate URLs, but also duplicate Meta Descriptions and Meta Keywords if they are included. These kinds of problems are predominantly detected through notifications in tools like Google Webmaster Tools.

One practical solution is to use nofollow in pagination links and noindex/nofollow Meta tags in subsequent pages. The subsequent pages can also be blocked with robots.txt commands.
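
For instance, here is a minimal sketch of this combination, assuming the paginated URLs follow the -p2 pattern used above (adjust the patterns to your own URL structure). The pagination link gets nofollow:

<a href="/duplicate-content-product-p2.html" rel="nofollow">Page 2</a>

The head of each subsequent page gets the robots Meta tag:

<meta name="robots" content="noindex,nofollow">

And robots.txt blocks the paginated URLs as a fallback:

User-agent: *
Disallow: /duplicate-content-product-p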

Furthermore, make an effort to write different titles and descriptions and, when possible, target different keywords for each page. Don’t rely on short descriptions, because they tend to create more problems than expected. Make use of the nosnippet tag for specific cases.
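
As a hedged illustration, the head of the second page might end up looking like this, with a distinct title, a distinct description and, for those specific cases, the nosnippet directive (the wording is only an example):

<title>Duplicate Content Products - Page 2 of 2</title>
<meta name="description" content="Second page of the duplicate content product range, covering items 11 to 20.">
<meta name="robots" content="nosnippet">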

Multiple URLs Pointing to the Same Page

From 2000 through 2003, William Pugh and Monika Henzinger conducted research for Google regarding duplicate content. One of the issues covered in the PDF presentation of US Patent 6658423 was that multiple URLs like http://www.cs.umd.edu/~pugh and http://www.cs.umd.edu/users/pugh were thought to create duplicate or near-duplicate content problems.

Now that concern seems to be a thing of the past. Google, and apparently other Search Engines, are grouping all similar URLs into one cluster and then selecting what they consider to be the best URL to represent the cluster. Once that URL has been chosen, the link popularity is consolidated into it.

However, this does not mean that having multiple URLs cannot dilute link popularity, as Google representatives have stated.

Likewise, pointing to one page through several URLs can be counterproductive, because it will be up to the Search Engine to determine which of the available options is the main source. As Michael Gray stated:

When you leave it up to Google you are hoping they guess that’s what you wanted, while they do get it right in many cases, there are lots of times where they don’t.

There are a couple of options that can help. The first one is to use Theming and Siloing concepts for grouping the information. For instance, if you are developing a blog like this one for your SEO Company, you can categorize your information and try to keep each post under only one category instead of dispersing it across 5 or 6. Interlinking posts within the same category will also help with theming.

The second option is to let the Search Engine know which URL, out of the many pointing to the same page, you prefer by including it in your Sitemap.xml file. If you don't want to use your Sitemap.xml to indicate your preferred choice, the other option is to use robots.txt commands to block the non-preferred URLs. This should be combined with adding nofollow to all links pointing to those non-preferred URLs.
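
As a brief sketch, assume /preferred-page.html and /other-page.html are two hypothetical URLs pointing to the same content and the first one is the preferred choice. The Sitemap.xml would list only the preferred URL (fragment of the file shown):

<url>
  <loc>http://www.yoursite.com/preferred-page.html</loc>
</url>

While robots.txt blocks the non-preferred one:

User-agent: *
Disallow: /other-page.html

And any remaining internal links to it carry nofollow:

<a href="/other-page.html" rel="nofollow">Other page</a>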

Canonical Issues

The canonical issue has been covered all over the web. Basically, you should decide whether to use the www version of the site, as in www.SpanishSEO.org, or the non-www version (SpanishSEO.org). Once the decision is made, you can use a 301 redirect to send the non-chosen version to the preferred version.
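
For example, here is a minimal .htaccess sketch that 301-redirects the non-www version to the www version (swap yoursite.com for your own domain, and reverse the logic if you prefer the non-www version):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]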

You should also let Google know your preferred choice through the Google Webmaster Tools console.

Text Files and Print Pages

The other well-known duplicate content issue is created by PDF, MS Word, MS Excel and any other files that can be read by a crawler and are not excluded in robots.txt. The same situation applies to pages used for printing purposes, which concentrate on clean textual information rather than the website’s design elements. To take care of this problem, simply disallow crawlers from accessing those files and pages through robots.txt, or add the noindex Meta tag to those pages.
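
As an illustration, assuming the printable versions live under a /print/ path and the documents under a /files/ path (both paths are hypothetical), robots.txt could exclude them like this:

User-agent: *
Disallow: /files/
Disallow: /print/

Alternatively, the print pages could carry the noindex Meta tag in their head:

<meta name="robots" content="noindex">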

Navigational Problems

If you have several ways to reach the same page in the navigation bar, make sure to eliminate, block or restrict all unneeded navigation paths. Consider preserving link popularity by not serving several URLs that allow the bots to access your content from different angles, which creates potential circular navigation. This will ultimately confuse the bots during the discovery process while increasing your bandwidth usage.

You can use the nofollow attribute in the navigation or use JavaScript to have better control. Keep in mind that theories like "only the first link counts," even though they are still inconclusive, may also affect the weight given to links on a page.
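
A short sketch of how a redundant navigation path could carry the nofollow attribute while the main path stays clean for the bots (the URLs are only placeholders):

<a href="/seo/duplicate-content.html">Duplicate Content</a>
<a href="/latest-posts/duplicate-content.html" rel="nofollow">Duplicate Content (latest posts)</a>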

And if you are using breadcrumbs for usability purposes, a good way to improve them is by using cookies, though this has to be carefully thought out and researched to avoid other problems.

Linking Structure

Ever heard of problems caused by domains with split PR (PageRank)? This is a very common problem that webmasters fail to recognize, mainly because of inconsistent use of URL paths like SpanishSEO.org/blog and SpanishSEO.org/blog/. They are two completely different addresses.

If your links point to yoursite.com, yoursite.com/index.php or www.yoursite.com, you are basically dealing with 2 or more different sites. There are 5 things you can do to fix this:

  1. Make sure all internal links in your site are pointing to the domain selected based on your canonical preference.
  2. Use only the chosen canonical preference for all external links. Do NOT use www.yoursite.com/index.htm or yoursite.com/index.htm. If you have external links pointing to URLs other than the preferred option, try to change them one by one.
  3. If you have breadcrumbs make sure to verify all the links used as your preferred choice. People tend to forget to change the link pointing to "Home" in the breadcrumbs.
  4. Use either relative or absolute links, not both. An example of a relative link for this site is “/” pointing to the homepage. An absolute link is http://www.yoursite.com/.
  5. Add the following code to your .htaccess to redirect yoursite.com/index.html to yoursite.com for static pages.


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [NC]
RewriteRule ^index\.html?$ / [NC,R=301,L]

This code works for www or non-www domains. If you use dynamic pages, the following code may do the work. Keep in mind that depending on your setup and needs, your .htaccess might need additional customization.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*(index|home)\.(html?|php|aspx?).*\ HTTP/
RewriteRule ^(([^/]*/)*)(index|home)\.(html?|php|aspx?)$ http://www.yoursite.com/$1? [R=301,L]

Don't forget to change www.yoursite.com to your own domain name.

I think this covers some of the more severe problems that can cause duplicate content. And as said before, these are arguably the most effective ways to deal with each problem. If there is something else you consider necessary to add or change, please feel free to voice your opinion.
