Duplicate Content Problems and Solutions


Google recently issued an official opinion regarding duplicate content in an article titled Demystifying the "duplicate content penalty." They emphasized that some techniques will not cause a penalty, but may instead affect a website's performance in the Search Engine Results Pages (SERPs).

According to a post by Google made on Monday, December 18, 2006 at 2:28 PM

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.

In a different post, Google representatives also stated that "Duplicate content doesn't cause your site to be placed in the supplemental index. Duplication may indirectly influence this however."

Indeed, there might not be a duplicate content penalty, but that does not mean that duplicate content cannot hurt your site, or that a manual reviewer will not knock it off the rankings because of perceived duplication.

Google's clarification also has created speculation, misunderstanding and confusion for two main reasons:

  1. People are starting to believe that it is now OK to reproduce content using doorway pages and thin affiliate sites without incurring any penalties. I would even venture to say that scrapers are feeling more comfortable with their illicit copyright violations, when in reality this is still against the search engines' Terms of Service.
  2. Other issues that directly affect duplicate content were not addressed at all, furthering the myth, use and abuse of certain techniques. For example, there are still open questions about Google's position on duplicate content in global sites targeting several markets that share the same language, such as the USA, UK, Canada and Australia.

In this article I will explore some common ways that sites incur duplicate content and will provide alternatives, arguably the best available, to deal with those issues.

Duplicate URLs, Page Titles, Meta Descriptions and Meta Keywords

Perhaps one of the biggest SEO problems that novice, and even some experienced, webmasters face is largely created by dynamic websites such as Content Management Systems (CMS), blogs and eCommerce sites.

For instance, let's say you have an eCommerce site that has 20 products split across 2 pages. The first URL will likely be something like this:
www.yoursite.com/duplicate-content-product.html

The second URL will only append a number, page number or other characters at the end of the URL to make it unique, as in this example:
www.yoursite.com/duplicate-content-product-p2.html

Even though the URLs are technically different, from the search engine's perspective they are almost identical, increasing the likelihood of duplicate content.

Bear in mind that even though eCommerce sites use dynamic URLs, they can be changed to more SEO-friendly static URLs by using mod_rewrite.
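As a sketch, assuming an Apache server and a hypothetical dynamic URL structure such as /products.php?cat=5&page=2 (both the script name and parameters are illustrative), a mod_rewrite rule in .htaccess can serve a static-looking URL instead:

```
# .htaccess — internally map a static-looking URL to the dynamic script
# (an internal rewrite, not a redirect: visitors and crawlers only see the static URL)
RewriteEngine On
RewriteRule ^duplicate-content-product-p([0-9]+)\.html$ /products.php?cat=5&page=$1 [L,QSA]
```

Your site would then link only to the static form (e.g. /duplicate-content-product-p2.html), and the dynamic form would never be exposed to crawlers.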

Additionally, the Meta Description for the first URL will most likely be the same as the second one if no special module is implemented to differentiate this information.

At this point we not only have duplicate URLs, but also duplicate Meta Descriptions and Meta Keywords if they are included. These kinds of problems are predominantly detected through notifications in tools like Google Webmaster Tools.

One practical solution is to use nofollow on pagination links and noindex/nofollow Meta tags on subsequent pages. The subsequent pages can also be blocked with robots.txt directives.
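For instance, page 2 and beyond of a paginated series could carry a robots Meta tag, and the pagination links pointing to them could be nofollowed (the URL below is the illustrative one from the earlier example):

```
<!-- In the <head> of page 2 and subsequent pages -->
<meta name="robots" content="noindex,nofollow">

<!-- On the pagination links pointing to those pages -->
<a href="/duplicate-content-product-p2.html" rel="nofollow">2</a>
```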

Furthermore, make an effort to write different titles and descriptions and, when possible, target different keywords for each page. Don't rely on short descriptions, because they tend to create more problems than expected. Make use of the nosnippet tag for specific cases.

Multiple URLs Pointing to the Same Page

From 2000 through 2003, William Pugh and Monika Henzinger conducted research for Google regarding duplicate content. One of the issues covered in the PDF presentation of US Patent 6658423 was that multiple URLs like http://www.cs.umd.edu/~pugh and http://www.cs.umd.edu/users/pugh were thought to create duplicate or near-duplicate content problems.

Now that concern seems to be a thing of the past. Google, and apparently other search engines, group all similar URLs into one cluster and then select what they consider the best URL to represent it. Once the URL has been chosen, link popularity is consolidated on that one link.

However, this does not mean that having multiple URLs cannot dilute link popularity, as stated by Google representatives.

Likewise, pointing to one page through several URLs can be counterproductive, because it will be up to the search engine to determine which of the available options is the main source. As Michael Gray stated:

When you leave it up to Google you are hoping they guess that’s what you wanted, while they do get it right in many cases, there are lots of times where they don’t.

There are a couple of options that can help. The first is to use theming and siloing concepts for grouping the information. For instance, if you are developing a blog like this one for your SEO company, you can categorize your information and try to keep each post under only one category instead of dispersing it across 5 or 6. Interlinking posts within the same category will also help with theming.

The second option is to let the search engine know which URL, out of the many pointing to the same page, you prefer by including it in your Sitemap.xml file. If you don't want to use your Sitemap.xml to indicate your preferred choice, the alternative is to use robots.txt directives to block the non-preferred URLs. This should be combined with applying nofollow to all links pointing to the non-preferred URLs.
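A minimal Sitemap.xml listing only the preferred URL might look like this (the URL is a placeholder):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- List only the URL you want the search engine to treat as the main one -->
    <loc>http://www.yoursite.com/preferred-page.html</loc>
  </url>
</urlset>
```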

Canonical Issues

The canonical issue has been covered all over the web. Basically, you should decide to use either the www version of the site, as in www.SpanishSEO.org, or the non-www version (SpanishSEO.org). Once the decision is made, you can use a 301 redirect to send the non-chosen version to the preferred one.
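Assuming an Apache server and the www version as the preferred choice, the redirect can be done in .htaccess like this (reverse the condition and target if you prefer the non-www version):

```
RewriteEngine On
# Send any request for the non-www host to the www version with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]
```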

You should also let Google know your preferred choice through the Google Webmaster Tools console.

Text Files and Print Pages

The other well-known duplicate content issue is created by PDF, MS Word, MS Excel and any other files that can be read by a crawler and are not excluded in robots.txt. The same situation applies to print-friendly pages that concentrate on clean textual information rather than the website's design elements. To take care of this problem, simply disallow crawlers from accessing those files and pages through robots.txt, or add the noindex Meta tag to those pages.
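A robots.txt along these lines would keep crawlers away from such duplicates (the paths are illustrative; note that wildcard patterns such as *.pdf$ are an extension honored by Google rather than part of the original robots.txt standard):

```
User-agent: *
# Block print-friendly copies and downloadable duplicates of HTML pages
Disallow: /print/
Disallow: /*.pdf$
Disallow: /*.doc$
```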

Navigational Problems

If you have several ways to reach the same page in the navigation bar, make sure to eliminate, block or restrict all unneeded navigation paths. You should preserve link popularity by not serving several URLs for the same content; allowing bots to access your content from different angles creates potentially circular navigation, which will ultimately confuse them during the discovery process while increasing your bandwidth usage.

You can use the nofollow attribute in the navigation or use JavaScript to gain better control. Keep in mind that theories like "only the first link counts," even though still inconclusive, may also affect the weight given to links on a page.
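For example, if two navigation menus link to the same page, one of the links can be nofollowed so that only one crawlable path remains (the URL and anchor text are placeholders):

```
<!-- Primary navigation: normal, crawlable link -->
<a href="/services.html">Services</a>

<!-- Secondary/footer navigation to the same page: nofollowed -->
<a href="/services.html" rel="nofollow">Services</a>
```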

And if you are using breadcrumbs for usability purposes, a good way to improve them is by using cookies, though this has to be carefully thought out and researched to avoid other problems.

Linking Structure

Ever heard about problems caused by domains with split PageRank (PR)? This is a very common problem that webmasters fail to recognize, mainly because of inconsistent use of URL paths like SpanishSEO.org/blog and SpanishSEO.org/blog/. They are two completely different addresses.

If your links point to yoursite.com, yoursite.com/index.php or www.yoursite.com, you are basically dealing with 2 or more different sites. There are 5 things you can do to fix this:

  1. Make sure all internal links in your site are pointing to the domain selected based on your canonical preference.
  2. Use only the chosen canonical preference for all external links. Do NOT use www.yoursite.com/index.htm or yoursite.com/index.htm. If you have external links pointing to URLs other than the preferred option, try to change them one by one.
  3. If you have breadcrumbs make sure to verify all the links used as your preferred choice. People tend to forget to change the link pointing to "Home" in the breadcrumbs.
  4. Use either relative or absolute links, not both. An example of a relative link for this site is "/" pointing to the homepage. An absolute link is http://www.yoursite.com/.
  5. Add the following code to your .htaccess to redirect yoursite.com/index.html to yoursite.com for static pages.


RewriteEngine On
RewriteRule ^index\.html?$ / [NC,R=301,L]

This code works for both www and non-www domains. If you use dynamic pages, the following code may do the job. Keep in mind that depending on your setup and needs, your .htaccess may need additional customization.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*(index|home)\.(html?|php|aspx?).*\ HTTP/
RewriteRule ^(([^/]*/)*)(index|home)\.(html?|php|aspx?)$ http://www.yoursite.com/$1? [R=301,L]

Don't forget to change www.yoursite.com to your own domain name.

I think this covers some of the more severe problems that can cause duplicate content. And, as said before, these are arguably the most effective ways to deal with each problem. If there is something else you consider necessary to add or change, please feel free to voice your opinion.
