Practical Solutions for Blocking Spam, Bots and Scrapers


Spam has become a major problem for many of us who deal with content development on a regular basis. It takes different forms, degrades performance and compromises the security of a site.

There are bots that specialize in spamming by inserting links and unsolicited information into the input fields and forms of websites. Blog comment sections and the subscription or price-match forms of eCommerce sites are especially susceptible to these types of attacks.

Likewise, unauthorized scrapers represent a threat to intellectual property: they extract the content of a site to generate scraper sites that exist merely for advertising, most commonly AdSense. Not only do these scraper sites violate copyrights, but they may also alter the rankings of the targeted site in the Search Engine Results Page (SERP) due to duplicate content issues.

Even though Google assures that its search engine is pretty good at identifying the source of content, a website with low quality indicators may see its ranking position taken by the scraper site. Scraping hinders Search Engine Optimization efforts and costs the affected party additional time and money.

Additionally, there are undesirable web robots created for harvesting email addresses from websites that will be used or sold for spamming purposes.

There is also a type of bot that scans the web looking for unlicensed content in documents, movies and MP3 files. These bots belong to companies that claim to protect the rights of others while inadvertently violating the privacy of webmasters.

Similarly, some crawlers are used for spying purposes by collecting keywords, meta descriptions, web technology information and other details that ultimately will be sold to competitors.

But the most dangerous of all are malware bots, which inject viruses or malware into websites or scan them for security vulnerabilities to exploit. It’s no surprise that blogs and eCommerce sites that become targets of these practices end up being hacked and injected with porn or Viagra links.

All these undesirable bots tend to look for information that is normally off limits and, in most situations, they completely disregard robots.txt directives.

On the other hand, there are legitimate reasons for crawling a site, for instance to get a site indexed by search engines or to retrieve data for affiliate partners. This post, however, focuses on how to block malicious bots and crawlers at the server level and protect your site from spam.

Blocking User Agents with Robots.txt

The first simple solution is to disallow a User Agent through robots.txt. Simply add the following to your robots.txt, replacing the placeholder with the name of the User Agent you want to block:

User-agent: BotNameYouWantToBlockGoesHere
Disallow: /

As mentioned before, the vast majority of bad bots, and in some instances good bots like Yahoo! Slurp, Googlebot and MSN bot, do not respect robots.txt. Under those circumstances, more drastic measures are needed.

Blocking Spam

In a recent post, Dr. Peter J. Meyers, a cognitive psychologist by training and a programmer by blood, revealed his secrets for spam fighting using an algorithmic approach.

His post explains in detail the filters he has been using in his war against spam.

A top-rank member of SEOMoz.org, Dr. Pete goes beyond simple explanation. He provides an anti-spam solution that has been in the making for more than two years.

The PHP code filters out words and links, strips out some HTML tags and even takes into consideration vowel-to-consonant ratios for non-Roman (also known as non-Latin) characters such as Chinese and Cyrillic. The Cyrillic alphabet is the writing system used by several Slavic languages, including Belarusian, Bulgarian, Macedonian, Russian, Serbian and Ukrainian.

Blocking Bots, Scrapers and Crawlers with .htaccess

After careful research, I found information that led me to implement some hacks in the .htaccess file. If you are not familiar with .htaccess, or you don't feel comfortable editing the file, I suggest you read the Apache Tutorial on .htaccess files and review the mod_rewrite documentation.

Even though this code has been tested in several sites, nothing is 100% guaranteed. Use it at your own risk if you are certain of your skills. SpanishSEO.org, its members and/or related parties shall not be held responsible for a server failure, website crash or any sort of malfunction caused as a result of using the code explained below.

All you need to do is copy and paste the following code into your .htaccess file. You will also find a copy of the files at the end of the post.

# Blocking Bots and Spiders
RewriteEngine On
# Always serve the sitemap, even to the user agents blocked below
RewriteCond %{REQUEST_URI} =/sitemaps.xml
RewriteRule ^ - [L]
RewriteCond %{REMOTE_ADDR} ^77\.91\.224\. [OR]
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} discobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} LinksManager\.com_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkwalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} webalta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} WebCollage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zermelo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteRule .* bot-response.php [L]

Let's break down the above code into sections for better understanding:


RewriteEngine On

You need to turn the rewrite engine on in the first line, before any other rewrite statements. If you need to turn it off for testing purposes, simply replace "On" with "Off"; there is no need to remove the rest of the code.


RewriteCond %{REQUEST_URI} =/sitemaps.xml
RewriteRule ^ - [L]

This short-circuit rule indicates that if the bot asks for the sitemap, the sitemap is delivered and none of the rules below apply. The variable needs the % prefix and, because %{REQUEST_URI} always starts with a slash, the comparison string must start with one too. The dash means "leave the URL unchanged", and the [L] flag marks the last rule a request will see, which literally means "stop the rewriting process here and don't apply any more rewriting rules".
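
The same short-circuit can exempt more than one file. The sketch below is an assumption rather than part of the original rule set: it also lets every client fetch robots.txt, on the reasoning that well-behaved crawlers have to read that file before they can obey it.

```apache
RewriteEngine On

# Hypothetical extension: exempt both files from the blocking rules.
# The dash leaves the URL untouched; [L] stops further rule processing.
RewriteCond %{REQUEST_URI} ^/(robots\.txt|sitemaps\.xml)$
RewriteRule ^ - [L]
```

Place it above the blocking conditions so that it is evaluated first.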


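RewriteCond %{REMOTE_ADDR} ^77\.91\.224\. [OR]

This line blocks IPs either individually or by range. %{REMOTE_ADDR} always holds the visitor's IP address, whereas %{REMOTE_HOST} may hold a host name instead when the server resolves host names. The pattern is a regular expression, so escape each dot with a backslash. To add one IP, copy the line and write the full address; to block a range, keep only the leading part of the address, as above. Make sure the [OR] stays at the end of the line because it leads to the next condition. In the example above, an IP range that is well known for scraping gets blocked.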
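
As a minimal sketch of both cases, the block below refuses one single address and one range. The single address 203.0.113.45 is hypothetical (it comes from a range reserved for documentation), and the 403 response is just one possible outcome; substitute your own addresses and rule.

```apache
RewriteEngine On

# Single address: anchor both ends so that 203.0.113.450 cannot match.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$ [OR]
# Range: anchor only the start, so every 77.91.224.x address matches.
RewriteCond %{REMOTE_ADDR} ^77\.91\.224\.
RewriteRule ^ - [F]
```

Keeping the trailing escaped dot in the range pattern matters: without it, ^77\.91\.224 would also match 77.91.2240-style prefixes of other octets such as 77.91.224x only by accident of the regular expression, not by intent.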


RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]

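This line blocks bots based on the user agent. "NC" makes the pattern case-insensitive, so the name of the bot matches in upper case, lower case or any mix of the two. Without "NC" the match is case-sensitive. A pattern that starts with ^ must match the beginning of the user agent string; a pattern without it matches anywhere in the string. "OR" joins a condition to the next one. Without it, consecutive conditions are combined with AND, so every condition except the last needs it.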
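
To make the difference concrete, here is a small sketch with one anchored, case-sensitive condition and one unanchored, case-insensitive condition. The sample user agent strings in the comments are illustrative assumptions, not entries from a real log.

```apache
RewriteEngine On

# Anchored and case-sensitive: matches "Wget/1.21",
# but not "wget/1.21" and not "Mozilla (Wget)".
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
# Unanchored and case-insensitive: matches "HTTrack 3.0x",
# "httrack" and "Mozilla/4.5 (compatible; HTTrack 3.0x)".
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule ^ - [F]
```

Anchored patterns are stricter and less likely to block a legitimate browser by accident; unanchored patterns catch bots that bury their name inside a longer string.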


RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]

This is the last condition, right before the rule, and it MUST NOT include "OR" because there is no further condition to join. [NC] is still helpful to cover lower or upper case user agents.
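
If keeping track of the [OR] flags becomes error-prone, several names can be folded into a single condition with regular expression alternation. This is a sketch of an alternative layout, not the form used in the list above; the three names are taken from that list.

```apache
RewriteEngine On

# One case-insensitive condition replaces three chained [OR] lines.
# Add further names between the pipes.
RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebCopier|WebZIP) [NC]
RewriteRule ^ - [F]
```

Note that this version matches slightly more than the original lines, since WebCopier and WebZIP are no longer anchored to the start of the string or case-sensitive.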


RewriteRule .* bot-response.php [L]

This line is the rule that calls the file bot-response.php, which serves an image. An alternative to this final line is a 403 Forbidden response. The code for 403 Forbidden is:


RewriteRule ^.* - [F,L]

"F" returns a 403 Forbidden response and "L" tells mod_rewrite that this is the last rule that needs to be processed in this case, and to stop rewriting as soon as it is processed. The "F" flag already implies "L", so listing both is harmless but not required.

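The code for bot-response.php is:

<?php
// Tell the client that the response body is a GIF image.
header('Content-Type: image/gif');
// Stream the image that sits next to this script.
readfile(__DIR__ . '/bot-response.gif');
exit;

Make sure nothing, not even a blank line, comes before the opening "<?php" tag; any output sent before header() prevents the header from being set and PHP will give you a warning. The closing "?>" tag is left out on purpose so that stray whitespace after it cannot end up in the image data.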

The bot-response.php file serves the picture bot-response.gif, which in this case is a 1000x600 GIF file that will break the scraper's layout if the content is shown elsewhere. Hopefully this will discourage scrapers from using stolen content on a different site.

You can change the name, format and size of the picture and file at your convenience. Put the name of the new image in bot-response.php and, if you rename the PHP file itself, update the file name in the final RewriteRule of the .htaccess hack above.

Remember to place bot-response.php and the image in the document root. You can test the file by entering yoursite.com/bot-response.php in the address bar.

You can download the following files here:

  • .htaccess
  • bot-response.php
  • bot-response.gif

Now that you have the tools to block bots, scrapers, crawlers and spammers, you need to know which ones to block. That is covered in the next post, How to Identify User-Agents and IP Addresses for Blocking Bots and Crawlers.
