How to Identify User-Agents and IP Addresses for Blocking Bots and Crawlers
Good article. I am programming a real-time bot detector, because many of my problems come from stealth bots, i.e. bots with a normal user-agent.
Thanks for stopping by avanzaweb. Welcome to SpanishSEO :)
Visible bots are easier to deal with than stealth bots, but they are still beatable.
We are also testing a real-time bot detector in PHP. We will have more details once the final testing is done.
Please let me know when you have your solution in place to check it out.
Augusto,
What’s the big deal in doing a round-trip reverse and forward lookup on a possibly spoofed IP? I suppose that reversing a faked Yahoo! IP will still end up at a hostname like ‘*crawl.yahoo.net’. And in the same way, by doing a forward lookup on that hostname, we’ll get the same faked Yahoo IP… right?
Maybe I missed something in the whole explanation… could you enlighten me?
Thanks a lot.
Hi Alvaro,
By doing a round-trip DNS lookup you will be able to identify whether an IP, User-Agent and/or hostname is spoofed or legitimate. The information simply will not match across all of them, and that mismatch is your biggest evidence that something is not right and that you are potentially dealing with a fake crawler.
You cannot depend on only one factor. Reverse and forward lookups should be run together and the results compared. And if you want to add one more layer of verification, look up the whois information for that IP.
A spoofed bot is usually easy to detect based on its IP, because the reverse hostname will not contain the name of a major Search Engine or a quality website. A bot can call itself whatever it wants in the User-Agent, but the truth is that anyone can fake that identifying information.
For instance, a bot coming from a proxy IP with a User-Agent claiming to be the Yahoo! bot most likely will not reverse to crawl.yahoo.net. However, in some instances Search Engines like Yahoo also use IPs that are outside their usual IP range. Yahoo, Microsoft and Google are well known for using proxy IPs that don’t fall into their own ranges. In fact, if you check your logs you will probably find IPs from bots in Brazil, India or China that appear to belong to Google, even though their User-Agents may or may not identify them as Google’s. So if you run a reverse lookup, you can get the hostname, and it will be a stronger indicator that the bot, even without proper identification, belongs to the Search Engine. But this alone is not enough to prove that the bot indeed belongs to Google.
But what if you find unnamed bots, with no User-Agent information, coming from IPs that don’t fall into the IP ranges of the major Search Engines? That doesn’t necessarily mean they are fake bots. A reverse lookup will give you the hostname of that bot, which could be something like xyz.crawl2.yahoo.com. And if you perform a forward DNS lookup on that hostname, it should give you the same IP address that you originally found. Conversely, if the forward lookup gives you a different IP, then most likely you are dealing with a fake bot.
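The round-trip check described above can be sketched in a few lines. This is a minimal Python illustration (the commenters mention PHP, but the logic ports directly); the trusted hostname suffixes are example assumptions, and the demo uses stub resolvers so it runs without network access:

```python
# Round-trip DNS verification sketch. TRUSTED_SUFFIXES is an illustrative
# assumption; adjust it to the crawlers you actually want to verify.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".crawl.yahoo.net")

def verify_bot_ip(ip, reverse=None, forward=None):
    """Return True if ip reverse-resolves to a trusted hostname and that
    hostname forward-resolves back to the same ip."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
    except OSError:
        return False  # no PTR record: identity cannot be confirmed
    if not host.endswith(TRUSTED_SUFFIXES):
        return False  # hostname is not from a known crawler domain
    try:
        return ip in forward(host)  # forward lookup must match original IP
    except OSError:
        return False  # forward lookup failed: treat as spoofed

# Deterministic demo with stub resolvers (no network needed):
legit = verify_bot_ip(
    "66.249.70.244",
    reverse=lambda a: "crawl-66-249-70-244.googlebot.com",
    forward=lambda h: ["66.249.70.244"],
)
spoofed = verify_bot_ip(
    "203.0.113.9",
    reverse=lambda a: "crawl-66-249-70-244.googlebot.com",  # claims Google
    forward=lambda h: ["66.249.70.244"],                    # but doesn't match
)
print(legit, spoofed)  # True False
```

In production you would drop the stubs and let the defaults call `socket.gethostbyaddr` and `socket.gethostbyname_ex` against real DNS.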
I hope this explanation helps more.
Cheers,
Ok, but what if someone simply fakes the initial IP information, providing one from a trusted bot? I presume it would take us to a valid hostname, and so the forward lookup would end up at the same faked yet trusted IP, right?
So the question is…. could a bot do that (fake to a trusted IP) to pass as a valid bot?
Thanks very much.
Yes, a bot could do that!
IP spoofing happens through botnets and DoS (Denial of Service) attacks. However, it requires a high degree of understanding of different technologies.
In situations like that there are other things that can be implemented to further protect your site from fake bots/attacks. Packet filtering is one of them.
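Real packet filtering is normally done at the firewall or kernel level (iptables and the like), but the core decision it makes can be illustrated in a few lines. A sketch of that membership test, using documentation-only example CIDR ranges as a stand-in block list:

```python
# Application-level sketch of the packet-filtering idea: reject traffic
# whose source IP falls inside networks you have chosen to block.
# The CIDR blocks below are RFC 5737 documentation ranges, not real threat data.
import ipaddress

BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),  # example "bad" range
    ipaddress.ip_network("203.0.113.0/24"),   # example "bad" range
]

def is_blocked(source_ip):
    """Return True if source_ip belongs to any blocked network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

print(is_blocked("203.0.113.77"))   # True  (inside a blocked range)
print(is_blocked("66.249.70.244"))  # False (not in the block list)
```

A real firewall applies this same check per packet before your application ever sees the request, which is what makes it effective against spoofed traffic.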
Augusto, thanks for the answers.
I’ve been testing reverse lookups on this Google IP list and trying to gather the hostnames, but I found that most of those IPs have no host defined: out of nearly 265 tests, only three returned valid hosts. See:
216.239.45.4 >: 216-239-45-4.google.com
216.239.51.96 >: kc-in-f96.google.com
216.239.51.97 >: kc-in-f97.google.com
If, as you said, bots from a trusted source don’t always use a consistent User-Agent, and those bots don’t always resolve to a legitimate hostname, could we consider it practically impossible to identify the desired bots in a website’s logs? If not, what other consistent method could we use?
Even though you may run into the issues explained above, the identifying information of the most important bots/crawlers tends to remain stable: for example, Googlebot coming from the IP 66.249.70.244, with the hostname crawl-66-249-70-244.googlebot.com, and identifying itself as googlebot.com. Search Engines are aware that small changes to those important bots would cause serious headaches for webmasters.
For increased protection, use a white-listing method. That means you command your server to ONLY allow bots coming from a list of pre-identified IPs, hostnames and User-Agents. All other bots will be forbidden and will not be able to access your information. That can also be combined with massive IP blocking of geographical areas that you are not particularly interested in targeting and that might represent a threat to your security or information.
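The white-listing idea can be sketched as a lookup table of approved crawler identities. This is an illustrative Python fragment, not a complete list; the User-Agent substrings and hostname suffixes below are assumptions you would replace with the bots you actually want to allow:

```python
# White-listing sketch: allow a request only when its User-Agent and
# reverse-resolved hostname both match a pre-approved entry.
# The entries below are illustrative assumptions, not a complete list.
WHITELIST = [
    {"ua_substring": "Googlebot", "host_suffix": ".googlebot.com"},
    {"ua_substring": "Slurp",     "host_suffix": ".crawl.yahoo.net"},
]

def is_whitelisted(user_agent, hostname):
    """Return True only when some whitelist entry matches both fields."""
    return any(
        entry["ua_substring"] in user_agent
        and hostname.endswith(entry["host_suffix"])
        for entry in WHITELIST
    )

print(is_whitelisted(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "crawl-66-249-70-244.googlebot.com",
))  # True
print(is_whitelisted("SomeRandomBot/1.0", "203-0-113-9.example.net"))  # False
```

In practice the hostname would come from the round-trip DNS check discussed earlier, so a spoofed User-Agent alone is never enough to get through.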
The main drawback of the white-listing method is that good sources of traffic can sometimes get blocked. Even visitors accessing your site through coffee shop proxies may not get through, and that can cost you some traffic. Additionally, you will have to keep an eye on changes to the bots you care about, though these changes are very infrequent.