LogZ Free Log Analyzer for User-Agent and IP Address Detection


LogZ Free Log Analyzer

LogZ is a Free Log Analyzer developed in PHP that retrieves data from raw log files and organizes it in a simple yet relevant way for further analysis. LogZ works with any log file in the Apache combined log format.

Now you can know exactly which bots and crawlers visit your website. Our Free Log Analyzer tabulates data on User-Agents (UA), the number of visits they made, and shows their IP Addresses or Hostnames. LogZ also provides additional tools to perform Reverse and Forward DNS Lookups, get the Whois information of a Domain Name or IP Address, and find the Geo Location of an IP Address.
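To give a rough idea of the parsing involved, here is a minimal Python sketch (not LogZ's actual PHP code) that splits one Apache combined-format line into its fields. The regular expression, the function name parse_line, and the sample entry are assumptions made for illustration; escaped quotes inside a field are not handled.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_line(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    match = COMBINED_RE.match(line.strip())
    return match.groupdict() if match else None

# Made-up sample entry, for illustration only.
sample = ('72.232.237.202 - - [10/Oct/2009:13:55:36 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "Python-urllib/1.17"')
fields = parse_line(sample)
print(fields["host"], fields["agent"])
```

The host field holds either an IP Address or a Hostname, depending on how the server is configured to log visitors.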

This can be used to help you update your blocked robots and crawlers lists as discussed in my previous post titled Practical Solutions for Blocking Spam, Bots and Scrapers.

I. Download LogZ Free Log Analyzer

To get started, follow these steps:

  1. Download LogZ and unzip it.
  2. Upload logz.php to the root of your site.
  3. Create a folder with the name logz in the root directory. If you decide to use a different folder name, then change line 12 of logz.php and include the new folder path.
    define("LOGFOLDER","./logz"); ?>
  4. For security purposes, password protect the logz folder. If you are using cPanel, use the option Password Protect Directories shown on the left side of this image:
    Password Protected Directories

    If you are not using cPanel, you can password protect the folder by using this .htaccess and .htpasswd password generator.

  5. Download a raw log file from your server and unpack it.
  6. Upload the unpacked log file to the logz folder. You can upload as many log files as you want to the same folder.
  7. Run the script by opening yoursite.com/logz.php in your browser.

II. Blocking Bots and Crawlers with LogZ

Before I start explaining the features of LogZ Free Log Analyzer, I highly recommend that you read How to Identify User-Agents and IP Addresses for Blocking Bots and Crawlers to get familiar with the basic concepts of User-Agents, Hostnames, and DNS Lookups.

This explanation will also show you the steps taken to decide which bots and crawlers to block.

Section I: User and Log File Information

Once LogZ is set up and running, the first section to appear in your browser will look similar to this one:

LogZ Free Log Analyzer

There are 8 important areas in this section that contain relevant information:

  1. Your browser sends this User-Agent header - shows the User-Agent string with details of your browser and your system.
  2. From this IP Address - shows your current IP Address.
  3. Databases - are links to a couple of sources of information that provide details on bots and User-Agents.
  4. File Name - lists the log files of the site under analysis. For example purposes I uploaded two log files named website1.org and website2.net. These files are not included with the script.
  5. Size (In Bytes) - shows information about the size of each log file expressed in bytes.
  6. Date and Time - shows the date and time when the log files were downloaded from the server.
  7. Sort by UA - is the first column of the User-Agent (UA) Information section. This link will take you to the list of User-Agents in alphabetical order.
  8. Sort by # of Visits - this link will take you to the list of User-Agents arranged by the number of visits made to the website.

If you click on the LogZ logo you will be redirected to this post for questions and additional information.

For easier navigation, most links are orange and become underlined when you hover over them. The only exceptions are the Sort by header titles and the Creative Commons links.

I will use website1.org sorted by UA to show you how LogZ works.

Section II: User-Agent Listing

After clicking on Sort by UA you will be taken to the User-Agent listing section, where the data is sorted in alphabetical order (#2 in the image below). If you had chosen Sort by # of Visits in the prior section instead (#8 in the image above), the data would be sorted by the number of visits (#1 in the image below).

You can sort the listing either way at any time by clicking on the header title of each column. Should you want to re-arrange the listing according to the number of visits, click on Sort by # of Visits (#1). You can always go back to the prior section by clicking back (#3).

Sorting by User Agent

In case you are wondering about the number of visits of a User-Agent, it reflects the number of entries in the log file that carry that same UA.

A good understanding of User-Agent strings is vital for detecting bad bots and crawlers in this section. For instance, a quick check through the list of User-Agents sorted by the number of visits shows a suspicious UA named 'Python-urllib/1.17'. This User-Agent draws attention because the string says nothing about who operates it and because it points to a script written in the Python programming language.
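The counting itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LogZ's PHP code: the function name count_user_agents and the sample entries are made up, and the User-Agent is assumed to be the last quoted field of each line, as in the Apache combined format.

```python
import re
from collections import Counter

# In the combined format the User-Agent is the last quoted field on the line.
AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Count how many log entries carry each User-Agent string."""
    counts = Counter()
    for line in lines:
        match = AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up entries, for illustration only.
log_lines = [
    '192.0.2.10 - - [10/Oct/2009:13:55:36 -0700] "GET / HTTP/1.1" 200 512 "-" "Python-urllib/1.17"',
    '192.0.2.10 - - [10/Oct/2009:13:55:40 -0700] "GET /a HTTP/1.1" 200 512 "-" "Python-urllib/1.17"',
    '192.0.2.20 - - [10/Oct/2009:13:56:02 -0700] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
]
counts = count_user_agents(log_lines)

# Two orderings, comparable to "Sort by UA" and "Sort by # of Visits".
by_name = sorted(counts.items(), key=lambda item: item[0].lower())
by_visits = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

Sorting by name ignores letter case here; whether LogZ does the same is not stated in this post.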

Log Analyzer Bot Detection

You can either look up the suspicious User-Agent in the databases Bots vs Browsers or User-Agents, or do a quick search in any Search Engine using quotation marks for more precise results (e.g. "suspicious User-Agent goes here"). If the information about the UA indicates that you should block it, then you can use any of the methods listed in Practical Solutions for Blocking Spam, Bots and Scrapers.

The next step in the log file analysis process is to find out the IP Address or Hostname of the suspicious UA. To do so, click on the number 2 to the left of 'Python-urllib/1.17'. You will be taken to the third section.

Section III: IP Address and Hostname Identification

The third section contains data sorted by IP Address or Hostname (#1) and the number of visits. A simple difference between IPs and Hostnames is that the former are purely numerical (e.g. 16.25.45.164 for IPv4), while the latter can combine numerical and non-numerical characters (e.g. crawl-66-249-70-244.googlebot.com).
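If you want to make this distinction in a script, one possible approach is to let Python's standard ipaddress module decide. The helper name is_ip_address is an assumption made for this sketch; anything that does not parse as an IPv4 or IPv6 address is treated as a Hostname.

```python
import ipaddress

def is_ip_address(value):
    """Return True if value parses as an IPv4 or IPv6 address, False otherwise."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

for client in ("16.25.45.164", "crawl-66-249-70-244.googlebot.com"):
    kind = "IP Address" if is_ip_address(client) else "Hostname"
    print(client, "->", kind)
```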

Furthermore, due to the upcoming changes in non-ASCII Multi Language Domain Names, you should take into consideration the following paragraph from Wikipedia to recognize Hostnames:

Unlike domain names, hostname labels can only be made up of the ASCII letters 'a' through 'z' (case-insensitive), the digits '0' through '9', and the hyphen. Labels cannot start nor end with a hyphen. Special characters other than the hyphen (and the dot between labels) are not allowed, although they are sometimes used anyway.
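The rule quoted above translates fairly directly into a check. The following Python sketch covers only what the quote states (allowed characters, no leading or trailing hyphen in a label); length limits are not checked, and the function name is_valid_hostname is an assumption.

```python
import re

# One label: ASCII letters, digits, and hyphens, with no hyphen at either end.
LABEL_RE = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?')

def is_valid_hostname(name):
    """Check every dot-separated label against the quoted hostname rule."""
    labels = name.rstrip(".").split(".")
    return all(LABEL_RE.fullmatch(label) is not None for label in labels)

print(is_valid_hostname("crawl-66-249-70-244.googlebot.com"))
print(is_valid_hostname("-bad.example.com"))
```

Note that a dotted IPv4 address also passes this check, so it is best combined with a separate test for IP Addresses.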

Additional information about IP Address and Hostnames can be found by following the links.

In #2, LogZ Free Log Analyzer shows the Hostname 202.237.232.72.static.reverse.ltdomains.com, which belongs to the suspicious bot detected earlier.

Sorting by Hostname and IP

Round Trip DNS Lookups

As suggested in the best alternatives to identify undesirable bots and crawlers, performing Forward and Reverse DNS Lookups to validate the authenticity of a crawler is the way to go for maximum protection.

Moreover, running a Whois Lookup based on the User-Agent or IP Address to verify if the crawler falls into a legitimate IP range will give you an additional layer of confirmation.

You will be able to perform all these tasks with LogZ Free Log Analyzer. LogZ comes with a very handy tool provided by WebSitePulse.com that will allow you to run diagnostic tests for your website.

1. Forward DNS Lookup

If you get a Hostname instead of an IP Address, you will have to perform a Forward DNS Lookup to find the respective IP. Using the suspicious bot as an example, follow these steps:

  1. Click on the DNS tab.
  2. Select Hostname test.
  3. Enter the Hostname 202.237.232.72.static.reverse.ltdomains.com in the Hostname/IP field.
  4. Don't forget to type in the security code and hit the 'Test it' button to get the details.
  5. The results will be displayed in a new page at WebSitePulse.com.

DNS Lookup

Here is a closer view of the Forward DNS Lookup tool.

Forward DNS Lookup

To do a manual Forward DNS Lookup using Windows XP/Vista/NT, follow these steps:

  1. Go to Start, then click on Run.
  2. Enter cmd.
  3. Enter nslookup <hostname here> in the black window with the user prompt ("_"). For example: nslookup 202.237.232.72.static.reverse.ltdomains.com.
  4. The IP will be shown as:
    Non-authoritative answer:
    Name: 202.237.232.72.static.reverse.ltdomains.com
    Address: 72.232.237.202

Mac users can simply open a Terminal window instead of using 'Start/Run/cmd'.

The results of the Forward DNS Lookup provide the IP Address 72.232.237.202. With this information you can block the bot using the .htaccess code detailed in Practical Solutions for Blocking Spam, Bots and Scrapers. However, it's recommended to perform a Reverse DNS Lookup as well.

2. Reverse DNS Lookup

To verify whether the IP 72.232.237.202 corresponds to the Hostname, the next step will be to perform a Reverse DNS Lookup.

Simply click on DNS, select Reverse DNS, enter the information requested and hit 'Test it'.

Reverse DNS Lookup

The suspicious User-Agent with IP 72.232.237.202 returns the same Hostname previously checked.

In some cases the IP will return a different Hostname or no information whatsoever. To relieve the burden of doubt, you can perform a Whois Lookup.
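The round trip described in these two steps can also be scripted. The following is a minimal Python sketch using the standard socket module, not part of LogZ or WebSitePulse.com; the function names are assumptions, and the lookups are passed in as parameters so the logic can be tried without network access.

```python
import socket

def reverse_lookup(ip):
    """Reverse DNS: IP Address -> Hostname, or None when nothing is returned."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def forward_lookup(hostname):
    """Forward DNS: Hostname -> list of IPv4 addresses (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []

def round_trip_ok(ip, reverse=reverse_lookup, forward=forward_lookup):
    """True when the IP's Hostname resolves back to the same IP Address."""
    hostname = reverse(ip)
    if hostname is None:
        return False
    return ip in forward(hostname)

# Example call (requires network access):
# round_trip_ok("72.232.237.202")
```

A False result only means the round trip could not be confirmed. As noted above, some IPs return a different Hostname or nothing at all, so it is a reason to investigate further rather than proof of a bad bot.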

3. Whois Lookup

To check the Whois Record of a website, enter the Domain Name or IP Address in the box below. The results are provided by DomainTools.com. Keep in mind that WebSitePulse.com also provides this service in the tool above.

Using the IP Address of the suspicious bot, you can find information about the company that IP belongs to, including email, phone number and other contact details. In case there is a rogue bot using the hosting services of a US-based company, you can send the hosting company a Digital Millennium Copyright Act (DMCA) takedown notice.

Another good reason to use this tool is if, for example, you are dealing with a fake UA like 'Googlebot-pictures/v2.0' that portrays itself as Googlebot. If the IP range of this fake UA does not fall into Google's IP range, then most likely it is a rogue crawler.
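Once you know the IP ranges a Search Engine uses, the range check can be sketched in Python with the standard ipaddress module. The ranges below are placeholders reserved for documentation, not Google's real ranges, and the names ip_in_ranges and PUBLISHED_RANGES are assumptions made for this example.

```python
import ipaddress

def ip_in_ranges(ip, cidr_ranges):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)

# Placeholder ranges reserved for documentation, NOT a real Search Engine's ranges.
# Replace them with the ranges found through Whois or published by the Search Engine.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

print(ip_in_ranges("192.0.2.55", PUBLISHED_RANGES))
print(ip_in_ranges("72.232.237.202", PUBLISHED_RANGES))
```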

4. Geo IP Location

Want to know more details about the location of the rogue bot? No problem. To look up the geographical location of an IP Address, enter the number in the search box below. If you have more than one IP Address, enter them separated by a single space.

IP Location Lookup

After clicking on the Find IP Location button, the results will be displayed in a new page at IP2Location.com.

Sometimes locations vary depending on the Internet Service Provider (ISP) and other factors.

With all this information at hand, plus what you can find online about a bot like 'Python-urllib/1.17', you can make a final blocking decision with no hesitation.

III. Things to Keep in Mind

  • Make sure to check your log recording configuration. Not all servers are properly set up to archive logs automatically. Talk to your hosting company for assistance.
  • Not all Search Engines' IP Addresses return a Reverse DNS. Google and Baidu are good examples of that.
  • A DNS Server, a Mail Server, an FTP Server, and other services can be hosted under one IP Address.
  • One User-Agent can have more than one IP Address or Hostname.
  • Some Search Engines, like Microsoft, use non-existent Top Level Domains (TLDs). This presents a challenge when analyzing User-Agent strings.
  • Not all Search Engines have their URLs in the User-Agent string. However, the serious ones do.
  • Once in a while, check the IP ranges of your most important Search Engines. For a list of IP Addresses of Search Engine Spiders visit iplists.com.
  • A URL in the User-Agent string should not be considered a quality signal. Some URLs are included in the string only for spoofing or misguiding reasons.
  • Verify that the domain name in the User-Agent string truly belongs to the Search Engine by checking the IP range of that UA. For instance, if a bot identifies itself as Googlebot, but its Reverse and Forward DNS Lookups do not match and its IP address is not within Google's IP range, most likely this is a bad bot.
  • There are times when you will want to block a User-Agent using just the IP address. However, don't block a User-Agent based only on this information. Do more qualitative analysis. A good example is when a proxy is used to access your site.
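The Googlebot check from the list above can be sketched as a small classifier in Python. This is an illustration only: the function names, the verdict labels, and the domain list (googlebot.com and google.com) are assumptions to adjust to what the Search Engine itself documents. A missing Reverse DNS result is reported as unverified rather than bad, since not every legitimate IP returns one.

```python
def hostname_in_domains(hostname, domains):
    """True if hostname equals one of the domains or is a subdomain of it."""
    host = hostname.rstrip(".").lower()
    return any(host == d or host.endswith("." + d) for d in domains)

def classify_googlebot_claim(user_agent, reverse_hostname,
                             domains=("googlebot.com", "google.com")):
    """Compare what a User-Agent claims with the Hostname from a Reverse DNS Lookup."""
    if "googlebot" not in user_agent.lower():
        return "no-claim"        # the UA does not present itself as Googlebot
    if reverse_hostname is None:
        return "unverified"      # no Reverse DNS result: check Whois and IP range
    if hostname_in_domains(reverse_hostname, domains):
        return "name-matches"    # still confirm with a Forward DNS Lookup
    return "mismatch"            # claims Googlebot, Hostname says otherwise

print(classify_googlebot_claim(
    "Googlebot-pictures/v2.0",
    "202.237.232.72.static.reverse.ltdomains.com"))
```

A "name-matches" verdict is not a final answer on its own; the Forward DNS Lookup and the IP range check remain part of the decision.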

Last but not least, I would like to thank our friend and colleague Dmitriy Shabaev, on behalf of Spanish SEO, for making this script. Dmitriy is a highly skilled programmer from Ulyanovsk, Russia, who has been working on eCommerce solutions with us. He brings a wealth of knowledge in X-cart customization, with over 5 years of experience working for software development companies and now as a private consultant.