Thursday, August 1, 2013

July Bots Are In

The July stats for web crawlers and indexing bots are in. Here are the top bots for a small site I run.

Top 20 Obvious Bots


These bots are nice enough to include "bot", "spider", or "crawl" in their user agent string, or to request the robots.txt file. Here are the top 20, representing 89% of obvious bot hits:

  1. 18% - Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
  2. 13% - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  3. 12% - Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  4. 7% - Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
  5. 5% - Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
  6. 5% - Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)
  7. 5% - Mozilla/5.0 (compatible; WBSearchBot/1.1; +http://www.warebay.com/bot.html)
  8. 4% - Twitterbot/1.0
  9. 3% - Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
  10. 3% - Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)
  11. 2% - ShowyouBot (http://showyou.com/crawler)
  12. 2% - Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/)
  13. 2% - Aboundex/0.3 (http://www.aboundex.com/crawler/)
  14. 2% - Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  15. 2% - Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
  16. 2% - msnbot/2.0b (+http://search.msn.com/msnbot.htm)
  17. 1% - Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)
  18. 1% - Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/)
  19. 1% - Mozilla/5.0 (compatible; Dow Jones Searchbot)
  20. 1% - Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
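The rule behind this list can be sketched in a few lines of Python. This is only a rough illustration, not the script that produced the numbers above: it assumes the access log has already been parsed into (user agent, path) pairs, and the function names are made up for the example.

```python
import re
from collections import Counter

# Case-insensitive match for the three tell-tale words.
BOT_PATTERN = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def obvious_bot_agents(hits):
    """Return the set of user agents that look like obvious bots.

    hits is a list of (user_agent, path) tuples (an assumed format).
    """
    # Any agent that ever asked for robots.txt counts as a bot.
    robots_agents = {ua for ua, path in hits if path == "/robots.txt"}
    return {
        ua for ua, _ in hits
        if BOT_PATTERN.search(ua) or ua in robots_agents
    }


def top_bot_shares(hits, limit=20):
    """Return [(user_agent, percent_of_bot_hits), ...] for the top agents."""
    bots = obvious_bot_agents(hits)
    counts = Counter(ua for ua, _ in hits if ua in bots)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ua, round(100 * n / total)) for ua, n in counts.most_common(limit)]
```

A plain substring match like this will also catch any ordinary user agent that happens to contain "bot", so the output is worth a quick look by eye.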

Top 20 Developer Packages or Proprietary Bots


These bots are built on developer packages or are proprietary, but don't specifically identify themselves as bots. The top 20 represent 92% of hits from these bots.

  1. 19% - checks.panopta.com
  2. 16% - NING/1.0
  3. 13% - UnwindFetchor/1.0 (+http://www.gnip.com/)
  4. 10% - FeedBurner/1.0 (http://www.FeedBurner.com)
  5. 7% - JS-Kit URL Resolver, http://js-kit.com/
  6. 5% - UniversalFeedParser/5.0.1 +http://feedparser.org/
  7. 4% - PycURL/7.19.5
  8. 3% - Java/1.6.0_26
  9. 3% - TwitterFeed 3
  10. 2% - HTMLParser/2.0
  11. 2% - Ruby
  12. 2% - Mozilla/5.0 (Digg/1.0; support@digg.com)
  13. 1% - Java/1.7.0_21
  14. 1% - Crowsnest/0.5 (+http://www.crowsnest.tv/)
  15. 1% - curl/7.24.0
  16. 1% - Opera/7.11 (Windows NT 5.1; U) [en]
  17. 1% - MetaURI API/2.0 +metauri.com
  18. 1% - Jakarta Commons-HttpClient/3.1
  19. 1% - InAGist URL Resolver (http://inagist.com)
  20. 1% - Mozilla/5.0

Plus these other notables:

  1. Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; subscribers; feed-id=)
  2. Mozilla/5.0 (compatible; Embedly/0.2; +http://support.embed.ly/)
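One way to spot this group is to match the user agent against known HTTP library names. The sketch below is an assumption about how that could be done, not the method used for the list above. The prefixes and bare agent strings are taken from the entries in this post, and the function name is hypothetical.

```python
# Library-style prefixes seen in the list above (not an exhaustive set).
LIBRARY_PREFIXES = (
    "curl/",
    "PycURL/",
    "Java/",
    "Jakarta Commons-HttpClient/",
    "UniversalFeedParser/",
    "HTMLParser/",
)

# Agents that are suspicious because they are nothing but a bare token.
BARE_AGENTS = {"Ruby", "Mozilla/5.0"}


def looks_like_dev_package(user_agent):
    """Return True if the user agent looks like a raw HTTP library."""
    ua = user_agent.strip()
    return ua in BARE_AGENTS or ua.startswith(LIBRARY_PREFIXES)
```

Proprietary agents such as NING/1.0 or checks.panopta.com don't follow any common pattern, so they would still need to be added by hand.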

Top 20 Sneaky Bots


These bots either don't identify themselves, mask their identity using a common real user agent, or don't include a user agent at all. I identify these by hits from the same or similar IP addresses, a complete lack of referring URLs, or too many hits from the same IP address.

  1. From IP 168.62.192.113 (Microsoft) with user agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.163 Safari/535.19".
  2. More coming soon
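The three signals above can be roughly approximated in Python. This is a sketch under several assumptions: hits are already parsed into (IP, referrer) pairs, "similar" IP addresses are treated as sharing a /24 block, a missing referrer is logged as "" or "-", and both thresholds are arbitrary placeholders rather than values used for this post.

```python
import ipaddress
from collections import defaultdict


def subnet_of(ip):
    """Map an IPv4 address to its /24 block, e.g. '1.2.3.4' -> '1.2.3.0/24'."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def suspicious_subnets(hits, max_hits=100, min_hits=10):
    """Return {subnet: [reasons]} for address blocks that look automated.

    hits is an iterable of (ip, referrer) pairs (an assumed format).
    max_hits and min_hits are hypothetical thresholds to tune per site.
    """
    referrers_by_subnet = defaultdict(list)
    for ip, referrer in hits:
        referrers_by_subnet[subnet_of(ip)].append(referrer)

    flagged = {}
    for subnet, referrers in referrers_by_subnet.items():
        reasons = []
        if len(referrers) > max_hits:
            reasons.append("too many hits")
        # Only judge the referrer signal once there are enough hits to matter.
        if len(referrers) >= min_hits and all(r in ("", "-") for r in referrers):
            reasons.append("no referrers")
        if reasons:
            flagged[subnet] = reasons
    return flagged
```

Grouping by /24 is a blunt way to catch a bot that rotates through neighboring addresses. It can also lump together unrelated visitors on the same network, so flagged blocks are candidates for a manual check rather than a verdict.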