Thursday, August 1, 2013

July Bots Are In

The July web crawler indexing bots stats are in. Here are the top bots for a small site I run.

Top 20 Obvious Bots


These bots are nice enough to include "bot", "spider", or "crawl" in their user agent string, or access the robots.txt file. Here are the top 20, representing 89% of obvious bot hits:

  1. 18% - Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
  2. 13% - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  3. 12% - Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  4. 7% - Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
  5. 5% - Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
  6. 5% - Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)
  7. 5% - Mozilla/5.0 (compatible; WBSearchBot/1.1; +http://www.warebay.com/bot.html)
  8. 4% - Twitterbot/1.0
  9. 3% - Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
  10. 3% - Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)
  11. 2% - ShowyouBot (http://showyou.com/crawler)
  12. 2% - Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/)
  13. 2% - Aboundex/0.3 (http://www.aboundex.com/crawler/)
  14. 2% - Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  15. 2% - Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
  16. 2% - msnbot/2.0b (+http://search.msn.com/msnbot.htm)
  17. 1% - Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.paper.li/entries/20023257-what-is-paper-li)
  18. 1% - Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/)
  19. 1% - Mozilla/5.0 (compatible; Dow Jones Searchbot)
  20. 1% - Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

Top 20 Developer Packages or Proprietary Bots


These bots are built on developer packages, but don't specifically identify themselves as a bot. The top 20 represent 92% of hits from these bots.

  1. 19% - checks.panopta.com
  2. 16% - NING/1.0
  3. 13% - UnwindFetchor/1.0 (+http://www.gnip.com/)
  4. 10% - FeedBurner/1.0 (http://www.FeedBurner.com)
  5. 7% - JS-Kit URL Resolver, http://js-kit.com/
  6. 5% - UniversalFeedParser/5.0.1 +http://feedparser.org/
  7. 4% - PycURL/7.19.5
  8. 3% - Java/1.6.0_26
  9. 3% - TwitterFeed 3
  10. 2% - HTMLParser/2.0
  11. 2% - Ruby
  12. 2% - Mozilla/5.0 (Digg/1.0; support@digg.com)
  13. 1% - Java/1.7.0_21
  14. 1% - Crowsnest/0.5 (+http://www.crowsnest.tv/)
  15. 1% - curl/7.24.0
  16. 1% - Opera/7.11 (Windows NT 5.1; U) [en]
  17. 1% - MetaURI API/2.0 +metauri.com
  18. 1% - Jakarta Commons-HttpClient/3.1
  19. 1% - InAGist URL Resolver (http://inagist.com)
  20. 1% - Mozilla/5.0

Plus these other notables:

  1. Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; subscribers; feed-id=)
  2. Mozilla/5.0 (compatible; Embedly/0.2; +http://support.embed.ly/)

Top 20 Sneaky Bots


These bots either don't identify themselves, mask their identity using a common real user agent, or don't include a user agent. I identify these by hits from same or similar IP addresses, complete lack of any referring URLs, or too many hits from the same IP address.

  1. From IP 168.62.192.113 (Microsoft) with user agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.163 Safari/535.19".
  2. More coming soon


Saturday, June 15, 2013

View Click Stats on Bitly or Goo.gl Short URLs

You may already know that you can view click stats on Bitly short URLs by adding + to the end of the URL, like this:

http://bitly.com/LH9pqK+

But did you know adding + also works for any other service that shortens their URLs through Bitly, like Amazon?

http://amzn.to/wZsYV5+

You can also view data on Google short URL clicks by inserting the shortened ID of a goo.gl URL into a longer Google URL like this (I find that I have to click this link twice to see the metrics):

http://goo.gl/#analytics/goo.gl/9bHF1/week

Saturday, January 26, 2013

How to Get More Twitter Followers Legitimately

The best way to get more Twitter followers is to follow more Twitter accounts. People who may not be aware of your account or who wouldn't find it any other way will be notified that you have followed them. They then become aware of your account any may choose to follow you - or block you if they suspect you are a spammer. If too many people block your account too fast, Twitter may suspend your account. The trick is to find accounts that will follow you back and not block you.

I tested two methods to identify Twitter accounts who are more likely to follow your account, and less likely to block you when you follow them. I call the first method Collaborative Following: accounts that follow several accounts similar to yours. This method resulted in a 49% follow back rate in a test of 495 follows. I call the second method Mutual Following: accounts that follow and are followed back by an account similar to yours. This method resulted in a 31% follow back rate in a test of 491 follows.

Method 1: Collaborative Following

  1. Use Twitter's APIs to download list of your followers and people you follow. You can do this in Excel with data connections, or with web server scripts and a database.

    Download Twitter IDs of your followers (up to 5,000 IDs at a time):
    https://api.twitter.com/1/followers/ids.xml?screen_name=YourScreenName

    Download Twitter IDs of people you follow (aka friends, up to 5,000 IDs at a time):
    https://api.twitter.com/1/friends/ids.xml?screen_name=YourScreenName
  2. Import the lists of your followers and friends into Excel or a database. Augment the IDs you get from those APIs with full information about the accounts. You can do this by pasting REST URLs in your browser, saving the results, and opening the files in Excel; of with web server scripts and a database.

    Download details about accounts (up to 100 at a time):
    https://api.twitter.com/1/users/lookup.xml?include_entities=false&user_id=1234,2345,3456
  3. Identify 10 accounts you trust with between 4,000-10,000 followers each. If you don't know which accounts to choose, you can go with mutual follows (accounts you follow that follow you back), or with accounts from your competition or for similar interest groups. For example, if your account is about cats, you might use other accounts about cat lovers. These will be your targeted accounts.
  4. Identify a few accounts with 4,000-10,000 followers each whose followers would not be interested in following your account. For example, if you account is about cats, you might choose accounts about dog lovers. These will be your anti-targeted accounts and will help you exclude users who are not interested in your subject area, or spammers who follow everyone.
  5. Identify accounts that are willing to follow other accounts. Use the APIs described above to download IDs of accounts that follow your targeted and anti-targeted accounts. These are your prospects.
  6. Generate counts of how frequently each prospect ID appears in the lists of those who follow your targeted accounts. You can do this in Excel with a pivot table, or in a database with SQL. If you identified 10 targeted accounts, this will be a number from 1-10.
  7. Exclude anti-targeted account IDs from your targeted account IDs.
  8. Exclude any IDs who follow fewer than 4 of your targeted accounts. These accounts have not expressed enough interest in your subject area.
  9. Use the lookup API described above to augment the remaining IDs with full account details.
  10. Remove IDs for any account that hasn't tweeted in the past 7-14 days. These accounts are not active enough to read your updates or follow you back.
  11. Remove IDs for any account whose friends count is less than 90% of their followers count. These accounts don't show enough willingness to follow others.
  12. Twitter only lets people follow up to the greater of 2,000 accounts or 10% more accounts than their own followers count. Remove IDs for any account who will be allowed to follow fewer than 10 more people. These accounts are going to be stingy with choices of whom to follow.
  13. Remove IDs for any account that follows more than 10,000 accounts. These accounts get so many Tweets, yours will be lost in the noise. These accounts are more likely to be spam or brand accounts who may not really read Tweets anyway.
  14. Identify a few negative words that identify accounts who would not be interested in your subject area. For example, if your account is about cats, you may want to avoid people who mention "dog" in their profile. Remove IDs of accounts who use these negative words in their descriptions.
  15. These exclusions generally remove 42-48% of prospects. Of the accounts that remain, follow about 100 per day. This moderate rate of following will help keep the number of any account blocks per day at a low level.
Method 2: Mutual Following
  1. Identify an account similar to yours with 4,000-10,000 followers that also follows 4,000-10,000 accounts. Accounts with comparable counts of followers and friends may indicate that the account owner has done some pre-filtering of follower quality for you.
  2. Use the followers and friends APIs described above to download IDs of the targeted account's followers and friends.
  3. Identify IDs that appear in both the followers and friends lists. These are your prospects.
  4. Use The lookup API described above to augment prospect IDs with full account details.
  5. Follow the same exclusion steps described above to remove anti-targeted accounts, inactive accounts, low following rates, limited following, high following counts, and negative words.
  6. Of the accounts that remain, follow about 100 per day.