spot the bot

I have some overhead scripts that fetch data and can add a few seconds to the loading time. Letting site traffic trigger these tasks saves me the trouble of setting up cron jobs, but I don’t want them to run while a visitor or Googlebot is on the site. Besides that, some routines use a lot of resources, which are wasted on some crawlers.

I actually want the crawlers to come around, so I will make two arrays: one with known bots and one with allowed bots. Any bot that is not on the whitelist gets a meager page with the overhead jobs attached to it; everyone else (i.e. visitors and the big search engines) gets the standard page.

There are truckloads of bots (see crawltrack); for my purposes a few regulars will do.

    // Hook it into 'init' so the check runs on every request.
    add_action( 'init', 'spotabot' );

    /**
     * Checks if the visitor is a bot.
     *
     * This function checks the HTTP_USER_AGENT string to see if the
     * visitor is a non-essential bot and defines IS_A_BAD_BOT accordingly.
     *
     * Usage: if(IS_A_BAD_BOT) {}
     *
     * @return void
     */
    function spotabot()
    {
        $bot_list = array("Teoma", "betaBot", "alexa", "froogle", "Gigabot", "inktomi",
        "looksmart", "URL_Spider_SQL", "Firefly", "NationalDirectory",
        "Ask Jeeves", "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot",
        "crawler", "www.galaxy.com", "Googlebot", "Scooter", "Slurp",
        "msnbot", "appie", "FAST", "WebBug", "Radian6", "Spade", "ZyBorg", "rabaz",
        "Baiduspider", "Feedfetcher-Google", "TechnoratiSnoop", "Rankivabot",
        "Mediapartners-Google", "Sogou web spider", "WebAlta Crawler");

        $bot_allowed = array("Googlebot", "Feedfetcher-Google", "Mediapartners-Google", "Slurp", "Baiduspider", "msnbot");

        // Some clients send no user agent at all; treat that as an empty string.
        $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

        foreach($bot_list as $bot) {
            // Case-insensitive match; !== false also catches a match at position 0.
            if(stripos($agent, $bot) !== false)
            {
                // The first matching bot decides: bad unless it is on the allowed list.
                define("IS_A_BAD_BOT", !in_array($bot, $bot_allowed, true));
                return;
            }
        }

        // No known bot matched, so this is a regular visitor.
        define("IS_A_BAD_BOT", false);
    }
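
To try the list against real user agent strings without loading WordPress, the matching can be pulled out into a plain function. The following is only a sketch under a few assumptions: the name `is_bad_bot` and its parameters are made up for illustration, the lists are shortened, and unlike `spotabot()` it checks the allowed list first, so the result does not depend on the order of the bot list.

```php
<?php
// Hypothetical helper, not part of the code above.
// Returns true when the user agent matches a known bot that is not allowed.
function is_bad_bot($user_agent, array $bot_list, array $bot_allowed)
{
    // Allowed bots win, regardless of the order of $bot_list.
    foreach ($bot_allowed as $okbot) {
        if (stripos($user_agent, $okbot) !== false) {
            return false;
        }
    }
    // Any other known bot counts as a bad bot.
    foreach ($bot_list as $bot) {
        if (stripos($user_agent, $bot) !== false) {
            return true;
        }
    }
    // Unknown agents are treated as regular visitors.
    return false;
}

// Shortened lists, for illustration only.
$bot_list    = array("Teoma", "Gigabot", "crawler", "Googlebot", "msnbot");
$bot_allowed = array("Googlebot", "msnbot");

$google  = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
$gigabot = "Gigabot/3.0 (http://www.gigablast.com/spider.html)";
$browser = "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0";

var_dump(is_bad_bot($google, $bot_list, $bot_allowed));  // bool(false)
var_dump(is_bad_bot($gigabot, $bot_list, $bot_allowed)); // bool(true)
var_dump(is_bad_bot($browser, $bot_list, $bot_allowed)); // bool(false)
```

Keeping the check free of `$_SERVER` and `define()` makes it easy to feed it strings from the access log before trusting it on the live site.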

In templates and functions I can use some simple code to run stuff conditionally:

    if (defined('IS_A_BAD_BOT')) {
        if (IS_A_BAD_BOT) {
            echo "hi bot<br />";
            run_time_consuming_overhead_tasks();
            and_omit_the_sidebar();
        } else {
            echo "hello wonderful visitor<br />";
        }
    }
    // If it is not defined, it is not a bot or the function ain't present.
    // I am lazy and sloppy and don't want a code-break.

It would be nice if WordPress had a built-in switch to run plugins conditionally.

One related smart plugin is the Chennai Central plugin, which sends 304 Not Modified headers on conditional GETs, so crawlers don’t fetch the page again. That can save some bandwidth and server load.
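
The idea behind a conditional GET can be sketched in a few lines. This is not the plugin's code, only a minimal illustration of the general technique: the function names `is_not_modified` and `send_cache_headers` are hypothetical, and it only looks at the `If-Modified-Since` request header (not at ETags).

```php
<?php
// Hypothetical helper: decides whether a conditional GET can be answered with 304.
// $last_modified is a Unix timestamp, $if_modified_since the raw request header or null.
function is_not_modified($last_modified, $if_modified_since)
{
    if (empty($if_modified_since)) {
        return false; // not a conditional GET
    }
    $since = strtotime($if_modified_since);
    if ($since === false) {
        return false; // unparsable date: send the full page
    }
    // Nothing changed since the client's copy was fetched.
    return $last_modified <= $since;
}

// Hypothetical wrapper: sends the headers and reports whether the body can be skipped.
function send_cache_headers($last_modified)
{
    $header = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? $_SERVER['HTTP_IF_MODIFIED_SINCE'] : null;
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');
    if (is_not_modified($last_modified, $header)) {
        header('HTTP/1.1 304 Not Modified');
        return true;
    }
    return false;
}
```

A page would call `send_cache_headers()` with the time of its last change before printing anything, and stop right there when it returns true.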

