spot the bot

I have some overhead scripts fetching data that can add a few seconds of loading time. Having traffic trigger these tasks saves me the trouble of setting up cron jobs, but I don’t want to run the overhead scripts while visitors or Googlebot are on the site. Apart from that, some routines use a lot of resources, which are wasted on some crawlers.

I actually want the crawlers to come around, so I will make two arrays: bots and allowed bots. Whatever is not on the whitelist gets a meager page with the overhead jobs attached to it; the rest (in other words, visitors and the big search engines) get the standard page.

There are truckloads of bots (see crawltrack); for my purposes a few regulars will do.

    //hook it into 'init', run when calling script
    add_action( 'init', 'spotabot' );

    /**
     * checks if the visitor is a bot
     *
     * This function checks the http_user_agent string
     * to see if the visitor is a non-essential bot,
     * and defines the constant IS_A_BAD_BOT accordingly
     *
     * @param void
     * @return void
     */

    /*
       elsewhere : if(IS_A_BAD_BOT) { … }
    */
    function spotabot()
    {
        $bot_list = array("Teoma", "betaBot", "alexa", "froogle", "Gigabot", "inktomi",
        "looksmart", "URL_Spider_SQL", "Firefly", "NationalDirectory",
        "Ask Jeeves", "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot",
        "crawler", "www.galaxy.com", "Googlebot", "Scooter", "Slurp",
        "msnbot", "appie", "FAST", "WebBug", "Radian6", "Spade", "ZyBorg", "rabaz",
        "Baiduspider", "Feedfetcher-Google", "TechnoratiSnoop", "Rankivabot",
        "Mediapartners-Google", "Sogou web spider", "WebAlta Crawler");

        $bot_allowed = array("Googlebot", "Feedfetcher-Google", "Mediapartners-Google", "Slurp", "Baiduspider", "msnbot");

        foreach($bot_list as $bot) {
            if(stripos($_SERVER['HTTP_USER_AGENT'], $bot) !== false)
            {
                foreach($bot_allowed as $okbot) {
                    if($okbot == $bot) {
                        define("IS_A_BAD_BOT", false);
                        return;
                    }
                }
                //on the bot list, but not on the whitelist
                define("IS_A_BAD_BOT", true);
                return;
            }
        }

        //not a known bot
        define("IS_A_BAD_BOT", false);
    }

In templates and functions I can use some simple code to run stuff conditionally :

    if (defined('IS_A_BAD_BOT')) {
        if (IS_A_BAD_BOT) {
            echo "hi bot<br />";
            run_time_consuming_overhead_tasks();
            and_omit_the_sidebar();
        } else {
            echo "hello wonderful visitor<br />";
        }
    }
    //if it is not defined it is not a bot or the function ain't present,
    //I am lazy and sloppy and don't want a code-break

It would be nice if WordPress had a built-in switch to run plugins conditionally.

one related smart plugin is the chennai central plugin, which sends 304 Not Modified headers on conditional GETs so crawlers don’t re-fetch an unchanged page. That can save some bandwidth and server load.
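The idea behind that plugin can be sketched in a few lines (this is my own sketch, not the plugin’s code): compare the If-Modified-Since request header against the page’s last-modified timestamp and bail out early with a 304.

```php
<?php
// Sketch of the 304-on-conditional-GET idea, not the plugin's actual code.
// Returns true when a 304 was sent and the caller should stop rendering.
function maybe_send_304($last_modified_ts)
{
    if (!isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) return false;

    $client_ts = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
    if ($client_ts !== false && $client_ts >= $last_modified_ts) {
        header('HTTP/1.1 304 Not Modified');
        return true;
    }
    return false;
}

// usage sketch :
// if (maybe_send_304($post_modified_timestamp)) exit;
// header('Last-Modified: '.gmdate('D, d M Y H:i:s', $post_modified_timestamp).' GMT');
```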

hands on xml-rpc : copying mysql tables

I don’t have anything to blog on, so I will bore you all with a quick generic function to copy mysql tables from one host to another, using xml-rpc.

I use the Incutio xml-rpc library on both hosts, to handle the tedious stuff (xml formatting and parsing). That leaves only some snippets to send and receive table data and store it on a mysql database.

First : how to handle the table data on the sending end:

  • I take an associative array from a mysql query
  • I make an array to hold the records
  • I add each row as array
  • I make an IXR-client.
  • I add some general parameters
  • I hand these and the entire table array to my IXR-client.
  • send…
    //the snippet with the client is at the bottom of the post
    $ThisClient = new SerpClient('http://serp.trismegistos.net/db/xmlrpc.php', 'user', 'pass', 'sender');

    $tablename = "serp_tags_keys";
    $tableid = "id";
    $result = $serpdb->query("SELECT * FROM ".$tablename);
    $recordcount = mysql_num_rows($result);

    $records = array();
    while($row = mysql_fetch_assoc($result)) {
        //each row is already an associative array of fieldname => value
        $records[] = $row;
    }

    $ThisClient->putTable($tablename, $recordcount, $tableid, $records);

I consider some additional fields necessary for basic integrity checks : I add “ID” as key field, so on the receiving end the server knows which field is my table’s auto-increment field. Other fields are a username, password, tablename and the batch recordcount.

The IXR_Client then generates a tangled mess of xml-tags holding the entire procedure call and data (you can put the client on ‘debug’, then it dumps the generated xml to the screen).

The first part of the xml file contains the single parameters :

  • username
  • password
  • tablename
  • recordcount
  • id-field

<methodCall>
<methodName>serp.putTable</methodName>
<params>
<param><value><string>user</string></value></param>
<param><value><string>pass</string></value></param>
<param><value><string>serp_tags_keys</string></value></param>
<param><value><int>91</int></value></param>
<param><value><string>id</string></value></param>

Then the entire table is sent as one parameter in the procedure call.

That parameter is built from an array containing the table rows as ‘struct’. If I want to use the routine for any table, I need the fieldname-value pairs to compose a standard mysql insert statement. A struct type allows me to use key-value pairs in the xml-file that can be parsed back into an array.

<param><value><array>

<data>

<value><struct>
<member><name>id</name><value><string>4</string></value></member>
<member><name>tag</name><value><string>ranking</string></value></member>
<member><name>cat</name><value><string>alexa ranking seo internet ranking internet positi</string></value></member>
<member><name>date</name><value><string>200901</string></value></member>
</struct></value>

<value><struct>
<member><name>id</name><value><string>94</string></value></member>
<member><name>tag</name><value><string>firm</string></value></member>
<member><name>cat</name><value><string>firm seo</string></value></member>
<member><name>date</name><value><string>200901</string></value></member>
</struct></value>

</data>

</array></value></param>

That was the last of the param holding the table, so the entire tag-mess is closed :

</params></methodCall>
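For reference, each struct in the block above is just one table row in xml form; the first struct corresponds to an associative array like this on the PHP side:

```php
<?php
// the first <struct> in the xml above, as the PHP row array it came from
$record = array(
    'id'   => '4',
    'tag'  => 'ranking',
    'cat'  => 'alexa ranking seo internet ranking internet positi',
    'date' => '200901',
);
// $records[] = $record;  // one entry per table row
```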

Then the second part : on the receiving end the Incutio class parses the whole tag-mess, and hands an array of the param sections as input to my function putTable.

     function putTable($args)
     {
      $user   = $args[0];
      $pass   = $args[1];
      $tname  = $args[2];
      $tcount = $args[3];
      $id     = $args[4];
      $table  = $args[5];

$table is a straightforward array holding as items the arrays ($t) created from the structs with the fieldname-value pairs. I turn each record’s key-value struct into a mysql INSERT query :
$query = "INSERT INTO `".$tname."` (" field, field… ") VALUES (" fieldvalue, fieldvalue… ")";

All I have to do is add the fieldnames and fieldvalues to the mysql insert query.

      foreach($table as $t) {

    //the fixed parts, reset for each record
        $query0 = 'INSERT INTO `'.$tname.'` (';
        $query1 = '';
        $query2 = ") VALUES (";
        $query3 = '';

    //make the (`fieldname`, `fieldname`, `fieldname`) query-bit
    //and the ('fieldvalue', 'fieldvalue', 'fieldvalue') query-bit :

        foreach($t as $key => $value) {
            if($key != $id) {
                $query1 .= "`".$key."`, ";
                $query3 .= "'".$value."', ";
            }
        }

    //remove the trailing ", "
        $query1 = substr($query1, 0, strlen($query1)-2);
        $query3 = substr($query3, 0, strlen($query3)-2);

    //glue em up and add the final ")"
        $query0 .= $query1.$query2.$query3.")";

    //query…
        $this->connection->query($query0);
       }
      }

that generates mysql queries like
INSERT INTO `serp_tags_keys` (`tag`, `cat`, `date`) VALUES ('ranking', 'alexa ranking', '200901') and copies the entire table.

That is how I handle the table data.
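One caveat with the query building above: the field values are concatenated into the INSERT statement raw, so a quote inside the data breaks (or injects into) the query. A sketch of the same query built with escaping; addslashes stands in here so the function runs standalone, with a live connection mysql_real_escape_string($value) is the better choice:

```php
<?php
// Sketch: build the INSERT from one row array with escaped values.
// addslashes is a stand-in; use mysql_real_escape_string with a live connection.
function build_insert($tname, $id, $t)
{
    $fields = array();
    $values = array();
    foreach ($t as $key => $value) {
        if ($key != $id) {
            $fields[] = '`' . $key . '`';
            $values[] = "'" . addslashes($value) . "'";
        }
    }
    return 'INSERT INTO `' . $tname . '` (' . implode(', ', $fields)
         . ') VALUES (' . implode(', ', $values) . ')';
}
```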

Of course I have to define two custom classes to process the serp.putTable procedure itself, using the Incutio class.

First the class for the sending script, which is pretty straightforward :

  • make an IXR_Client instance
  • hand the record set to it
  • have it formatted and sent
    //include the library
    include('class-IXR.php');

    //make a custom class that uses the IXR_Client
    class SerpClient
    {
        var $rpcurl;    //endpoint
        var $username;  //you go figure
        var $password;
        var $bClient;   //incutio ixr-client instance
        var $myclient;  //machine/host-id

        function SerpClient($rpcurl, $username, $password, $myclient)
        {
            $this->rpcurl = (string) $rpcurl;
            if (!$this->connect()) return false;

            //standard variables to send in the message
            $this->username = (string) $username;
            $this->password = (string) $password;
            $this->myclient = (string) $myclient;
            return $this;
        }

        function connect()
        {
            //basic client, it takes the endpoint url, tests and returns true if it exists
            if($this->bClient = new IXR_Client($this->rpcurl)) return true;
        }

        //the function I use to send the data
        function putTable($tablename, $recordcount, $tableid, $array)
        {
            //first parameter is always the methodname, then the parameters, which are
            //added sequentially to the xml-file (with the appropriate tags for datatypes;
            //the script figures that out). note : it uses htmlentities on strings.
            $this->bClient->query('serp.putTable', $this->username, $this->password, $tablename, $recordcount, $tableid, $array);
        }

    }

I use it in the snippets above with :

    $ThisClient = new SerpClient('http://serp.trismegistos.net/db/xmlrpc.php', 'user', 'pass', 'sender');
    //…
    $ThisClient->putTable($tablename, $recordcount, $tableid, $records);

Then, on the receiving end, my program has to know how to handle the xml containing the remote procedure call.

I define an extension of IXR_Server and pass serp.putTable as a new ‘method’ (callback function).

    //go away cookie…
    $_COOKIE = array();

    //make sure you get the posted crap, the ixr instance grabs its input from it
    if ( !isset( $HTTP_RAW_POST_DATA ) ) $HTTP_RAW_POST_DATA = file_get_contents( 'php://input' );
    if ( isset($HTTP_RAW_POST_DATA) ) $HTTP_RAW_POST_DATA = trim($HTTP_RAW_POST_DATA);

    //include the library
    include('class-IXR.php');

    //make an extended class
    class serp_xmlrpc_server extends IXR_Server {

        function serp_xmlrpc_server() {

        //build an array of methods :
        //first the procedure name you use in the xml-text,
        //then which function in the extended class (this one) it maps to,
        //to be used as $this->method

            $this->methods = array('serp.putTable' => 'this:putTable');

        //hand em to the IXR_Server instance that will map it as callback
            $this->IXR_Server($this->methods);
        }

        //now the IXR_Server instance uses ($this->)putTable
        //to process incoming xml-text
        //containing serp.putTable as methodname

        function putTable($args)
        {
        //(for the routine : see the snippet above to store the xml data in mysql)
        }
    }

    //make the class instance like any regular get-post php program,
    //the only actual program line, that instantiates the extended class,
    //which handles the posted xml

    $serp_xmlrpc_server = new serp_xmlrpc_server();

That’s all. I am not going to list a cut-and-paste version. You have to build some stuff with it, then you will come up with lots of stuff you can do with it.

WordPress built a plugin that receives pictures from the iPhone. WordPress uses Incutio, so you can ‘piggyback’ on that and have an iPhone plugin for your own website in two days flat, using an ajax lightbox gallery script. Or go monetize small websites with some seo oriented ‘optimisation’ functions like ChangeFooterLinks(array($paidurl, $anchortext)) :) or whatever… boring, isn’t it ?

synonymizer with api

If you want to put some old content on the net and have it indexed as fresh unique content, this works wonders for seo-friendly backlinks : the automated synonymizer. I want one that makes my content unique without having to type one character.

Lucky for me, mister John Watson’s synonym database comes with a free 10.000 request a day API and boy is it sweet!

API Requests are straightforward :
http://words.bighugelabs.com/api/2/[apikey]/[keyword]/xml

A number of return formats are supported but xml is easiest, either for parsing with simplexml or regular pattern matching.

For a request on the keyword black it returns (slightly shortened) an xml file like :
<words>
<w p="adjective" r="syn">bleak</w>
<w p="adjective" r="syn">sinister</w>
<w p="adjective" r="sim">dark</w>
<w p="adjective" r="sim">angry</w>
<w p="noun" r="syn">blackness</w>
<w p="noun" r="syn">inkiness</w>
<w p="verb" r="syn">blacken</w>
<w p="verb" r="syn">melanize</w>
</words>

…which is easiest handled with preg_match_all :

    function getsynonyms($keyword) {
        $pick = array();
        $apikey = 'get your own key';
        $xml = file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');

        if(!$xml) return $pick; //return empty array

        preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
        //preg_match_all('/<w p="adjective" r="sim">(.*?)<\/w>/', $xml, $adj_sims);
        //preg_match_all('/<w p="noun" r="syn">(.*?)<\/w>/', $xml, $noun_syns);
        //preg_match_all('/<w p="verb" r="syn">(.*?)<\/w>/', $xml, $verb_syns);

        //index 1 holds the captured words, index 0 the full matches with tags
        foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
        //same for verb/noun synonyms, I just want adjectives

        return $pick;
    }
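The same parse can be done with simplexml instead of pattern matching, reading the p and r attributes directly; a sketch (it takes the xml string, so it is easy to test offline):

```php
<?php
// Sketch: parse the synonym xml with SimpleXML instead of regexes.
// Attribute access on each <w> element replaces the pattern matching.
function getsynonyms_sx($xml)
{
    $pick = array();
    $words = @simplexml_load_string($xml);
    if (!$words) return $pick;

    foreach ($words->w as $w) {
        // keep only adjective synonyms, as in the regex version
        if ((string)$w['p'] == 'adjective' && (string)$w['r'] == 'syn') {
            $pick[] = (string)$w;
        }
    }
    return $pick;
}
```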

practically applying it,
I take a slab of stale old content and…

  • strip tags
  • do a regular match on all alphanumeric sequences dropping other stuff
  • trim the resulting array elements
  • (merge all blog tags, categories, and a list of common words)
  • exclude common terms from the array of text elements
  • exclude words shorter than N characters
  • set a percentage of words to be synonymized
  • attempt to retrieve synonyms for remaining terms
  • replace these words in the original text, keep count
  • when I reach the target replacement percentage, abort
  • return (hopefully) a revived text
    function synonymize($origtext) {

        //make a copy of the original text to dissect
        $content = $origtext;
        //$content = $this->body;

        $perc = 3;         //target percentage changed terms
        $minlength = 4;    //minimum length candidates
        $maxrequests = 80; //max use of api-requests

        $total_switched = 0;

        //dump tags
        $content = strip_tags($content);

        //dump non-alphanumeric string characters
        $content = preg_replace('/[^A-Za-z0-9\-]/', ' ', $content);

        //explode on blank space
        $wrds = explode(' ', strtolower($content));

        //trim off blank spaces just in case
        $words = array();
        for($w=0; $w<count($wrds); $w++) $words[] = trim($wrds[$w]);

        //this should be all words
        $wordcount = count($words);

        //how many words do I want changed ?
        $toswitch = round($wordcount*$perc/100);

        //only use uniques
        $words_unique = array_unique($words);

        //sort the candidates
        sort($words_unique);

        //merge common with tags, categories, linked_tags
        $common = array("never", "about", "price");
        //note : setting the minlength to 4 excludes lots of common terms

        $words_select = array();
        for($i=0; $i<count($words_unique); $i++) {
            //if in the common array, not selectable for synonymizing
            if(!in_array($words_unique[$i], $common)) {
                //only terms bigger than minlength
                if(strlen($words_unique[$i]) > $minlength) {
                    //words_select contains candidates for synonyms
                    $words_select[] = trim($words_unique[$i]);
                }
            }
        }

        //terms that can be changed
        $max = count($words_select);

        //no more requests than max
        if($max > $maxrequests) $max = $maxrequests;

        for($i=0; $i<$max; $i++) {
            //get synonyms, give the server some time
            usleep(100000);
            //retrieve synonyms etc.
            $these_words = getsynonyms($words_select[$i]);
            if(count($these_words) > 0) {
                $count = 0;
                //the replacements are done in the original text
                $origtext = preg_replace('/'.$words_select[$i].'/i', $these_words[0], $origtext, -1, $count);
                $total_switched += $count;
            }
            //have we reached the percentage ?
            if($total_switched >= $toswitch) break;
        }
        //okay!
        return $origtext;
    }

    function getsynonyms($keyword) {
        $pick = array();
        $apikey = 'get your own key at bighugelabs.com';
        $xml = @file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');
        if(!$xml) return $pick;
        preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
        foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
        return $pick;
    }

Nothing fancy, a straightforward search-replace routine. A 1200 word text has about 150 candidates, and for 3% synonyms I need to replace 36 words; it can do that. If I were to use it for real I would build a table for non-returning terms and store often-used terms; that would speed up the synonymizing, allow the use of preferences, and take a load off the api use.
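That caching idea can be sketched as a memoizing wrapper around getsynonyms; here the cache is a static array so the sketch runs standalone, but for real use it would be the mysql table mentioned above (storing empty results too, so non-returning terms don’t burn api requests twice):

```php
<?php
// Sketch: memoize synonym lookups so repeated terms (and terms that
// return nothing) only cost one api request. $fetcher is injectable
// purely so the wrapper can be exercised without hitting the api.
function getsynonyms_cached($keyword, $fetcher = 'getsynonyms')
{
    static $cache = array();
    if (array_key_exists($keyword, $cache)) {
        return $cache[$keyword];  // hit: no api request, even for empty results
    }
    $cache[$keyword] = call_user_func($fetcher, $keyword);
    return $cache[$keyword];
}
```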