spot the bot

I have some overhead scripts fetching data that can cost a few seconds extra loading time. Having traffic trigger tasks saves me the trouble of using cron-jobs, but I don’t want to run overhead scripts with visitors or googlebot on the site. Apart from that, some routines can use a lot of resources which are wasted on some crawlers.

I actually want the crawlers to come around, so I will make two arrays, bot_list and bot_allowed. Whatever bot is not on the white-list gets a meager page with overhead jobs attached to it; the rest (in other words visitors and the big search engines) get the standard page.

There are truckloads of bots (see crawltrack), but for my purposes a few regulars will do.


//hook it into 'init', run when calling script
add_action( 'init', 'spotabot' );

/**
 * checks if visitor is a bot
 *
 * This function checks the http_user_agent string
 * to see if the visitor is a non-essential bot
 * and defines IS_A_BAD_BOT accordingly, for use as :
 *   if(IS_A_BAD_BOT) {}
 *
 * @return void
 */
function spotabot()
{
    $bot_list = array("Teoma", "betaBot", "alexa", "froogle", "Gigabot", "inktomi",
    "looksmart", "URL_Spider_SQL", "Firefly", "NationalDirectory",
    "Ask Jeeves", "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot",
    "crawler", "www.galaxy.com", "Googlebot", "Scooter", "Slurp",
    "msnbot", "appie", "FAST", "WebBug", "Radian6", "Spade", "ZyBorg", "rabaz",
    "Baiduspider", "Feedfetcher-Google", "TechnoratiSnoop", "Rankivabot",
    "Mediapartners-Google", "Sogou web spider", "WebAlta Crawler");

    $bot_allowed = array("Googlebot", "Feedfetcher-Google", "Mediapartners-Google", "Slurp", "Baiduspider", "msnbot");

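    //the user agent header can be missing
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    foreach($bot_list as $bot) {
        //stripos is case-insensitive; !== false also catches a match at position 0
        if(stripos($agent, $bot)!==false)
        {
            //a known bot : bad unless it is on the white-list
            define("IS_A_BAD_BOT", !in_array($bot, $bot_allowed));
            return;
        }
    }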
    
    define("IS_A_BAD_BOT", false);
    return;
}
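
The same decision can also be written as a function that returns a label instead of defining a constant, which makes it easy to try out with a few user agent strings. This is only a sketch: classify_agent() and the shortened lists are made up for illustration and are not part of the plugin code above.

```php
<?php
// Sketch only: classify_agent() and these short lists are illustrative.
function classify_agent($user_agent, array $bot_list, array $bot_allowed)
{
    foreach ($bot_list as $bot) {
        // stripos() is case-insensitive; compare with !== false because
        // a match at position 0 is still a match
        if (stripos($user_agent, $bot) !== false) {
            return in_array($bot, $bot_allowed, true) ? 'allowed_bot' : 'bad_bot';
        }
    }
    // nothing on the bot list matched, so treat it as a regular visitor
    return 'visitor';
}

$bot_list    = array('Googlebot', 'Slurp', 'Gigabot', 'crawler');
$bot_allowed = array('Googlebot', 'Slurp');

echo classify_agent('Mozilla/5.0 (compatible; Googlebot/2.1)', $bot_list, $bot_allowed), "\n";
echo classify_agent('Gigabot/3.0', $bot_list, $bot_allowed), "\n";
echo classify_agent('Mozilla/5.0 (Windows NT 10.0) Firefox/120.0', $bot_list, $bot_allowed), "\n";
```

The first entry of the bot list that matches decides the label, so the order of the list matters when a user agent contains more than one bot name.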

In templates and functions I can use some simple code to run stuff conditionally :

if (defined('IS_A_BAD_BOT')) {
	if(IS_A_BAD_BOT)
	{
		echo "hi bot\n";
		run_time_consuming_overhead_tasks();
		and_omit_the_sidebar();
	} else {
		echo "hello wonderful visitor\n";
	}
}
//if it is not defined it is not a bot or the function ain't present,
//I am lazy and sloppy and don't want a code-break

It would be nice if WordPress built in a switch to run plugins conditionally.

One related smart plugin is the chennai central plugin that sends 304 not modified headers on conditional GETs, so crawlers don’t fetch a page that has not changed. That can save some bandwidth and server load.
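
The idea behind a conditional GET can be sketched in a few lines: the crawler sends an If-Modified-Since header, and if the page has not changed since that date the server answers 304 with no body. The helper is_not_modified() below is hypothetical and is not taken from that plugin; it only shows the date comparison.

```php
<?php
// Sketch only: is_not_modified() is a hypothetical helper.
// $if_modified_since is the raw If-Modified-Since request header (or null),
// $last_modified is the page's last change as a Unix timestamp.
function is_not_modified($if_modified_since, $last_modified)
{
    if ($if_modified_since === null || $if_modified_since === '') {
        return false; // plain GET, send the full page
    }
    $since = strtotime($if_modified_since);
    if ($since === false) {
        return false; // unparsable date, play safe and send the page
    }
    // unchanged when the last change is not newer than the client's copy
    return $last_modified <= $since;
}

$last_modified = gmmktime(12, 0, 0, 1, 15, 2009);
var_dump(is_not_modified('Thu, 15 Jan 2009 12:00:00 GMT', $last_modified));
var_dump(is_not_modified('Thu, 15 Jan 2009 11:00:00 GMT', $last_modified));
var_dump(is_not_modified(null, $last_modified));
```

In a template the header would come from $_SERVER['HTTP_IF_MODIFIED_SINCE'], and on a true result the script would send header('HTTP/1.1 304 Not Modified') and stop before producing any output.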

hands on xml-rpc : copying mysql tables

I don’t have anything to blog on, so I will bore you all with a quick generic function to copy mysql tables from one host to another, using xml-rpc.

I use the Incutio xml-rpc library on both hosts, to handle the tedious stuff (xml formatting and parsing). That leaves only some snippets to send and receive table data and store it on a mysql database.

First : how to handle the table data on the sending end:

  • I take an associative array from a mysql query
  • I make an array to hold the records
  • I add each row as array
  • I make an IXR-client.
  • I add some general parameters
  • I hand these and the entire table array to my IXR-client.
  • send…
//the snippet with the client is at the bottom of the post
$ThisClient = New SerpClient('http://serp.trismegistos.net/db/xmlrpc.php', 'user', 'pass', 'sender');

$tablename = "serp_tags_keys";
$tableid = "id";
$result = $serpdb->query("SELECT * FROM ".$tablename);
$recordcount = mysql_num_rows($result);

$records = array();
while($row=mysql_fetch_assoc($result)) {
	//each row is already an associative array of fieldname => value
	$records[]=$row;
}

$ThisClient->putTable($tablename, $recordcount, $tableid, $records);

I consider some additional fields necessary for basic integrity checks : I add “ID” as key field, so on the receiving end the server knows which field is my table’s auto-increment field. Other fields are a username, password, tablename and the batch recordcount.
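
What such a basic integrity check could look like on the receiving end, as a minimal sketch: check_batch() is a hypothetical helper, not part of the code in this post. It assumes the recordcount, the rows and the name of the key field arrive as described above.

```php
<?php
// Sketch only: check_batch() is a hypothetical helper for the receiving end.
function check_batch($tcount, array $table, $id)
{
    // the number of rows received must match the announced recordcount
    if (count($table) !== (int) $tcount) {
        return false;
    }
    // every row must be an array that carries the key field
    foreach ($table as $row) {
        if (!is_array($row) || !array_key_exists($id, $row)) {
            return false;
        }
    }
    return true;
}

$table = array(
    array('id' => '4',  'tag' => 'ranking'),
    array('id' => '94', 'tag' => 'firm'),
);
var_dump(check_batch(2, $table, 'id'));
var_dump(check_batch(91, $table, 'id'));
```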

The IXR_Client then generates a tangled mess of xml-tags holding the entire procedure call and data. (You can put the client on ‘debug’, then it dumps the generated xml to the screen.)

The first part of the xml file contains the single parameters :

  • username
  • password
  • tablename
  • recordcount
  • id-field

<methodCall>
<methodName>serp.putTable</methodName>
<params>
<param><value><string>user</string></value></param>
<param><value><string>pass</string></value></param>
<param><value><string>serp_tags_keys</string></value></param>
<param><value><int>91</int></value></param>
<param><value><string>id</string></value></param>

Then the entire table is sent as one parameter in the procedure call.

That parameter is built from an array containing the table rows as ‘struct’. If I want to use the routine for any table, I need the fieldname-value pairs to compose a standard mysql insert statement. A struct type allows me to use key-value pairs in the xml-file that can be parsed back into an array.

<param><value><array>

<data>

<value><struct>
<member><name>id</name><value><string>4</string></value></member>
<member><name>tag</name><value><string>ranking</string></value></member>
<member><name>cat</name><value><string>alexa ranking seo internet ranking internet positi</string></value></member>
<member><name>date</name><value><string>200901</string></value></member>
</struct></value>

<value><struct>
<member><name>id</name><value><string>94</string></value></member>
<member><name>tag</name><value><string>firm</string></value></member>
<member><name>cat</name><value><string>firm seo</string></value></member>
<member><name>date</name><value><string>200901</string></value></member>
</struct></value>

</data>

</array></value></param>
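
To make the struct idea concrete, here is a small sketch of how one table row maps to a struct and back into an array. The Incutio library does this work itself; row_to_struct() and struct_to_row() are hypothetical helpers written only for illustration, and the second one assumes the simplexml extension is available.

```php
<?php
// Sketch only: both helpers are hypothetical and mimic the <struct>
// layout shown above.
function row_to_struct(array $row)
{
    $xml = '<value><struct>';
    foreach ($row as $name => $value) {
        // escape special characters so the xml stays well-formed
        $xml .= '<member><name>' . htmlspecialchars($name) . '</name>'
              . '<value><string>' . htmlspecialchars($value) . '</string></value></member>';
    }
    return $xml . '</struct></value>';
}

function struct_to_row($xml)
{
    $row = array();
    $value = simplexml_load_string($xml);
    // each <member> holds one fieldname-value pair
    foreach ($value->struct->member as $member) {
        $row[(string) $member->name] = (string) $member->value->string;
    }
    return $row;
}

$row = array('id' => '94', 'tag' => 'firm', 'cat' => 'firm seo', 'date' => '200901');
$xml = row_to_struct($row);
echo $xml, "\n";
print_r(struct_to_row($xml));
```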

That was the last of the param holding the table, so the entire tag-mess is closed :

</params></methodCall>

Then the second part : on the receiving end the Incutio class parses the whole tag-mess, and hands an array of the param sections as input to my function putTable.

	function putTable($args) 
	{
		$user 	 = $args[0];
		$pass 	 = $args[1];
		$tname 	 = $args[2];
		$tcount	 = $args[3];
		$id 	         = $args[4];	
		$table 	 = $args[5];

$table is a straightforward array whose items are arrays ($t), each created from a struct with the fieldname-value pairs. I turn each record's key-value struct into a mysql INSERT query :
$query = "INSERT INTO `".$tname."` (" field, field… ") VALUES (" fieldvalue, fieldvalue ")";

All I have to do is add the fieldnames and fieldvalues to the mysql insert query.

		foreach($table as $t) {

//the fixed parts, set up fresh for every row
				$query0 = 'INSERT INTO `'.$tname.'` (';
				$query1 = '';
				$query2 = ") VALUES (";
				$query3 = '';

//make the (`fieldname`, `fieldname`, `fieldname`) query-bit 
//and the ('fieldvalue', 'fieldvalue', 'fieldvalue') query-bit :
//note : the values go into the query unescaped, so a quote in the data
//breaks the query. Only use this as-is on data you trust.

				foreach($t as $key=>$value) {
					if($key!=$id) {	
						$query1 .="`".$key."`, ";
						$query3 .="'".$value."', ";
					}
				}

//remove the trailing ", "
				$query1=substr($query1, 0, -2);
				$query3=substr($query3, 0, -2);

//glue em up and add the final ")"
				$query0 .= $query1.$query2.$query3.")";

//query...
				$this->connection->query($query0);
			}	
	}

That generates mysql queries like
INSERT INTO `serp_tags_keys` (`tag`, `cat`, `date`) VALUES ('ranking', 'alexa ranking', '200901')
and copies the entire table.

That is how I handle the table data.
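
If the table data can contain quotes, a variation with placeholders is safer than gluing the values into the query string. The sketch below is not the routine above: build_insert() is a hypothetical helper, and it assumes a driver with prepared statements (PDO or mysqli) is available on the receiving host.

```php
<?php
// Sketch only: build_insert() is a hypothetical helper. It returns an SQL
// string with ? placeholders plus the list of values to bind.
function build_insert($tname, array $row, $id)
{
    // the auto-increment field is left to the receiving database
    unset($row[$id]);

    $fields = array();
    foreach (array_keys($row) as $field) {
        // a backtick inside an identifier is escaped by doubling it
        $fields[] = '`' . str_replace('`', '``', $field) . '`';
    }
    // one ? per remaining field
    $marks = implode(', ', array_fill(0, count($row), '?'));

    $sql = 'INSERT INTO `' . str_replace('`', '``', $tname) . '` ('
         . implode(', ', $fields) . ') VALUES (' . $marks . ')';

    return array($sql, array_values($row));
}

list($sql, $values) = build_insert(
    'serp_tags_keys',
    array('id' => '4', 'tag' => 'ranking', 'cat' => "alexa's ranking", 'date' => '200901'),
    'id'
);
echo $sql, "\n";
print_r($values);
```

With a PDO connection in a variable $pdo (an assumption, the post uses its own connection object) the pair would be run as $pdo->prepare($sql)->execute($values), and the driver takes care of the quoting.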

Of course I have to define two custom classes to process the serp.putTable procedure itself, using the Incutio class.

First the class for the sending script, which is pretty straightforward :

  • make an IXR_Client instance
  • hand the record set to it
  • have it formatted and sent
//include the library
include('class-IXR.php');

//make a custom class that uses the IXR_client
Class SerpClient 
{
	var $rpcurl;         //endpoint
	var $username;   //you go figure
	var $password;
	var $bClient;      //incutio ixr-client instance
	var $myclient;  //machine/host-id
	
	//constructor : store the settings, then set up the client
	//(a constructor cannot return a value, so there is nothing to return here)
	function __construct($rpcurl, $username, $password, $myclient)
	{
		//Standard variables to send in the message
		$this->rpcurl   = (string) $rpcurl;
		$this->username = (string) $username;
		$this->password = (string) $password;
		$this->myclient = (string) $myclient;
		$this->connect();
	}
	
	function connect() 
	{
//basic client, it takes the endpoint url. Creating the object does not
//contact the server yet, so a bad url only shows when a query is sent
		$this->bClient = new IXR_Client($this->rpcurl);
		return true;
	}
	
//the function I use to send the data
		function putTable($tablename, $recordcount, $tableid, $array) 
	{
//first parameter is always the methodname, then the parameters, which are
//added sequentially to the xml-file (with the appropriate tags for datatypes,
//the script figures that out). note : it uses htmlentities on strings.
		$this->bClient->query('serp.putTable', $this->username, $this->password, $tablename, $recordcount, $tableid, $array);
	}

}

I use it in the snippets above with :

$ThisClient = New SerpClient('http://serp.trismegistos.net/db/xmlrpc.php', 'user', 'pass', 'sender');
//...
$ThisClient->putTable($tablename, $recordcount, $tableid, $records);

Then, on the receiving end, my program has to know how to handle the xml containing the remote procedure call.

I define an extension on IXR_server and pass serp.putTable as new ‘method’ (callback function).

//go away cookie...
$_COOKIE = array();

//make sure you get the posted crap, the ixr instance grabs its input from it
if ( !isset( $HTTP_RAW_POST_DATA ) ) $HTTP_RAW_POST_DATA = file_get_contents( 'php://input' );
if ( isset($HTTP_RAW_POST_DATA) ) $HTTP_RAW_POST_DATA = trim($HTTP_RAW_POST_DATA);

//include the library
include('class-IXR.php');

//make an extended class
class serp_xmlrpc_server extends IXR_Server {

//use the same function name...

	function serp_xmlrpc_server() {

//build an array of methods : 
//first the procedurename you use in the xml-text,
//then which function in the extended class (this one) it maps to 
//to be used as $this->method

		$this->methods = array('serp.putTable'	 => 'this:putTable');

//hand em to the IXR server instance that will map it as callback
		$this->IXR_Server($this->methods);
	}

//now IXR_Server instance uses ($this->)putTable 
//to process incoming xml-text 
//containing serp.putTable as methodname

		function putTable($args) 
	{
//(for routine : see the snippet above to store the xml data in mysql)
	}
}

//make the class instance like any regular get-post php program, 
//the only actual program line, that instantiates the extended class,
//which handles the posted xml 

$serp_xmlrpc_server = new serp_xmlrpc_server();

That’s all. I am not going to list a cut-and-paste version. You have to build some stuff with it, then you will come up with lots of stuff you can do with it.

WordPress and iPhone built a plugin that receives pictures from iPhone. WordPress uses Incutio so you can ‘piggyback’ on that and have an iPhone plugin for your own website in two days flat using an ajax lightbox gallery script. Or go monetize small websites with some seo oriented ‘optimisation’ functions like ChangeFooterLinks(array($paidurl, $anchortext)) :) or whatever… boring, isn’t it ?

synonymizer with api

If you want to put some old content on the net and have it indexed as fresh unique content, this works wonders for seo-friendly backlinks : the automated synonymizer. I want one that makes my content unique without having to type one character.

Lucky for me, mister John Watson’s synonym database comes with a free 10.000 requests a day API and boy is it sweet!

API Requests are straightforward :
http://words.bighugelabs.com/api/2/[apikey]/[keyword]/xml

A number of return formats are supported but xml is easiest, either for parsing with simplexml or regular pattern matching.

A request for the keyword black returns an xml file like this (slightly shortened) :
<words>
<w p="adjective" r="syn">bleak</w>
<w p="adjective" r="syn">sinister</w>
<w p="adjective" r="sim">dark</w>
<w p="adjective" r="sim">angry</w>
<w p="noun" r="syn">blackness</w>
<w p="noun" r="syn">inkiness</w>
<w p="verb" r="syn">blacken</w>
<w p="verb" r="syn">melanize</w>
</words>

…which is most easily handled with preg_match_all :

function getsynonyms($keyword) {
        $pick = array(); 
	$apikey = 'get your own key';
	$xml=file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.$keyword.'/xml');

	if(!$xml) return $pick; //return empty array

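	preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
	//preg_match_all('/<w p="adjective" r="sim">(.*?)<\/w>/', $xml, $adj_sims);
	//preg_match_all('/<w p="noun" r="syn">(.*?)<\/w>/', $xml, $noun_syns);
	//preg_match_all('/<w p="verb" r="syn">(.*?)<\/w>/', $xml, $verb_syns);

	//index 1 holds the captured words, index 0 the full matches including tags
	foreach($adj_syns[1] as $adj_syn) $pick[]=$adj_syn;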
        //same for verb/noun synonyms, I just want adjectives

	return $pick;
}
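
For comparison, the simplexml route mentioned above could look roughly like this. It is a sketch under assumptions: pick_words() is a hypothetical helper, the simplexml extension must be available, and $xml is a hard-coded stand-in for the API response, shaped like the sample above.

```php
<?php
// Sketch only: pick_words() is a hypothetical helper that filters the
// <w> elements on their p (part of speech) and r (relation) attributes.
function pick_words($xml, $part, $relation)
{
    $pick = array();
    $words = simplexml_load_string($xml);
    if ($words === false) {
        return $pick; // not parsable, return empty array
    }
    foreach ($words->w as $w) {
        if ((string) $w['p'] === $part && (string) $w['r'] === $relation) {
            $pick[] = (string) $w;
        }
    }
    return $pick;
}

// stand-in for the api response
$xml = '<words>'
     . '<w p="adjective" r="syn">bleak</w>'
     . '<w p="adjective" r="syn">sinister</w>'
     . '<w p="adjective" r="sim">dark</w>'
     . '<w p="noun" r="syn">blackness</w>'
     . '<w p="noun" r="syn">inkiness</w>'
     . '<w p="verb" r="syn">blacken</w>'
     . '</words>';

print_r(pick_words($xml, 'adjective', 'syn'));
print_r(pick_words($xml, 'noun', 'syn'));
```

One function then covers all four combinations that need a separate pattern in the preg_match_all version.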

practically applying it,
I take a slab of stale old content and…

  • strip tags
  • do a regular match on all alphanumeric sequences dropping other stuff
  • trim the resulting array elements
  • (merge all blog tags, categories, and a list of common words)
  • excluding common terms from the array with text elements
  • excluding words smaller than N characters
  • set a percentage of words to be synonymized
  • attempt to retrieve synonyms for remaining terms
  • replace these words in the original text, keep count
  • when I reach the target replacement percentage, abort
  • return (hopefully) a revived text
function synonymize($origtext) {

//make a copy of the original text to dissect
	$content=$origtext;
	//content = $this->body;
	
	$perc=3;			//target percentage changed terms
	$minlength=4;		//minimum length candidates
	$maxrequests=80;	//max use of api-requests


	//dump tags	
	$content =  strip_tags($content);
	
	//dump non-alphanumeric	string characters
	$content = preg_replace('/[^A-Za-z0-9\-]/', ' ', $content);
	
	//explode on blank space
	$wrds = explode(' ', strtolower($content));
	
	//trim off blank spaces just in case
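	for($w=0;$w<count($wrds);$w++) $wrds[$w] = trim($wrds[$w]);

	//number of replacements needed to reach the target percentage
	$toswitch = round(count($wrds) * $perc / 100);
	$total_switched = 0;

	//placeholder : merge blog tags, categories and a list of common words here
	$common = array();

	//every term once
	$words_unique = array_values(array_unique($wrds));
	$words_select = array();

	for($i=0;$i<count($words_unique);$i++) {
		//skip the common terms
		if(!in_array($words_unique[$i], $common)) {
			//skip the short words
			if(strlen($words_unique[$i])>$minlength) {
			//words_select contains candidates for synonyms
				$words_select[] = trim($words_unique[$i]);
			}
		}
	}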
	
	//terms that can be changed
	$max = count($words_select);
	
	//no more requests than max
	if($max>$maxrequests) $max=$maxrequests;
	
	for($i=0;$i< $max;$i++) {
	//get synonyms, give server some time
		usleep(100000);
		//retrieve synonyms etc.
		$these_words = getsynonyms($words_select[$i]);
		$jmax=count($these_words);
		if($jmax<1) {
		//no results
		} else {
$count=0;
			$j=0;
//the replacements are done in the original text
			//\b limits the match to whole words, preg_quote guards special characters
			$origtext= preg_replace('/\b'.preg_quote($words_select[$i], '/').'\b/i', $these_words[$j], $origtext, -1, $count);
			$total_switched+=$count;

		} //have we reached the percentage ? 
		if($total_switched>=$toswitch) break;
	}
	//okay!
	return $origtext;
}

function getsynonyms($keyword) {
	$pick=array	();
	$apikey = 'get your own key at bighugelabs.com';
	$xml=@file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');
	if(!$xml) return $pick;
	preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
	//index 1 holds the captured words
	foreach($adj_syns[1] as $adj_syn) $pick[]=$adj_syn;
	return $pick;
}

Nothing fancy, a straightforward search-replace routine. A 1200 word text has about 150 candidates, and for 3% synonyms I need to replace 36 words; it can do that. If I were to use it for real I would build a table with non-returning terms and store often used terms. That would speed up the synonymizing, allow the use of preferences and take a load off the api use.
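
A rough sketch of that last idea, with the storage kept in memory instead of a database table: SynonymCache is a hypothetical class, and the lookup function handed to it stands in for getsynonyms() so the example runs without an api key.

```php
<?php
// Sketch only: SynonymCache is hypothetical. It wraps any lookup callable
// and remembers both hits and empty results (the non-returning terms).
class SynonymCache
{
    private $lookup;
    private $known = array();   // term => list of synonyms, possibly empty
    public $requests = 0;       // how many times the real lookup was called

    public function __construct($lookup)
    {
        $this->lookup = $lookup;
    }

    public function get($term)
    {
        $term = strtolower($term);
        if (!array_key_exists($term, $this->known)) {
            // unknown term : ask the real lookup once and remember the answer
            $this->requests++;
            $this->known[$term] = call_user_func($this->lookup, $term);
        }
        return $this->known[$term];
    }
}

// stand-in for the api call
$fake_api = function ($term) {
    $data = array('black' => array('bleak', 'sinister'));
    return isset($data[$term]) ? $data[$term] : array();
};

$cache = new SynonymCache($fake_api);
$cache->get('black');
$cache->get('Black');   // served from memory
$cache->get('xyzzy');   // no synonyms, remembered as empty
$cache->get('xyzzy');   // not requested again
echo $cache->requests, "\n";
```

Replacing the $known array with a table keeps the results between runs, which is what actually saves api requests over time.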