curl trackbacks

I figured I’d blog a post on trackback linkbuilding. A trackback is … (post a few and you’ll get it). The trackback protocol isn’t that interesting, but the implementation of it by blog platforms and CMSes makes it an excellent means for network development, because it uses a simple HTTP POST, and cURL makes that easy.
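For reference, the whole exchange is just one form-encoded POST to the target’s ping URL, answered with a tiny XML document (the URL and values here are made up):

POST /2008/05/some-post/trackback/ HTTP/1.1
Content-Type: application/x-www-form-urlencoded

title=My+page&url=http%3A%2F%2Fwww.mydomain.com%2Fpage.html&excerpt=Some+matching+text&blog_name=My+blog

<?xml version="1.0" encoding="utf-8"?>
<response>
<error>0</error>
</response>

An error value of 0 means the trackback was accepted; anything else comes with a message element explaining why not.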

To post a successful link proposal I need some basic data:

about my page

  • url (must exist)
  • blog owner (free)
  • blog name (free)

about the other page

  • url (must exist)
  • excerpt (should be proper normal text)

my page: this is preferably a PHP routine that hacks some text, pictures and videos, PLR or articles together, with a URL rewrite. I prefer using XML text files instead of a database; it works faster when you set stuff up.

other page: don’t use “I liked your article so much…”; use text that matches text on the target pages. Preferably get some proper excerpts from XML feeds like blogsearch, MSN and Yahoo (excerpts contain the keywords I searched for; as anchor text that works better for search engine visibility and link value).

Let’s get some stuff from the MSN RSS feed:

//a generic query = 5% success
//add "(powered by) wordpress" to target WordPress blogs
      $query=urlencode('keywords wordpress trackback');
      $xml = @simplexml_load_file("http://search.live.com/results.aspx?q=$query&count=50&first=1&format=rss");
      $count=0;
      foreach($xml->channel->item as $i) {

           $count++;

//the data from msn
           $target['link'] = (string) $i->link;
           $target['title'] = (string) $i->title;
           $target['excerpt'] = (string) $i->description;

//some variables I'll need later on
           $target['id'] = $count;
           $target['trackback'] = '';
           $target['trackback_success'] = 0;

           $trackbacks[]=$target;
       }

25% of the CMS sites in the top of the search engines are WordPress scripts, and WordPress always uses /trackback/ in the RDF URL (a post at http://example.com/2008/05/some-post/ gets the trackback endpoint http://example.com/2008/05/some-post/trackback/). I get the source of the URLs in the search feed and grab all link URLs in it; if any contains /trackback/, I post a trackback to that URL and see if it sticks.

(I could also spider all links and check if there is an RDF segment in the target’s source (*1), but that takes a lot of time. Another option is to program a curl array and use multicurl, like the sketch after the next block; for my purposes this works fast enough.)

for($t=0;$t<count($trackbacks);$t++) {
//get the source of the candidate page
	$content = @file_get_contents($trackbacks[$t]['link']);
//grab all anchor tags, the href values end up in $matches[1]
	preg_match_all("/<a[^>]*?href[\s]?=[\s\"\']+".
	   "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
	$content, $matches);
	$uri_array = $matches[1];
	foreach($uri_array as $key => $link) { 
	     if(strpos($link, 'rackbac')>0) { 
	        $trackbacks[$t]['trackback'] = $link;
	        break; 
	     }
	}
}
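For reference, the multicurl variant mentioned above could look roughly like this (a sketch, not production code; fetch_all is a name I made up):

//fetch a set of urls in parallel with curl_multi
function fetch_all($urls) {
	$mh = curl_multi_init();
	$handles = array();
	foreach($urls as $key => $url) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		curl_setopt($ch, CURLOPT_TIMEOUT, 3);
		curl_multi_add_handle($mh, $ch);
		$handles[$key] = $ch;
	}
//run all transfers until none is active anymore
	$running = 0;
	do {
		curl_multi_exec($mh, $running);
		usleep(10000);
	} while($running > 0);
//collect the page sources
	$pages = array();
	foreach($handles as $key => $ch) {
		$pages[$key] = curl_multi_getcontent($ch);
		curl_multi_remove_handle($mh, $ch);
		curl_close($ch);
	}
	curl_multi_close($mh);
	return $pages;
}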

When I fire a trackback, the other script will try to assert that my page has a link and matching text, so I have to make sure my page shows the excerpts and links. I stuff all candidates in a cached XML file.

function cache_xml_store($trackbacks, $pagetitle) 
{
//build the xml by hand; the element names match what
//cache_xml_retrieve below expects (entry, id, description, link, title)
	$xml = '<?xml version="1.0" encoding="UTF-8"?>';
	$xml .= '<trackbacks>';
	for($a=0;$a<count($trackbacks);$a++) {
		$arr = $trackbacks[$a];
		$xml .= '<entry>';
		$xml .= '<id>'.$arr['id'].'</id>';
		$xml .= '<description>'.htmlspecialchars($arr['excerpt']).'</description>';
		$xml .= '<link>'.htmlspecialchars($arr['link']).'</link>';
		$xml .= '<title>'.htmlspecialchars($arr['title']).'</title>';
		$xml .= '</entry>';
	}
	$xml .= '</trackbacks>';
	
	$fname = 'cache/trackback'.urlencode($pagetitle).'.xml';
	if(file_exists($fname)) unlink($fname);
	$fhandle = fopen($fname, 'w');
	fwrite($fhandle, $xml);
	fclose($fhandle);
	return;
}
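With the element names used above, a cached file comes out like this (pretty-printed here, values are examples):

<?xml version="1.0" encoding="UTF-8"?>
<trackbacks>
  <entry>
    <id>1</id>
    <description>...the excerpt from the search feed...</description>
    <link>http://www.example-blog.com/some-post/</link>
    <title>Some post title</title>
  </entry>
</trackbacks>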

I use simplexml to read that cached file and show the excerpts and links once the page is requested.

// retrieve the cached xml and return it as array.
function cache_xml_retrieve($pagetitle)
{
	$fname = 'cache/trackback'.urlencode($pagetitle).'.xml';
	if(file_exists($fname)) {
		$xml=@simplexml_load_file($fname);
		if(!$xml) return false;
		$trackbacks = array();
		foreach($xml->entry as $e) {
			$trackback['id'] = (string) $e->id;
			$trackback['link'] = (string) $e->link;
			$trackback['title'] = (string) $e->title;
			$trackback['description'] = (string) $e->description;

			$trackbacks[] = $trackback;
		}
		return $trackbacks;
	} 
	return false;
}

(this setup requires a subdirectory cache set to read/write with chmod 777)

I use http://www.domain.com/financial+trends.html and extract the page title “financial trends”, which has its cached XML file at http://www.domain.com/cache/trackbackfinancial+trends.xml. (In my own script I use SEF URLs with mod_rewrite; you can also use the $_SERVER array.)
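The rewrite rule for that URL scheme could look like this (a hypothetical .htaccess snippet; it assumes an index.php that takes the title as a query parameter, matching the code below):

RewriteEngine On
# map /financial+trends.html to index.php?title=financial+trends
RewriteRule ^([^/]+)\.html$ index.php?title=$1 [L,QSA]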

$pagetitle=preg_replace('/\+/', ' ', htmlentities($_REQUEST['title'], ENT_QUOTES, "UTF-8"));

$cached_excerpts = cache_xml_retrieve($pagetitle);

//do some stuff with it, make it look nice:
for($s=0;$s<count($cached_excerpts);$s++) {
	echo '<h3><a href="'.$cached_excerpts[$s]['link'].'">'.$cached_excerpts[$s]['title'].'</a></h3>';
	echo '<p>'.$cached_excerpts[$s]['description'].'</p>';
}

Now I prepare the data for the trackback post:

for($t=0;$t "url of my page with the link to the target",
 	"title" => "title of my page",
	"blog_name" => "name of my blog",
	"excerpt" => '[...]'.trim(substr($trackbacks[$t]['description'], 0, 150).'[...]'
        );
        //...and try the trackback
        $trackbacks[$t]['trackback_success'] = trackback_ping($trackback_url, $mytrackbackdata);
    }
}

This is the actual trackback post using cURL. cURL has a convenient timeout setting; I use three seconds. If a host does not respond within half a second it’s probably dead, so three seconds is generous.

function trackback_ping($trackback_url, $trackback)
	{

//make a string of the data array to post
	$strout = array();
	foreach($trackback as $key=>$value) $strout[]=$key."=".rawurlencode($value);
	$postfields = implode('&', $strout);
		
//create a curl instance
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $trackback_url);
	curl_setopt($ch, CURLOPT_TIMEOUT, 3);
	curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

//set a custom form header
	curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: application/x-www-form-urlencoded'));

//we need the response body to check the error tag, so don't suppress it
	curl_setopt($ch, CURLOPT_NOBODY, false);

        curl_setopt($ch, CURLOPT_POST, true);
	curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);	
		
	$content = curl_exec($ch);

//if the response has a tag 'error' with value 0 it went flawless
	$success = 0;	
	if(strpos($content, '<error>0</error>')!==false) $success = 1; 
	curl_close ($ch);
	unset($ch);
	return $success;
	}

Now the last routine: rewrite the cached XML file with only the successful trackbacks (SEO stuff):

$store_trackbacks = array();
for($t=0;$t<count($trackbacks);$t++) {
    if($trackbacks[$t]['trackback_success']>0) {
        $store_trackbacks[]=$trackbacks[$t];
    }
}
cache_xml_store($store_trackbacks, $pagetitle);

voilà: a page with only successful trackbacks.

Google (the backrub engine) doesn’t like sites that use automated link-building methods; other engines (Baidu, MSN, Yahoo) use a more conventional link-popularity and keyword-matching algorithm. Trackback linking helps you get a clear engine profile at relatively low cost.

0) for brevity and clarity, the code above is rewritten (taken from a trackback script I am developing on another site); it can contain some typos.

*1) If you want to spider links for RDF segments: TYPO3 v4 has some code for easy retrieval of trackback URIs:

/**
	 * Fetches ping url from the given url
	 *
	 * @param	string	$url	URL to probe for RDF
	 * @return	string	Ping URL
	 */
	protected function getPingURL($url) {
		$pingUrl = '';
		// Get URL content
		$urlContent = t3lib_div::getURL($url);
		if ($urlContent && ($rdfPos = strpos($urlContent, '<rdf:RDF')) !== false) {
			// An RDF segment exists in the document, find where it ends
			if (($endPos = strpos($urlContent, '</rdf:RDF>', $rdfPos)) !== false) {
				// We will use quick regular expression to find ping URL
				$rdfContent = substr($urlContent, $rdfPos, $endPos - $rdfPos);
				if (preg_match('/trackback:ping="([^"]+)"/', $rdfContent, $matches)) {
					$pingUrl = $matches[1];
				}
			}
		}
		return $pingUrl;
	}

proxies!

I got a site banned at Google, so I got pissed, took a script from the blackbox @ digerati marketing to scrape proxy addresses, and wired a database and cURL into it. Now it scrapes proxies, picks one at random, prunes dead proxies, and returns data.

It’s basic and it uses anonymous (level 2) proxies, but it works.


/* (mysql table)
CREATE TABLE IF NOT EXISTS `serp_proxies` (
  `id` int(11) NOT NULL auto_increment,
  `ip` text NOT NULL,
  `port` text NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
*/

//initialize database class, replace with own code
include('init.php');

//main class
$p=new MyProxies;

//do I have proxies in the database ?
//if not, get some and store them
if($p->GetCount() < 1) {
	$p->GetSomeAir(1);
	$p->store2database();
}

//pick one
$p->RandomProxy();

//get the page
$p->ThisProxy->DoRequest('http://www.domain.com/robots.txt');

//error handling
if($p->ThisProxy->ProxyError > 0) {
//7 		no connect
//28 		timed out
//52 		empty reply
//if it is dead, doesn't allow connections : prune it
	if($p->ThisProxy->ProxyError==7) $p->DeleteProxy($p->ThisProxy->proxy_ip);
	if($p->ThisProxy->ProxyError==52) $p->DeleteProxy($p->ThisProxy->proxy_ip);
}
//you could loop back until you get a 0-error proxy, but that ain't the point

//give me the content
echo $p->ThisProxy->Content;
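If you do want to loop back until you get a working proxy (the thing the comment above skips), a minimal sketch using the same class:

//keep drawing random proxies until one answers, with a safety cap
$tries = 0;
do {
	$p->RandomProxy();
	$p->ThisProxy->DoRequest('http://www.domain.com/robots.txt');
	$err = $p->ThisProxy->ProxyError;
//prune dead proxies along the way
	if($err==7 || $err==52) $p->DeleteProxy($p->ThisProxy->proxy_ip);
	$tries++;
} while($err > 0 && $tries < 10 && $p->GetCount() > 0);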


Class MyProxies {

	var $Proxies = array();
	var $ThisProxy;
	var $MyCount;
	

//picks a random proxy from the database
	function RandomProxy() {

		global $serpdb;	
		$offset_result =  $serpdb->query("SELECT FLOOR(RAND() * COUNT(*)) AS `offset` FROM `serp_proxies`");
		$offset_row = mysql_fetch_object($offset_result);
		$offset = $offset_row->offset;
		$result = $serpdb->query("SELECT * FROM `serp_proxies` LIMIT $offset, 1" );
		while($row=mysql_fetch_assoc($result)) {
//make instance of Proxy, with proxy_host ip and port
			$this->ThisProxy = new Proxy($row['ip'].':'.$row['port']);
			$this->ThisProxy->proxy_ip = $row['ip'];
			$this->ThisProxy->proxy_port = $row['port'];
			break;
		}
	}
	
//visit the famous russian site 
	function GetSomeAir($pages) {
			for($index=0; $index< $pages; $index++)
			{
				$pageno = sprintf("%02d",$index+1); 
				$page_url = "http://www.samair.ru/proxy/proxy-" . $pageno . ".htm";
				$page_html = @file_get_contents($page_url);

//get rid of the crap and extract the proxies
//(the samair.ru markup is an assumption here: one ip:port per
// table row; adjust the markers if the page layout changes)
				preg_match("/<tr><td>(.*)<\/td><\/tr>/s", $page_html, $matches);
				$txt = $matches[1];
				$main = split('<tr><td>', $txt);
				for($x=0;$x<count($main);$x++) {
					$arr = split('</td>', $main[$x]);
					$this->Proxies[] = split(':', $arr[0]);
				}
			}
	}

//store the retrieved proxies (stored in this->Proxies) in the database
	function store2database() {
		global $serpdb;
		foreach($this->Proxies as $p) { 
			$result = $serpdb->query("SELECT * FROM serp_proxies WHERE ip='".$p[0]."'");
			if(mysql_num_rows($result)<1) $serpdb->query("INSERT INTO serp_proxies (`ip`, `port`) VALUES ('".$p[0]."', '".$p[1]."')");
		}
		$serpdb->query("DELETE FROM serp_proxies WHERE `ip`=''");
	}


	function DeleteProxy($ip) {
		global $serpdb;
		$serpdb->query("DELETE FROM serp_proxies WHERE `ip`='".$ip."'");			
	}
	
	
	function GetCount() 
	{
//use this to check how many proxies there are in the database
		global $serpdb;
		$this->MyCount = mysql_num_rows($serpdb->query("SELECT * FROM `serp_proxies`"));
		return $this->MyCount; 
	}
	
	
}

Class Proxy {

	var $proxy_ip;
	var $proxy_port;
	
	var $proxy_host;
	var $proxy_auth; 
	var $ch;
	var $Content;
	var $USERAGENT = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
	var $ProxyError = 0;
	var $ProxyErrorMsg = '';
	var $TimeOut=3;
	var $IncludeHeaders = 0;
	
	function Proxy($host, $username='', $pwd='') {
//initialize class, set host 
         $this->proxy_host = $host;
         if (strlen($username) > 0 || strlen($pwd) > 0) {
            $this->proxy_auth = $username.":".$pwd;
         }
      }

	function CURL_PROXY($cc) {
			if (strlen($this->proxy_host) > 0) {
				curl_setopt($cc, CURLOPT_PROXY, $this->proxy_host);
				if (strlen($this->proxy_auth) > 0)
					curl_setopt($cc, CURLOPT_PROXYUSERPWD, $this->proxy_auth);
			}
	}
	
	function DoRequest($url) {
		$this->ch = curl_init();
		curl_setopt($this->ch, CURLOPT_URL,$url);
		$this->CURL_PROXY($this->ch);
		curl_setopt($this->ch, CURLOPT_HEADER, $this->IncludeHeaders); // include headers in the output
		
		curl_setopt($this->ch, CURLOPT_USERAGENT, $this->USERAGENT);
		curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($this->ch, CURLOPT_TIMEOUT, $this->TimeOut);
	    $this->Content = curl_exec($this->ch);

//if an error occurs, store the number and message
		if (curl_errno($this->ch))
			{ 
				$this->ProxyError =  curl_errno($this->ch);
				$this->ProxyErrorMsg =  curl_error($this->ch);
			}
	}

}

There is not much to say about it, just a rough outline. I would prefer elite level 1 proxies but for now it will have to do.

using ajax readystate 3 polling

This one is not news anymore, but anyway: a friend of mine asked about a way to notify the browser of changes on the backend, so that with multiple online users you can notify a user of changes the others make.

One way is a socket daemon, the other is using the ajax readyState 3 ‘polling’ feature (a bit like a Comet server). As the ajax XHR object is a basic HTTP wrapper, it follows the same sequence as a normal HTTP connection. A normal call stays in readyState 3 (receiving data) until the server signals that was the end of it, readyState 4, where you can pick up the returned HTTP status (200, 404, etc.).

Using the PHP flush() command inside a running program, you force output to the browser, which triggers a readyState 3 change in an XHR instance. You can pick up on the triggered readystate change and read the new output in the buffer.
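One caveat: whether flush() actually reaches the browser depends on the server setup, since output buffering and gzip compression sit in between. Something like this at the top of polling.php helps (assuming Apache with mod_php; apache_setenv simply fails silently elsewhere):

//make sure flush() really pushes output to the browser
@apache_setenv('no-gzip', 1);             //disable mod_deflate for this request
@ini_set('zlib.output_compression', 0);   //no zlib compression
while(ob_get_level() > 0) ob_end_flush(); //drop any output buffers
ob_implicit_flush(true);                  //flush on every echo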

The demo is basic and requires four files:

  • queue.txt  (chmod 777)
  • polling.js
  • polling.php
  • polling.html

I put one queue.txt file on the backend with 777 permission so anyone can read and write it.

Then I make a javascript file containing two calls, startclock and stopclock (and makeXmlHttp() to make an xhr instance; a minimal version is shown after the next block). startclock starts an endless loop and outputs the incremental content of the output buffer to a div in the html file (for the demo I echo time(); that way you can make an ajax digital clock):

function startclock()
{
        var index = 0;            
        var xmlHttp = makeXmlHttp();
        
        xmlHttp.onreadystatechange = function()
        {
                if ( xmlHttp.readyState == 3 )
                {
//grab the new part of the output buffer and write it to a div
				var rtlen = xmlHttp.responseText.length;
			        if (index < rtlen) {
			           document.getElementById("seoresult").innerHTML =  xmlHttp.responseText.substring(index);
			           index = rtlen;
			        }
			}
        }
        xmlHttp.open("POST", "polling.php?action=start", true);
        xmlHttp.send('');
}
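makeXmlHttp() isn’t shown above; a minimal version of it (the standard XMLHttpRequest with the old ActiveX fallback for early IE) would be:

function makeXmlHttp()
{
//standard browsers
        if (window.XMLHttpRequest) return new XMLHttpRequest();
//older internet explorer
        return new ActiveXObject("Microsoft.XMLHTTP");
}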

stopclock() just calls a php routine that writes ‘stop’ to queue.txt:

function stopclock()
{
        var xmlHttp = makeXmlHttp();
        xmlHttp.open("POST", "polling.php?action=stop", true);
        xmlHttp.send('');
}


Then the polling.php program file: this contains a routine that runs an endless loop, and three routines for the queue.txt file (write ‘start’, append ‘stop’, and read content). The endless loop reads queue.txt a few times per second; if the word ‘stop’ is in there, the loop ends, the php program ends, and the xhr call ends. Otherwise the endlessloop function outputs the time and flushes the buffer to the browser:

if($_GET['action']=='start') {
	endlessloop();
} else {
        writestop();
}


function endlessloop() {
//truncate the queue, write 'start'
	writestart();

//get the time
	$start=time();

//using while(1) or while(true) you start an endless loop,
//and use break to end it, I tend to also use a timed end,
//to prevent the program from running on endlessly on the
//server if I break the http connection

	while(1) {
//read the file contents
		$the_Text=readsome();

//check if the word 'stop' is in there
//if so, echo a notification, end the program
		if(strpos($the_Text, "stop")!==false) { 
			echo 'clock stopped';
			flush();			
			break;
		}

//after 45 seconds (arbitrary) end the program anyway
		if(time()>($start+45)) {
			echo 'time elapsed';
			flush();			
			break;
		}

//echo the time
		echo time();

//wait for a while
		usleep(100000);

//flush triggers a forced dump of the buffer to the browser
		flush();
	}
}

function writestart() {
//truncate the file, write 'start'
	$fhandle =fopen('queue.txt', 'w');
	fwrite($fhandle, 'start');
	fclose($fhandle);
}
function writestop() {
//write 'stop' 
	$fhandle =fopen('queue.txt', 'a');
	fwrite($fhandle, 'stop');
	fclose($fhandle);
}

function readsome() {
//read the file, return the text contents
	$text = '';
	$fhandle = fopen('queue.txt', 'r');
	while($buffer = fread($fhandle, 1024)) {
		$text .= $buffer;
	}
	fclose($fhandle);
	return $text;	
}

If you start the same polling.html in two browser windows, you’ll notice that stopping one also causes the other to stop. A very basic demo.
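polling.html itself isn’t shown above; a minimal page under the same assumptions (the div id ‘seoresult’ the javascript writes to, plus two buttons) could be:

<html>
<head>
<script type="text/javascript" src="polling.js"></script>
</head>
<body>
<div id="seoresult"></div>
<button onclick="startclock()">start</button>
<button onclick="stopclock()">stop</button>
</body>
</html>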