trackbacks

Trackbacks are brilliant stuff. I programmed a trackback module into the trends script yesterday just to see what it yields. As long as you don’t use it to spam and stick to common standards, it’s the fastest deep link building method available. I noticed another trends script is also using trackbacks.
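For reference: a trackback ping per the common TrackBack specification is just a form-encoded HTTP POST to the target post's trackback endpoint. A minimal sketch, not the actual module from the script; the function name is mine, and discovering the endpoint (usually advertised in an RDF block in the post source) is left out:

//minimal trackback ping per the common TrackBack spec (a sketch, not the module)
function send_trackback($tburl, $url, $title, $excerpt, $blogname) {
	$post = http_build_query(array(
		'url'       => $url,      //the page you want the backlink for
		'title'     => $title,
		'excerpt'   => $excerpt,
		'blog_name' => $blogname,
	));
	$ctx = stream_context_create(array('http' => array(
		'method'  => 'POST',
		'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
		'content' => $post,
	)));
	$response = @file_get_contents($tburl, false, $ctx);
	//the endpoint answers with xml, <error>0</error> means accepted
	return $response !== false && strpos($response, '<error>0</error>') !== false;
}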

GTrends lists an average of 600 different searches per day; that makes about 200K pages a year. If you put five blog excerpts with a link on each page, you have 1000K backlink opportunities a year, automated, if you use trackbacks.

I got a 50% success rate in the first tests, so I put it on a cronjob, and it seems to level out at 30% successful links. That seemed a bit much, so I checked the PingCrawl plugin Eli (bluehatseo) and joshteam put together for WordPress. They claim an 80% success rate using Eli’s result scraper, so I guess 30% is not aberrant.

For trends, I can’t narrow my search down too much. I need the most recent blogs for the trends buzz; too narrow a search might exclude the recent news, and the script would lose its usability. Besides, I figure 10% trackbacks would already be more than enough: a few hundred lines of code with a CSS template for 100K backlinks a year ain’t bad.

I don’t actually have anything to blog about today, so that’s it.

[added 3-3] ****ing brilliant, 65% of trackbacks are accepted, traffic is increasing, bots come crawling, finally something that works. Now add proxies.

[added 3-3] bozo style “the script got 4 uniques yesterday!”

Can I be honest? The dude over at seounderworld gave me a vote of confidence on the trends script, and I felt embarrassed, as the demo looked like shit and didn’t do anything. Fine for scraper basics, but it lacked SEO potential.

So I added some CSS, validated the source, and added caching, gzip, an RSS feed, a sitemap, and the trackback module. It got 300 uniques yesterday and 400 uniques this morning on its first day out, so it performs better now and I don’t feel so embarrassed anymore.

(nice impression of the trends audience by the way)
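The caching and gzip bits amount to very little code, by the way. A rough sketch of the idea, not the exact code in the script; the cache path and the 15-minute lifetime are made up:

//gzip the output if the client accepts it
ob_start('ob_gzhandler');

//serve a cached copy of this url while it is fresh enough
$cachefile = 'cache/'.md5($_SERVER['REQUEST_URI']).'.html';
if (file_exists($cachefile) && time() - filemtime($cachefile) < 900) {
	readfile($cachefile);
	exit;
}

ob_start(); //capture the page while it renders
//...build the page here...
file_put_contents($cachefile, ob_get_contents());
ob_end_flush();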

I’ll add some proxies to prevent bans and some other stuff; once that’s done I’ll refresh the download.
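Since file_get_contents does all the fetching, routing it through a proxy only takes a stream context. A sketch of the idea; the proxy address is just an example:

//route file_get_contents through a proxy, the address is an example
$ctx = stream_context_create(array('http' => array(
	'proxy'           => 'tcp://127.0.0.1:8080',
	'request_fulluri' => true,
)));
$html = file_get_contents('http://www.google.com/trends/hottrends', false, $ctx);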

google trends III

How to get the URLs and snippets from the Google Trends details page. The news articles on the details page are listed with an ‘Ajax’ call; they are not sent to the browser in the html source, so there is no easy way to scrape them.

The blog articles are pretty straightforward. First the ugly, fast way:

$mytitle = 'manuel benitez';
$mydate  = ''; //e.g. 2008-12-24, leave blank for the current day
$html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date='.$mydate.'&sa=X');
//cut the blog section out of the page; the original start and end needles
//were div tags the highlighter ate, so treat these as placeholders and
//adjust them to what the live source shows
$start   = strpos($html, '<div class="gs-title"');
$end     = strpos($html, '</body>');
$content = substr($html, $start, $end - $start);
echo $content;

That returns the blog snippets: ugly. The other way is regular pattern matching: you can grab the divs that wrap each content item, marked with

  • div class="gs-title"
  • div class="gs-relativePublishedDate"
  • div class="gs-snippet"
  • div class="gs-visibleUrl"

from the html source, and organize them into a “Content” array, after which you can list the content items with your own markup or store them in a database.

//I assume $mytitle is taken from the $_GET array.

//class Content with its members
class Content {
	var $id;
	var $title;
	var $pubdate;
	var $snippet;
	var $url;

	public function __construct($id) {
		$this->id = $id;
	}
}

//grab the source from the google page
$html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date=&sa=X');

//cut out the part I want (the start and end needles were div tags the
//highlighter ate; adjust them to the live source)
$start   = strpos($html, '<div class="gs-title"');
$end     = strpos($html, '</body>');
$content = substr($html, $start, $end - $start);

//grab the divs that contain title, publish date, snippet and url
//with a regular pattern match
preg_match_all('!<div class="gs-title">.*?</div>!si', $html, $titles);
preg_match_all('!<div class="gs-relativePublishedDate">.*?</div>!si', $html, $pubDates);
preg_match_all('!<div class="gs-snippet">.*?</div>!si', $html, $snippets);
preg_match_all('!<div class="gs-visibleUrl">.*?</div>!si', $html, $urls);

$Contents = array();

//organize them under Content
$count = 0;
foreach ($titles[0] as $title) {
	//make a new instance of Content and add the title
	$Contents[] = new Content($count);
	$Contents[$count]->title = $title;
	$count++;
}
$count = 0;
foreach ($pubDates[0] as $pubDate) {
	//add publishing date (contains a linebreak, remove it with strip_tags)
	$Contents[$count]->pubdate = strip_tags($pubDate);
	$count++;
}
$count = 0;
foreach ($snippets[0] as $snippet) {
	//add snippet
	$Contents[$count]->snippet = $snippet;
	$count++;
}
$count = 0;
foreach ($urls[0] as $url) {
	//add display url
	$Contents[$count]->url = $url;
	$count++;
}
//leave $count as is: the number of content items in a 0-based array

//add rel=nofollow to links to prevent pagerank assignment to blogs
for ($ct = 0; $ct < $count; $ct++) {
	$Contents[$ct]->url   = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->url);
	$Contents[$ct]->title = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->title);
}

//it's complete, list all content items with some markup
//(the original markup was eaten by the highlighter, this is a plain stand-in)
for ($ct = 0; $ct < $count; $ct++) {
	echo '<div class="item">';
	echo $Contents[$ct]->title;
	echo '<p>'.$Contents[$ct]->pubdate.': '.$Contents[$ct]->snippet.'</p>';
	echo $Contents[$ct]->url;
	echo '</div>';
}

It ain’t perfect, but it works. The highlighter I use gets a bit confused by the preg_match_all statements containing unclosed divs, so copying the code from the blog may not work.
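For the “store them in a database” branch mentioned above, a minimal sketch; the table and column names are made up, adjust them to your own schema:

//store the content items instead of echoing them (illustrative schema)
$db = new PDO('mysql:host=localhost;dbname=trends', 'user', 'pass');
$stmt = $db->prepare('INSERT INTO content (id, title, pubdate, snippet, url) VALUES (?, ?, ?, ?, ?)');
for ($ct = 0; $ct < $count; $ct++) {
	$stmt->execute(array(
		$Contents[$ct]->id,
		$Contents[$ct]->title,
		$Contents[$ct]->pubdate,
		$Contents[$ct]->snippet,
		$Contents[$ct]->url,
	));
}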

google trends II

I wanted to reply to a question elsewhere on the site, but a ‘comment’ box isn’t fit for it, so I’ll put the reply here. The question was about creating ‘search engine friendly’ descriptive URLs based on keywords from the Google Trends atom feed, for pages listing a graph of the trend.

You can get a site to serve http://domain.com/trend_title.html type URLs by using mod_rewrite, an Apache module.

In the server directory of the application you can use an .htaccess file to set rules for file access in that folder. When the server gets a request from a browser or another server, it applies any rewrite rules you define in .htaccess to that request.

I tried this one:


	RewriteEngine On
	RewriteCond %{REQUEST_FILENAME} !-f
	RewriteCond %{REQUEST_FILENAME} !-d
	RewriteRule ^(.*)\.html$ /trendinfo.php?title=$1

RewriteEngine On
sets the rewrite mechanism on

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

tell the Apache server that the rewrite rule only applies to requests that do not match an existing file (!-f) or directory (!-d). If the requested filename is anywhere in the server’s file table, the server dishes out that file; otherwise it tries to apply a RewriteRule. Applying the rule generates a new request; if that returns anything, the server dishes that out, otherwise it returns an HTTP 404 ‘file not found’.

The actual URL rewrite rule is:
RewriteRule ^(.*)\.html$ /trendinfo.php?title=$1
which means:

  • if a filename is requested that satisfies the mask ^(.*)\.html$ then
  • take everything before .html
  • pass that as variable $1 to trendinfo.php?title=$1
  • see if it sticks

If the browser requests http://domain.com/bob+bowersox.html, the server asserts it is not a file or directory on the server and tests the available rules. When it notices that the requested filename ends with .html, it applies the rewrite rule and tries to access http://domain.com/trendinfo.php?title=bob+bowersox.

A browsing user does not notice a thing.

In trendinfo.php I wrote some code to handle the ‘new’ request:

if (!isset($_REQUEST['title'])) {
        //if there is no $1 added as title, fake a 404 "file not found" message
        header("HTTP/1.0 404 Not Found");
        echo 'the emptiness...';
} else {
        //get the title from the request
        $mytitle = htmlentities($_REQUEST['title'], ENT_QUOTES, "UTF-8");
        //put the google trends graph url together
        $graphurl  = 'http://www.google.com/trends/viz?hl=&q=';
        $graphurl .= urlencode($mytitle);
        $graphurl .= '&date=';                        //leave date blank to get the current graph
        $graphurl .= '&graph=hot_img&sa=X';
        //embed the graph (the original img tag was eaten by the highlighter)
        echo '<img src="'.$graphurl.'" alt="'.$mytitle.'" />';
}

…that outputs the Google Trends graph at the URL http://domain.com/bob+bowersox.html.

You can also put this in index.php:

$feed = simplexml_load_file('http://www.google.com/trends/hottrends/atom/hourly');
$children = $feed->children('http://www.w3.org/2005/Atom');
$parts = $children->entry;
foreach ($parts as $entry) {
	$details = $entry->children('http://www.w3.org/2005/Atom');
	$dom = new DOMDocument();
	$html = (string)$details->content;
	@$dom->loadHTML($html);
	$anchors = $dom->getElementsByTagName('a');
	foreach ($anchors as $anchor) {
		$url = $anchor->getAttribute('href'); //the google link, unused here
		$urltext = $anchor->nodeValue;
		//link each trend title to a local descriptive url that the
		//.htaccess rule picks up (the anchor markup was eaten by the
		//highlighter, this reconstruction follows the rewrite setup)
		echo '<a href="/'.urlencode($urltext).'.html">'.$urltext.'</a> ';
	}
}
unset($dom);
unset($anchors);
unset($parts);
unset($feed);

That lists the current 100 Google trends, each with a link. If you use the .htaccess rewrite rules, the server reroutes all those links to trendinfo.php via the descriptive URLs.

I hope that helps.