parsing the google trends atom feed

Short and sweet: how to grab the Google Trends Atom feed and parse out the links, in 15 lines.

$feed = simplexml_load_file('http://www.google.com/trends/hottrends/atom/hourly');
$children =  $feed->children('http://www.w3.org/2005/Atom');
$parts = $children->entry;
foreach ($parts as $entry) {
	$details = $entry->children('http://www.w3.org/2005/Atom');
	$dom = new DOMDocument();
	// the entry content is an HTML fragment; @ silences the parser warnings
	@$dom->loadHTML($details->content);
	$anchors = $dom->getElementsByTagName('a');
	foreach ($anchors as $anchor) {
		$url = $anchor->getAttribute('href');
		$urltext = $anchor->nodeValue;
		echo 'Link: ' . $urltext . ' (' . $url . ') ';
	}
}

Requires PHP with the SimpleXML and DOM extensions. You could use it for a blogfarm script, but that’s about all I can think of.

edited 18-12-08 :
$dom->loadHTML((string) $details->content);
to
@$dom->loadHTML($details->content);
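
For trying the parsing without hitting the live feed, here is a minimal sketch of the same idea as a function. The sample feed is made up (I am only assuming each entry carries its links as escaped HTML in the content element), and extract_links is a hypothetical name, not part of the script above.

```php
<?php
// Hypothetical sample standing in for the hourly feed; the real markup may differ.
$sample = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Hot Trends</title>
    <content type="html">&lt;ol&gt;&lt;li&gt;&lt;a href="http://example.com/one"&gt;first trend&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://example.com/two"&gt;second trend&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;</content>
  </entry>
</feed>
XML;

// Returns an array of array('url' => ..., 'text' => ...) for every anchor found.
function extract_links($atomXml) {
    $links = array();
    $feed = simplexml_load_string($atomXml);
    if ($feed === false) {
        return $links;
    }
    foreach ($feed->children('http://www.w3.org/2005/Atom')->entry as $entry) {
        $details = $entry->children('http://www.w3.org/2005/Atom');
        $html = (string) $details->content;
        if ($html === '') {
            continue; // nothing to parse in this entry
        }
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // HTML fragment, so warnings are silenced
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $links[] = array(
                'url'  => $anchor->getAttribute('href'),
                'text' => $anchor->nodeValue,
            );
        }
    }
    return $links;
}

foreach (extract_links($sample) as $link) {
    echo 'Link: ' . $link['text'] . ' (' . $link['url'] . ")\n";
}
```

Returning an array instead of echoing makes it easier to feed the links into something else later.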

zend php and google webmaster api II : wordpress mu auto-register

Part Deux of automating the registration and verification of a wordpress blog. In the previous post I showed how to add a site to google webmaster tools.

Which site you ask ? Oh dear… in the previous post I did not mention how to create a new blog in wpmu :

include_once('wp-config.php');
include_once('wp-includes/wp-db.php');
include_once('wp-includes/wpmu-functions.php');
$newblogid= wpmu_create_blog('tryout.blacknorati.com', '/', 'tryout', 1);

Very basic, assuming I am the admin user (with ID=1). After creating the blog, I post its URL to Google Webmaster Tools to start the registration. Then I want to

  • verify the site
  • add a sitemap
  • and blog on!

verifying a site

I can add any url to Google Webmaster Tools, but I only get to use the tools once Google are sure I ‘own’ the domain or subdomain. Verification is done by checking on the presence of a header metatag in the index file, or a specific file on the server. Once Google spots it, Google know I control the site and I can use the webmaster tools.

On a WordPress Mu install I do not, as user, get to have my own template. I currently have 100 standard templates installed to choose from, some with options and widgets and that should be enough. But editing the template itself is not possible for separate users, so I cannot verify sites with a header metatag.

The alternative is putting a file on the server with a particular codename, but users don’t have an actual separate subdomain with a wordpress Mu install, so that one also won’t work.

Eek ! Well, no problem, Google also accept a post with the filename in the url. Just blog a post with the google___.html filename as title, WordPress automatically turns the title into the url and you can use that post to have Google verify the site is yours.

getting the verification filename

A Google Webmaster Tools account has its own standard verification code and it’s valid for every site. Once a user has registered the site with GWT, I can retrieve that code from the site’s data feed :

function get_verification_title($domain, $client) {
	$myfeed = get_site($domain, $client);
	foreach ($myfeed as $item) {
		// both wt:verification-method tags: [0] metatag, [1] html file
		$subjects = $item->{"wt:verification-method"};
		if (is_array($subjects) and count($subjects) > 1) {
			return $subjects[1];
		}
	}
}

function get_site($domain, $client) {
	$fdata = new Zend_Gdata($client);
	// the site's own url, url-encoded, is the id of its entry in the sites feed
	$tgt = "https://www.google.com/webmasters/tools/feeds/sites/".htmlentities(urlencode('http://'.$domain.'/'));
	$result = $fdata->getFeed($tgt);
	return $result;
}
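
As a standalone illustration of picking the html-file method out of such an entry, here is a rough sketch that uses plain DOM instead of the Zend feed classes. The sample XML, the namespace URI for the wt: prefix and the type="htmlpage" attribute are my assumptions about what the feed looks like, and find_verification_file is a hypothetical helper.

```php
<?php
// Namespace URI assumed for the wt: prefix; the post does not show it.
const WT_NS = 'http://schemas.google.com/webmasters/tools/2007';

// Hypothetical sample of a sites-feed entry, for illustration only.
$sample = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:wt="http://schemas.google.com/webmasters/tools/2007">
  <entry>
    <id>http://tryout.blacknorati.com/</id>
    <wt:verification-method type="metatag" in-use="false">meta-tag-placeholder</wt:verification-method>
    <wt:verification-method type="htmlpage" in-use="false">google12345.html</wt:verification-method>
  </entry>
</feed>
XML;

// Returns the html file name, or null when the feed has no html-page method.
function find_verification_file($atomXml) {
    $dom = new DOMDocument();
    if (!@$dom->loadXML($atomXml)) {
        return null;
    }
    foreach ($dom->getElementsByTagNameNS(WT_NS, 'verification-method') as $node) {
        // assumed attribute: type="htmlpage" marks the html-file method
        if ($node->getAttribute('type') === 'htmlpage') {
            return trim($node->textContent);
        }
    }
    return null;
}

echo find_verification_file($sample), "\n";
```

Selecting on the type attribute instead of position [1] keeps working if the order of the tags ever changes.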

With the get_site function I retrieve the site’s Atom entry as a Zend feed. The feed contains two wt:verification-method tags, one for the metatag and one for the HTML file. get_verification_title loads both into the $subjects array and I pick item [1] (it’s a 0-based array), the HTML file name. I need that one to post on the new blog. Here is a PHP routine taken from Snipplr.

function add_verify_post($domain, $verification, $logon, $pass) {
	$category = '';
	// urlencode every value so special characters survive the form post
	$req = 'title=' . urlencode($verification) . '&content=' . urlencode($verification) . '&category=' . urlencode($category) . '&logon=' . urlencode($logon) . '&pass=' . urlencode($pass);
	$header  = "POST /remote_post.php HTTP/1.0\r\n";
	$header .= "Host: " . $domain . "\r\n";
	$header .= "Content-Type: application/x-www-form-urlencoded\r\n";
	$header .= "Content-Length: " . strlen($req) . "\r\n";
	$header .= "Connection: Close\r\n\r\n";
	$fp = fsockopen($domain, 80, $errno, $errstr, 30);
	$SUCCESS = false;
	$last_line = '';

	if (!$fp) {
		// connection failed, so there is nothing to read or close
		$last_line = "FAILED: $errstr ($errno)";
	}
	else {
		fputs($fp, $header . $req);
		while (!feof($fp) && $SUCCESS == false) {
			$res = fgets($fp, 1024);
			// fgets keeps the line ending, so trim before comparing
			if ($res !== false && strcmp(trim($res), "SUCCESS") == 0) {
				$SUCCESS = true;
			}
			if (!empty($res)) {
				$last_line = $res;
			}
		}
		fclose($fp);
	}

	if ($SUCCESS != true) {
		echo $last_line;
	}
	return $SUCCESS;
}

The remote_post.php code is the same as the snippet.

I am the owner of the blog so I can use the standard admin login and password in the function. For security purposes I’d use a different login and password for remote access though (this one does not use SSL).

With a simple call I send one new post to the new blog with the google verification file name as title.

add_verify_post('BlogSubdomain.blacknorati.com', 'google12345.html', 'MyLogin', 'MyPassword');
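
To see what actually goes over the wire, here is a small sketch that only builds the request text and does not open a socket. build_remote_post is a hypothetical helper and the password is invented to show the encoding. I use http_build_query here, which URL-encodes every value.

```php
<?php
// Builds the raw HTTP request for remote_post.php without sending it.
function build_remote_post($domain, array $fields) {
    // http_build_query URL-encodes every key and value
    $body = http_build_query($fields, '', '&');
    $header  = "POST /remote_post.php HTTP/1.0\r\n";
    $header .= "Host: " . $domain . "\r\n";
    $header .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $header .= "Content-Length: " . strlen($body) . "\r\n";
    $header .= "Connection: Close\r\n\r\n";
    return $header . $body;
}

$request = build_remote_post('BlogSubdomain.blacknorati.com', array(
    'title'    => 'google12345.html',
    'content'  => 'google12345.html',
    'category' => '',
    'logon'    => 'MyLogin',
    'pass'     => 'My&Pass=word', // made-up password with awkward characters
));
echo $request, "\n";
```

Without the encoding, a password containing & or = would be cut into separate form fields on the receiving side.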

I had some doubts about Google accepting blog.blacknorati.com/year/month/google12345html, but they actually accept it, so I don’t have to adapt the permalink settings.

Now I have to send Google a ‘verify’ XML message :

function verify_site($domain, $client) {
	//domain without http
	$xml='
 		  http://'.$domain.'';
	$xml.="";
  	$xml.='
		   ';
		$fdata = new Zend_Gdata($client);
		$result=$fdata->post($xml,"https://www.google.com/webmasters/tools/feeds/sites/".urlencode('http://'.$domain)."/");
		return $result;
}
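
For reference, a rough sketch of how such a ‘verify’ Atom entry could be put together with DOM. Everything about the payload here is an assumption on my part (the namespace URIs, the category scheme and term, the type and in-use attributes), pieced together from how I remember the API documentation, and build_verify_entry is a hypothetical helper. Treat it as an illustration, not as the exact message.

```php
<?php
const ATOM_NS = 'http://www.w3.org/2005/Atom';
// Assumed namespace for the webmaster tools (wt:) elements.
const WT_NS   = 'http://schemas.google.com/webmasters/tools/2007';

// Returns an Atom entry that asks for verification by html page (assumed payload).
function build_verify_entry($domain, $method = 'htmlpage') {
    $doc   = new DOMDocument('1.0', 'UTF-8');
    $entry = $doc->createElementNS(ATOM_NS, 'atom:entry');
    $doc->appendChild($entry);

    // the site url identifies the entry
    $id = $doc->createElementNS(ATOM_NS, 'atom:id');
    $id->appendChild($doc->createTextNode('http://' . $domain));
    $entry->appendChild($id);

    // assumed category marking the entry as site info
    $category = $doc->createElementNS(ATOM_NS, 'atom:category');
    $category->setAttribute('scheme', 'http://schemas.google.com/g/2005#kind');
    $category->setAttribute('term', WT_NS . '#site-info');
    $entry->appendChild($category);

    // assumed element that switches the chosen verification method on
    $verify = $doc->createElementNS(WT_NS, 'wt:verification-method');
    $verify->setAttribute('type', $method);
    $verify->setAttribute('in-use', 'true');
    $entry->appendChild($verify);

    return $doc->saveXML($entry);
}

echo build_verify_entry('tryout.blacknorati.com'), "\n";
```

Building the entry with DOM rather than string concatenation takes care of escaping and namespace declarations.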

Presto: now Google know I control the site, and I can use the webmaster tools. That means I can add the sitemap, and that in turn means my sites are indexed a lot faster.

function add_webmap($domain, $sitemap, $client) {
	//domain without http
	$xml='
 		  http://'.$sitemap.'';
    $xml.="
  		  WEB
		  ";

		$fdata = new Zend_Gdata($client);
		$myaddress= "https://www.google.com/webmasters/tools/feeds/".htmlentities(urlencode('http://'.$domain.'/'), ENT_QUOTES)."/sitemaps/";
		$result=$fdata->post($xml,$myaddress);
		return $result;
}
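
The address of the sitemaps feed is the only fiddly bit: the site’s own URL is url-encoded and used as a path segment. A minimal sketch of just that step (sitemaps_feed_url is a hypothetical helper name):

```php
<?php
// Builds the sitemaps feed address for one site.
function sitemaps_feed_url($domain) {
    $siteId = 'http://' . $domain . '/';
    // urlencode turns ':' and '/' into %3A and %2F so the url fits in one path segment
    return 'https://www.google.com/webmasters/tools/feeds/' . urlencode($siteId) . '/sitemaps/';
}

echo sitemaps_feed_url('tryout.blacknorati.com'), "\n";
// prints https://www.google.com/webmasters/tools/feeds/http%3A%2F%2Ftryout.blacknorati.com%2F/sitemaps/
```

The extra htmlentities() call in add_webmap does no harm, but it changes nothing either: after urlencode() there are no characters left for it to convert.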

Happy now. Google Webmaster Tools API was top of my wish-list. Now I can register and verify 32.000 sites with sitemaps automatically, so that saves me at least 2500 hours of work. And it was actually easier than I thought, with the proper examples and snippets available online.

I am going to clean up the code a bit and stuff it in a class, and move on to developing large scale ‘grey’ ops :)

zend php and google webmaster tools api

Update 2: Sandrine worked out a set of routines, as far as I know using Zend 1.7; she lists the code here.

Update: Google updated their API in October (almost at the time I wrote these posts) and this code fails as it is still based on the V1 API. You can now access the whole WT: toolset namespace (including sitemaps and verification) through the V2 API, but you need to send a version ID along with your request; that is handled in the new Zend 1.7 download.

The Problem

I can add 32.000 blogs on a standard WordPressMu install. How do I add 32.000 subdomains, verify them and add their sitemaps to Google Webmaster, without having to go to the webmaster page about 96.000 times ?

The solution

Integrating Google Webmaster Tools API into my WordPress Mu install.

What is it worth ?

If registering a site, verifying it and adding a sitemap take 5 minutes each, that is 15 minutes per domain, or 8.000 hours for 32.000 sites. At E12,- per hour that makes 96.000 euros and 4 labor years. Writing a script is worth E96.000,- and saves me four years of mindless drone work, so that is well worth having a look at.
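
The sum, spelled out. The 96.000 euros only come out if each of the three steps (register, verify, add sitemap) takes five minutes, so that is what I assume here, along with roughly 2000 working hours in a labor year.

```php
<?php
$sites          = 32000;
$stepsPerSite   = 3;    // register, verify, add sitemap
$minutesPerStep = 5;    // assumption: five minutes for each step
$ratePerHour    = 12;   // euros
$hoursPerYear   = 2000; // assumption: roughly one labor year

$hours = $sites * $stepsPerSite * $minutesPerStep / 60;
$cost  = $hours * $ratePerHour;
$years = $hours / $hoursPerYear;

printf("%d hours, E%d,- and %.1f labor years\n", $hours, $cost, $years);
```

That works out to 8.000 hours, E96.000,- and 4 labor years.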

Software : Zend

Zend gData is a PHP framework that is programmed to handle Google Data. Their ClientLogin routine isn’t very flexible and they haven’t covered the GWT API yet, so I’ll have to hack some routines together.

After getting stonewalled by the zend program a few times, I went searching and ended up on ngoprekweb who have a nice post on ClientLogin authorization for the blogger api. Eris Ristemena uses a modified Zend ClientLogin, very nice work. I installed the adapted classes and tried that one to get through the ClientLogin, and it paid off.

The good stuff : Gwt api access

I am not interested in the Blogger stuff though, I want access to GWT (Google Webmaster Tools), so I reworked Eris Ristemena’s Blogger routine a little.

set_include_path(dirname(__FILE__) . '/Zend_Gdata');
  require_once 'Zend.php';
  Zend::loadClass('Zend_Gdata_ClientLogin');
  Zend::loadClass('Zend_Gdata');
  Zend::loadClass('Zend_Feed');

  $username     = '';
  $password     = '';
  $service      = 'sitemaps';
  $source       = 'Zend_ZendFramework-0.1.1'; // companyName-applicationName-versionID
  $logintoken   = $_POST['captchatoken'];
  $logincaptcha = $_POST['captchaanswer'];

  try {
    $resp = Zend_Gdata_ClientLogin::getClientLoginAuth($username,$password,$service,$source,$logintoken,$logincaptcha);

    if ( $resp['response']=='authorized' )
    {
      $client = Zend_Gdata_ClientLogin::getHttpClient($resp['auth']);
      $gdata = new Zend_Gdata($client);

	  $feed = $gdata->getFeed("https://www.google.com/webmasters/tools/feeds/sites/");
      foreach ($feed as $item) {
        echo '';
      }
    }
    elseif ( $resp['response']=='captcha' )
    {
      echo 'Google requires you to solve this CAPTCHA image';
      echo '';
      echo '';
      echo 'Answer : ';
      echo ' ';
      echo ' ';
      echo '';
      exit;
    }
    else
    {
      // there is no way you can go here, some exceptions must have been thrown
    }

  } catch ( Exception $e )  {
    echo $e->getMessage();
  }

(I added https://www.google.com/accounts/ to the captcha image source, otherwise it keeps drawing blanks.)
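
Why the prefix is needed: as far as I understand ClientLogin, a failed login comes back as plain key=value lines, and the CaptchaUrl value is relative. A minimal sketch of parsing such a reply and building the full image URL. The sample reply and its token values are made up, and both function names are hypothetical.

```php
<?php
// Splits a ClientLogin-style reply into an array, one key=value pair per line.
function parse_clientlogin_body($body) {
    $fields = array();
    foreach (preg_split('/\r\n|\r|\n/', trim($body)) as $line) {
        $pos = strpos($line, '=');
        if ($pos === false) {
            continue; // not a key=value line
        }
        // split on the first '=' only, because tokens can contain '=' themselves
        $fields[substr($line, 0, $pos)] = substr($line, $pos + 1);
    }
    return $fields;
}

// Returns the absolute captcha image url, or null when no captcha was asked for.
function captcha_image_url(array $fields) {
    if (!isset($fields['CaptchaUrl'])) {
        return null;
    }
    return 'https://www.google.com/accounts/' . $fields['CaptchaUrl'];
}

// Made-up reply, for illustration only.
$reply = "Error=CaptchaRequired\nCaptchaToken=tok=en123\nCaptchaUrl=Captcha?ctoken=abc123\n";
$fields = parse_clientlogin_body($reply);
echo captcha_image_url($fields), "\n";
```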

Zend uses an “HttpClient” for the connection to Google, and a gData class (usually for the main ‘feed’: blogs, sites) that you use to do basic data manipulation. All feed entries are in Atom format with a custom namespace.

Now I am going to add a domain. In my add_site function I put an Atom XML entry together and post it (using the post() function of the gData class) to the sites feed URL, and the Google API does the rest :

function add_site($domain, $client) {
		$xml='';
		$xml.='';
		$xml.='';
		$fdata = new Zend_Gdata($client);
		$result=$fdata->post($xml,"https://www.google.com/webmasters/tools/feeds/sites/");
		return $result;
}
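
A rough sketch of what such a three-line entry could look like. The element names (an atom:entry holding an atom:content whose src attribute is the site URL) are my assumption of the payload, and build_add_site_entry is a hypothetical helper.

```php
<?php
// Returns a minimal Atom entry for adding a site (assumed payload shape).
function build_add_site_entry($domain) {
    // escape the url so it is safe inside an XML attribute
    $src  = htmlspecialchars('http://' . $domain . '/', ENT_QUOTES, 'UTF-8');
    $xml  = '<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">';
    $xml .= '<atom:content src="' . $src . '" />';
    $xml .= '</atom:entry>';
    return $xml;
}

echo build_add_site_entry('test.blacknorati.com'), "\n";
```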

In the main routine I pass the domain and the running httpclient to the add_site() function :

   if ( $resp['response']=='authorized' )
    {
      $client = Zend_Gdata_ClientLogin::getHttpClient($resp['auth']);
      echo add_site('test.blacknorati.com', $client);
    }

Cool. That saves me up to 32.000 site registrations. The rest of it is still Greek to me, but this part functions. Next week : more nonsense (verify the site, add a sitemap, and integrate it in the blog creation function of WordPress Mu).

1) About the Blogger function : I tried to list the Blogger posts with the ngoprekweb PHP code, but it seems Blogger use a different string these days to identify the blog in gData. The id is returned as “tag:blogger.com-blabla-(blogid)” and you want the last part to access the blog’s post Atom feed :

	$idText = explode('-', $item->id());
        $blogid = $idText[2];

(modified from the Zend 1.6.1 codebase)

      foreach ($feed as $item) {
        echo '' . $item->title() . '';

        // the blog id is the last dash-separated part of the entry id
        $idText = explode('-', $item->id());
        $blogid = $idText[2];

        $feed1 = $gdata->getFeed("http://www.blogger.com/feeds/$blogid/posts/summary");
        //...
      }
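
A quick standalone check of the id splitting. The sample id is made up and only follows the “tag:blogger.com-blabla-(blogid)” shape described above; blogger_blog_id is a hypothetical helper that takes the last part instead of a fixed index.

```php
<?php
// Returns the part after the last dash of a Blogger entry id.
function blogger_blog_id($atomId) {
    $parts = explode('-', $atomId);
    return end($parts);
}

// Made-up id, for illustration only.
$sampleId = 'tag:blogger.com,1999:user-12345.blog-67890';
$blogid   = blogger_blog_id($sampleId);
echo "http://www.blogger.com/feeds/$blogid/posts/summary\n";
```

Taking the last element rather than index 2 still works if the middle part ever contains another dash.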