
google suggest scraper (php & simplexml)

Today’s goal is a basic php Google Suggest scraper because I wanted traffic data and keywords for free.

Before we start :

google scraping is bad !

Good people use the Google Adwords API: 25 cents for 1000 units, 15+ units per keyword suggestion request, so they pay 4 or 5 dollars for 1000 keyword suggestions (if they can find a good programmer, which also costs a few dollars). Or they opt for SemRush (also my preference), KeywordSpy, Spyfu, or other services like 7Search PPC programs to get keyword and traffic data and data on their competitors, but these charge about 80 dollars per month for a limited account, up to a few hundred per month for SEO companies. Good people pay plenty.

We tiny grey webmice of marketing, however, just want a few estimates, at low or (better) no cost. Like this:

query                           num_queries
google suggest                     57800000
google suggestion box               5390000
google suggest api                  5030000
google suggestion tool              3670000
google suggest a site              72700000
google suggested users             57000000
google suggestions funny           37400000
google suggest scraper                62800
google suggestions not working     87100000
google suggested user list        254000000

The suggestion autocomplete is AJAX; it outputs XML:

<?xml version="1.0"?>
   <toplevel>
     <CompleteSuggestion>
       <suggestion data="senior quotes"/>
       <num_queries int="30000000"/>
     </CompleteSuggestion>
     <CompleteSuggestion>
       <suggestion data="senior skip day lyrics"/>
       <num_queries int="441000"/>
     </CompleteSuggestion>
   </toplevel>

Using SimpleXML, the PHP routine is as simple as querying g00gle.c0m/complete/search?, grabbing the autocomplete xml, and extracting the attribute data :

 
        // call it as http://host/suggestit.php?your+query
        if ($_SERVER['QUERY_STRING']=='') die('enter a query like http://host/filename.php?query');
        $kw = urldecode($_SERVER['QUERY_STRING']);

        // grab the autocomplete xml from the suggest service
        $contentstring = @file_get_contents("http://google.com/complete/search?output=toolbar&q=".urlencode($kw));
        $content = simplexml_load_string($contentstring);

        foreach ($content->CompleteSuggestion as $c) {
            $term = (string) $c->suggestion->attributes()->data;
            //note : traffic data is sometimes missing
            $traffic = (string) $c->num_queries->attributes()->int;
            echo $term." ".$traffic."\n";
        }

I made a quick php script that outputs the terms as a list of new queries so you can walk through the suggestions :
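Something along these lines (a minimal sketch of that walker, reusing the $content object from the snippet above and assuming the script is saved as suggestit.php; the downloadable file below is the real thing):

// sketch : print each suggestion as a link back to this script,
// so clicking a term runs a new suggest query on it
foreach ($content->CompleteSuggestion as $c) {
    $term    = (string) $c->suggestion->attributes()->data;
    $traffic = (string) $c->num_queries->attributes()->int;   // sometimes missing
    echo '<a href="suggestit.php?'.urlencode($term).'">'.htmlspecialchars($term).'</a> '.$traffic.'<br />';
}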

The source is up as a text file for download over here (rename it to suggestit.php and it should run on any server with PHP 5.x and SimpleXML).


zend php and google webmaster tools api

update 2: Sandrine worked out a set of routines, as far as I know using Zend 1.7; she lists the code here.

update: Google updated their API in October (almost at the time I wrote these posts) and this code fails as it is still based on the V1 API. You can access the whole Webmaster Tools namespace (including sitemaps and verification) through the V2 API now, but you need to send a version id along with your request; that is handled in the new Zend 1.7 download.

The Problem

I can add 32,000 blogs on a standard WordPress Mu install. How do I add 32,000 subdomains, verify them and add their sitemaps to Google Webmaster Tools without having to go to the webmaster page about 96,000 times?

The solution

Integrating Google Webmaster Tools API into my WordPress Mu install.

What is it worth ?

If registering a site, verifying it and adding a sitemap each take 5 minutes, that is 15 minutes per domain; for 32,000 sites that comes to 8,000 hours, which at E12,- per hour makes 96,000 euros and roughly 4 labor years. Writing a script is worth E96.000,- and saves me four years of mindless drone work, so that is well worth having a look at.
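Just to make the back-of-the-envelope explicit, a throwaway sketch (numbers as above):

// back-of-the-envelope : what the manual route costs
$sites        = 32000;
$minutes_each = 15;                           // register + verify + sitemap, 5 minutes per task
$hours        = $sites * $minutes_each / 60;  // 8,000 hours
$cost         = $hours * 12;                  // at E12 per hour : 96,000 euro
$labor_years  = $hours / 2000;                // ~2,000 working hours a year : 4 years
echo "$hours hours, $cost euro, $labor_years labor years\n";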

Software : Zend

Zend_Gdata is a PHP framework for working with Google Data APIs. Its ClientLogin routine isn't very flexible and it doesn't cover the GWT API yet, so I'll have to hack some routines together.

After getting stonewalled by the Zend code a few times, I went searching and ended up at ngoprekweb, which has a nice post on ClientLogin authorization for the Blogger API. Eris Ristemena uses a modified Zend ClientLogin, very nice work. I installed the adapted classes and used them to get through the ClientLogin, and it paid off.

The good stuff : Gwt api access

I am not interested in the Blogger stuff though, I want access to GWT (Google Webmaster Tools), so I reworked Eris Ristemena's Blogger routine a little.

set_include_path(dirname(__FILE__) . '/Zend_Gdata');
  require_once 'Zend.php';
  Zend::loadClass('Zend_Gdata_ClientLogin');
  Zend::loadClass('Zend_Gdata');
  Zend::loadClass('Zend_Feed');

  $username     = '';
  $password     = '';
  $service      = 'sitemaps';
  $source       = 'Zend_ZendFramework-0.1.1'; // companyName-applicationName-versionID
  $logintoken   = $_POST['captchatoken'];
  $logincaptcha = $_POST['captchaanswer'];

  try {
    $resp = Zend_Gdata_ClientLogin::getClientLoginAuth($username,$password,$service,$source,$logintoken,$logincaptcha);

    if ( $resp['response']=='authorized' )
    {
      $client = Zend_Gdata_ClientLogin::getHttpClient($resp['auth']);
      $gdata = new Zend_Gdata($client);

      // list the sites registered in Google Webmaster Tools
      $feed = $gdata->getFeed("https://www.google.com/webmasters/tools/feeds/sites/");
      foreach ($feed as $item) {
        echo $item->title() . "<br />";
      }
    }
    elseif ( $resp['response']=='captcha' )
    {
      // Google wants a CAPTCHA solved : show the image and a small form that posts
      // the token and the answer back
      // (adjust the 'captchaurl' / 'captchatoken' keys to whatever the modified ClientLogin returns)
      echo 'Google requires you to solve this CAPTCHA image<br />';
      echo '<img src="https://www.google.com/accounts/' . $resp['captchaurl'] . '" /><br />';
      echo '<form method="post">';
      echo '<input type="hidden" name="captchatoken" value="' . $resp['captchatoken'] . '" />';
      echo 'Answer : <input type="text" name="captchaanswer" /> ';
      echo '<input type="submit" value="Login" />';
      echo '</form>';
      exit;
    }
    else
    {
      // there is no way you can go here, some exceptions must have been thrown
    }

  } catch ( Exception $e )  {
    echo $e->getMessage();
  }

(I added https://www.google.com/accounts/ to the captcha image source, otherwise it keeps drawing blanks.)

Zend uses an HttpClient for the connection to Google, and a Zend_Gdata class (wrapping the main feeds: blogs, sites) that you use to do the basic data manipulation. All feed entries are in Atom format with a custom namespace.

Now I am going to add a domain. In my add_site() function I put an Atom entry together and post it (using the post() function of the Zend_Gdata class) to the sites feed URL, and the Google API does the rest:

function add_site($domain, $client) {
		// Atom entry for the GWT sites feed : the site url goes in the content src attribute
		$xml  = '<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">';
		$xml .= '<atom:content src="http://' . $domain . '/" />';
		$xml .= '</atom:entry>';
		$fdata = new Zend_Gdata($client);
		$result = $fdata->post($xml,"https://www.google.com/webmasters/tools/feeds/sites/");
		return $result;
}

In the main routine I pass the domain and the running httpclient to the add_site() function :

   if ( $resp['response']=='authorized' )
    {
      $client = Zend_Gdata_ClientLogin::getHttpClient($resp['auth']);
      echo add_site('test.blacknorati.com', $client);
    }

Cool. That saves me up to 32,000 site registrations. The rest of it is still Greek to me, but this part functions. Next week: more nonsense (verify the site, add a sitemap, and integrate it in the blog creation function of WordPress Mu).
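For the WordPress Mu part, the rough idea is to hook add_site() into blog creation. A minimal sketch, assuming the ClientLogin code above is wrapped in a hypothetical gwt_get_client() helper that returns an authorized client:

// sketch : register every new Mu blog with Google Webmaster Tools on creation
// gwt_get_client() is a hypothetical wrapper around the ClientLogin code above
function gwt_register_new_blog($blog_id, $user_id, $domain, $path, $site_id, $meta) {
	$client = gwt_get_client();          // authorized client from Zend_Gdata_ClientLogin
	if ($client) {
		add_site($domain, $client);      // the add_site() function from this post
	}
}
add_action('wpmu_new_blog', 'gwt_register_new_blog', 10, 6);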

1) About the Blogger function: I tried to list the Blogger posts with the ngoprekweb PHP code, but it seems Blogger uses a different string these days to identify the blog in GData; the id is returned as “tag:blogger.com-blabla-(blogid)” and you want the last part to access the blog's posts Atom feed:

	$idText = split('-', $item->id());
        $blogid = $idText[2];

(modified from the Zend 1.6.1 codebase)

      foreach ($feed as $item) {
        echo $item->title() . "<br />";

        // the last part of the id is the blog id
        $idText = split('-', $item->id());
        $blogid = $idText[2];

        $feed1 = $gdata->getFeed("http://www.blogger.com/feeds/$blogid/posts/summary");
        //...
      }


blogger auto-poster

I needed to get my new link directory's pages indexed and crawled, and Google needs some stimulation.

So I take a Blogger subdomain,
and a 700-category PHP link directory,
and make a table PLD_TAGCLOUD(CAT_ID, POSTED, LEVEL, TAG, FULLPATH):

CREATE TABLE `PLD_TAGCLOUD` (
`ID` BIGINT( 11 ) NOT NULL AUTO_INCREMENT ,
`CAT_ID` DOUBLE NOT NULL ,
`POSTED` DOUBLE NOT NULL DEFAULT '0' ,
`LEVEL` DOUBLE NOT NULL ,
`TAG` VARCHAR( 250 ) NOT NULL ,
`FULLPATH` VARCHAR( 250 ) NOT NULL ,
PRIMARY KEY ( `ID` )
) ENGINE = MYISAM

I fill the table with a recursive tree traversal on the category table where “tag” is the ‘title’ field,
and I get the FULLPATH url by using the domain root and the path generated by traversing the tree :


function connect() {
	$DB_USER =  "";
	$DB_PASSWORD = "";
	$DB_HOST = "";
	$DB_DATA = "";
	$link =  mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
	if (!$link) {
	   	echo $error; 
		exit;
	} else {
    mysql_select_db($DB_DATA, $link) or $error = mysql_error();
	return $link;
	}
}

$link = connect();
$del="DELETE FROM `PLD_TAGCLOUD`";
$mydel = mysql_query($del, $link) or die(mysql_error());
@mysql_close($link);

$root = 'http://links.trismegistos.net';
read(0, $root, 1);   // walk the category tree from the root down

function read($rootid, $pathid, $thislevel) {
	$link = connect();
	$myqry = "SELECT * FROM `PLD_CATEGORY` WHERE `PARENT_ID`='".$rootid."'";
	$myres = mysql_query($myqry, $link) or die(mysql_error());
	if(mysql_num_rows($myres)<1) return;
	while($row=mysql_fetch_assoc($myres)) { 	
		$thispath= $pathid ."/".$row['TITLE_URL'];
		$link2 = connect();
		$add="INSERT INTO `PLD_TAGCLOUD` (`CAT_ID`, `LEVEL`, `TAG`, `FULLPATH`) VALUES ('".$row['ID']."', '".$thislevel."', '".htmlentities($row['TITLE'], ENT_QUOTES)."' ,'".$thispath."/')";
		$addit = mysql_query($add, $link2) or die(mysql_error());
		@mysql_close($link2);
		read($row['ID'], $thispath, $thislevel+1);
	}
	@mysql_close($link);
}

note: I also use a field LEVEL to store the depth of a category page (0 for root, 1 for main categories; mine goes down to 4).

Then we make a simple routine to grab the first record with POSTED=0,
grab the url,
grab the title,
grab 3 posts off Google Blog Search for the TITLE and add them to an email,
add a link to the category page url,
and mail(email, subject, message-body, headers).

And of course the coup de grâce, the cronjob: 700 posts, 4 per hour, so in about 175 hours my entire site is listed on a nice juicy blog. Just for the hell of it I put the links of the blogsearch results on 'follow', so my poor victims get a link as well.
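The cron entry itself is a one-liner, something like this (the path to the poster script is your own, of course):

# run the auto-poster every 15 minutes = 4 posts per hour
*/15 * * * * php /path/to/autopost.php > /dev/null 2>&1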


function connect() {
	$DB_USER =  "";
	$DB_PASSWORD = "";
	$DB_HOST = "";
	$DB_DATA = "";
	$link =  mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
	if (!$link) {
	   	echo $error; 
		exit;
	} else {
    mysql_select_db($DB_DATA, $link) or $error = mysql_error();
	return $link;
	}
}

	// grab the first record that has not been posted yet and flag it as posted
	$link  = connect();
	$myqry = "SELECT * FROM `PLD_TAGCLOUD` WHERE `POSTED`='0' ORDER BY ID DESC LIMIT 1";
	$myres = mysql_query($myqry, $link) or die(mysql_error());
	if (mysql_num_rows($myres) < 1) exit;   // nothing left to post
	$row   = mysql_fetch_assoc($myres);
	$myurl = $row['FULLPATH'];
	$mykey = urlencode($row['TAG']);
	$link2 = connect();
	$add   = "UPDATE `PLD_TAGCLOUD` SET `POSTED`='1' WHERE `ID`='".$row['ID']."'";
	$addit = mysql_query($add, $link2) or die(mysql_error());
	@mysql_close($link2);
	@mysql_close($link);


$xmlSource="http://blogsearch.google.com/blogsearch_feeds?hl=en&c2coff=1&lr=&safe=active&as_drrb=q&as_qdr=d&q=".$mykey."&ie=utf-8&num=3&output=rss";
$title="";
$link="";
$description="";
$author="";
$pubDate="";
$currentElement="";
$nieuwsitems = array();

function startElement($parser,$name,$attr){
	if(strcmp($name,"item")==0){
	$GLOBALS['title']="";
	$GLOBALS['link']="";
	$GLOBALS['description']="";
	$GLOBALS['author']="";
	$GLOBALS['pubDate']="";
	}
	$GLOBALS['currentElement']=$name;	
	if(strcmp($name,"link")==0){ $GLOBALS['href']=$attr["href"]; }

}

function endElement($parser,$name){
	$elements=array('title','link','description','author','pubDate');
	if(strcmp($name,"item")==0){
		foreach($elements as $element){
			$temp[$element] = $GLOBALS[$element];
		}
		$GLOBALS['nieuwsitems'][]=$temp;
		// reset the globals for the next item
		$GLOBALS['title']="";
		$GLOBALS['link']="";
		$GLOBALS['description']="";
		$GLOBALS['author']="";
		$GLOBALS['pubDate']="";
	}
}

function characterData($parser, $data) {
	$elements = array ('title', 'link', 'description','author','pubDate');
	foreach ($elements as $element) {
		if ($GLOBALS["currentElement"] == $element) {
			$GLOBALS[$element] .= $data;
		}
	}
}

function parseFile(){
	global $xmlSource,$nieuwsitems;
	$xml_parser=xml_parser_create();
	xml_set_element_handler($xml_parser,"startElement","endElement");
	xml_set_character_data_handler($xml_parser,"characterData");
	xml_parser_set_option($xml_parser,XML_OPTION_CASE_FOLDING,false);
	if(!($fp=fopen($xmlSource,"r"))){
		die("Cannot open  $xmlSource  ");
	}
	while(($data=fread($fp,4096))){
		if(!xml_parse($xml_parser,$data,feof($fp))){
			die(sprintf("XML error at line %d column %d ", 
			xml_get_current_line_number($xml_parser), 
			xml_get_current_column_number($xml_parser)));
		}
	}
	xml_parser_free($xml_parser);
	return $nieuwsitems;
}

$result = parseFile();

foreach($result as $arr){
	$strResult .= '<hr />';
	$strResult .= '<h4>'.$arr["title"].'</h4>'.$arr["description"].'<br />';
	$strResult .= '"<a href="'.$arr["link"].'">'.$arr["title"].'</a>" ('.parse_url($arr["link"], PHP_URL_HOST).')<br /><br />';
}

$strResult .= '<br /><a href="'.$myurl.'" title="'.$mykey.'">trismegistos links : '.$mykey.'</a><br />';

$email='juustout.linkdirectory@blogger.com';
$subject = $mykey;
mail($email,$subject,$strResult, "MIME-Version: 1.0\n"."Content-type: text/html; charset=iso-8859-1");

echo $strResult;

(Note the extra header in the PHP mail() function: it makes it possible to post HTML-marked-up text, otherwise you get flat text posted and your site looks like ****.)