
google suggest scraper (php & simplexml)

Today’s goal is a basic PHP Google Suggest scraper, because I wanted traffic data and keywords for free.

Before we start :

google scraping is bad !

Good people use the Google Adwords API: 25 cents per 1000 units, and a keyword suggestion costs 15++ units, so they pay 4 or 5 dollars per 1000 keyword suggestions (if they can find a good programmer, who also costs a few dollars). Or they opt for SemRush (also my preference), KeywordSpy, Spyfu, or other services like 7Search PPC programs to get keyword and traffic data, plus data on their competitors, but these charge about 80 dollars per month for a limited account, up to a few hundred per month for seo companies. Good people pay plenty.
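The price tag works out like this. A quick sketch using the figures just quoted (the 15-units figure is the per-suggestion cost mentioned above; the variable names are mine):

```php
<?php
// rough Adwords API cost for 1000 keyword suggestions
$price_per_1000_units = 0.25; // 25 cents per 1000 units
$units_per_keyword    = 15;   // 15++ units per keyword suggestion
$keywords             = 1000;

$cost = $keywords * $units_per_keyword / 1000 * $price_per_1000_units;
echo $cost . "\n"; // 3.75, i.e. "4 or 5 dollars" once the ++ kicks in
```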

We tiny grey webmice of marketing, however, just want a few estimates at low or, better, no cost. Like this:

data                              num queries
google suggest                    57800000
google suggestion box             5390000
google suggest api                5030000
google suggestion tool            3670000
google suggest a site             72700000
google suggested users            57000000
google suggestions funny          37400000
google suggest scraper            62800
google suggestions not working    87100000
google suggested user list        254000000

The suggestion autocomplete is an AJAX call; it outputs XML:

<?xml version="1.0"?>
   <toplevel>
     <CompleteSuggestion>
       <suggestion data="senior quotes"/>
       <num_queries int="30000000"/>
     </CompleteSuggestion>
     <CompleteSuggestion>
       <suggestion data="senior skip day lyrics"/>
       <num_queries int="441000"/>
     </CompleteSuggestion>
   </toplevel>

Using SimpleXML, the PHP routine is as simple as querying g00gle.c0m/complete/search?, grabbing the autocomplete xml, and extracting the attribute data :

 
if ($_SERVER['QUERY_STRING'] == '') die('enter a query like http://host/filename.php?query');
$kw = urldecode($_SERVER['QUERY_STRING']);
$contentstring = @file_get_contents("http://g00gle.c0m/complete/search?output=toolbar&q=" . urlencode($kw));
$content = simplexml_load_string($contentstring);

foreach ($content->CompleteSuggestion as $c) {
    $term = (string) $c->suggestion->attributes()->data;
    // note : traffic data is sometimes missing
    $traffic = isset($c->num_queries) ? (string) $c->num_queries->attributes()->int : '';
    echo $term . " " . $traffic . "\n";
}
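To check the parsing logic without hitting Google at all, you can feed the sample response from above straight into simplexml_load_string. A stand-alone sketch (the XML literal is just the example printed earlier):

```php
<?php
// parse the sample autocomplete response offline with SimpleXML
$xml = '<?xml version="1.0"?>
<toplevel>
  <CompleteSuggestion>
    <suggestion data="senior quotes"/>
    <num_queries int="30000000"/>
  </CompleteSuggestion>
  <CompleteSuggestion>
    <suggestion data="senior skip day lyrics"/>
    <num_queries int="441000"/>
  </CompleteSuggestion>
</toplevel>';

$content = simplexml_load_string($xml);
foreach ($content->CompleteSuggestion as $c) {
    // same attribute access as in the scraper above
    echo (string) $c->suggestion->attributes()->data . ' '
       . (string) $c->num_queries->attributes()->int . "\n";
}
// prints:
// senior quotes 30000000
// senior skip day lyrics 441000
```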

I made a quick PHP script that outputs the terms as a list of new queries so you can walk through the suggestions:
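The walk-through idea in miniature (the suggestit.php filename and the $terms array are stand-ins for the real scraper output): each term links back to the script with itself as the query string, so clicking a suggestion scrapes the suggestions for that suggestion.

```php
<?php
// echo every scraped term as a link that queries the script again,
// so each suggestion becomes the next query to walk through
$terms = array('google suggest api', 'google suggest scraper');
foreach ($terms as $term) {
    echo '<a href="suggestit.php?' . urlencode($term) . '">'
       . htmlspecialchars($term) . "</a>\n";
}
```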

The source is up for download as a text file over here (rename it to suggestit.php and it should run on any server with PHP 5.* and SimpleXML).


google trends III

How to get the urls and snippets from the Google Trends details page. The news articles on the details page are loaded with an ‘Ajax’ call; they are not sent to the browser in the html source, so there is no easy way to scrape those.

The blog articles are pretty straightforward. First the ugly fast way:

$mytitle = 'manuel benitez';
$mydate = ''; // 2008-12-24
$html = file_get_contents('http://www.google.com/trends/hottrends?q=' . urlencode($mytitle) . '&date=&sa=X');
// the literal marker strings were eaten by the blog software;
// use the html fragments just before and after the blog-snippet block
$start = strpos($html, '...start marker...');
$end = strpos($html, '...end marker...');
$content = substr($html, $start, $end - $start);
echo $content;

That returns the blog snippets, ugly. The other way is regular pattern matching: you can grab the divs that mark each content item,

  • div class="gs-title"
  • div class="gs-relativePublishedDate"
  • div class="gs-snippet"
  • div class="gs-visibleUrl"

from the html source and organize them in a “Content” array, after which you can list the content items with your own markup or store them in a database.

// I assume $mytitle is taken from the $_GET array.

// the 'Content' array and its members
class Content {
	var $id;
	var $title;
	var $pubdate;
	var $snippet;
	var $url;

	public function __construct($id) {
		$this->id = $id;
	}
}

//grab the source from the google page
$html = file_get_contents('http://www.google.com/trends/hottrends?q=' . urlencode($mytitle) . '&date=&sa=X');

//cut out the part I want (again, the original marker strings were eaten
//by the blog software; substitute the fragments around the blog block)
$start = strpos($html, '...start marker...');
$end = strpos($html, '...end marker...');
$content = substr($html, $start, $end - $start);

//grab the divs that contain title, publish date, snippet and url
//with a regular pattern match
preg_match_all('!<div class="gs-title".*?</div>!si', $html, $titles);
preg_match_all('!<div class="gs-relativePublishedDate".*?</div>!si', $html, $pubDates);
preg_match_all('!<div class="gs-snippet".*?</div>!si', $html, $snippets);
preg_match_all('!<div class="gs-visibleUrl".*?</div>!si', $html, $urls);

$Contents = array();

//organize them under Content
$count = 0;
foreach ($titles[0] as $title) {
	//make a new instance of Content and add the title
	$Contents[] = new Content($count);
	$Contents[$count]->title = $title;
	$count++;
}
$count = 0;
foreach ($pubDates[0] as $pubDate) {
	//add publishing date (contains some linebreaks, remove them with strip_tags)
	$Contents[$count]->pubdate = strip_tags($pubDate);
	$count++;
}
$count = 0;
foreach ($snippets[0] as $snippet) {
	//add snippet
	$Contents[$count]->snippet = $snippet;
	$count++;
}
$count = 0;
foreach ($urls[0] as $url) {
	//add display url
	$Contents[$count]->url = $url;
	$count++;
}

//leave $count as is, the number of content-items with a 0-based array
//add rel=nofollow to links to prevent pagerank assignment to blogs
for ($ct = 0; $ct < $count; $ct++) {
	$Contents[$ct]->url   = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->url);
	$Contents[$ct]->title = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->title);
}

//it's complete, list all content-items with some markup
//(the original markup tags were also eaten; plain paragraphs will do)
for ($ct = 0; $ct < $count; $ct++) {
	echo '<p>' . $Contents[$ct]->title . '<br />';
	echo $Contents[$ct]->pubdate . ': ' . $Contents[$ct]->snippet . '<br />';
	echo $Contents[$ct]->url . '</p>';
}

It ain’t perfect, but it works. The syntax highlighter I use gets a bit confused by the preg_match_all statements containing unclosed divs, so copying the code off the blog may not work.