google suggest scraper (php & simplexml)

Today’s goal is a basic PHP Google Suggest scraper, because I wanted traffic data and keywords for free.

Before we start :

google scraping is bad !

Good people use the Google AdWords API : 25 cents per 1000 units, at 15 or more units per keyword suggestion, so they pay 4 or 5 dollars for 1000 keyword suggestions (if they can find a good programmer, which also costs a few dollars). Or they opt for SemRush (also my preference), KeywordSpy, Spyfu, and other services like 7Search PPC programs to get keyword and traffic data and data on their competitors. These charge from about 80 dollars per month for a limited account up to a few hundred per month for SEO companies. Good people pay plenty.
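To make that arithmetic concrete, here is a minimal sketch using the figures quoted above (25 cents per 1000 units, 15 or more units per suggestion). The function name suggestion_cost is made up for illustration, and the 20-unit case is only an example of what "15 or more" could mean.

```php
<?php
// Rough cost of keyword suggestions on a pay-per-unit API.
// Price and unit counts are the figures quoted in the text; the helper itself is hypothetical.
function suggestion_cost(int $suggestions, int $unitsPerSuggestion, float $pricePer1000Units = 0.25): float
{
    $units = $suggestions * $unitsPerSuggestion;
    return $units / 1000 * $pricePer1000Units;
}

printf("%.2f\n", suggestion_cost(1000, 15)); // 1000 suggestions at 15 units each
printf("%.2f\n", suggestion_cost(1000, 20)); // 1000 suggestions at 20 units each
```

At 15 units per suggestion this comes to 3.75 dollars per 1000 suggestions, and at 20 units to 5 dollars, which is where the "4 or 5 dollars" comes from.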

We tiny grey webmice of marketing, however, just want a few estimates at low or, better, no cost. Like this :

data                             num_queries
google suggest                   57800000
google suggestion box            5390000
google suggest api               5030000
google suggestion tool           3670000
google suggest a site            72700000
google suggested users           57000000
google suggestions funny         37400000
google suggest scraper           62800
google suggestions not working   87100000
google suggested user list       254000000

The suggestion autocomplete is an AJAX call, and it returns XML :

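<?xml version="1.0"?>
<toplevel>
    <CompleteSuggestion>
        <suggestion data="senior quotes"/>
        <num_queries int="30000000"/>
    </CompleteSuggestion>
    <CompleteSuggestion>
        <suggestion data="senior skip day lyrics"/>
        <num_queries int="441000"/>
    </CompleteSuggestion>
</toplevel>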

Using SimpleXML, the PHP routine is as simple as querying g00gle.c0m/complete/search?, grabbing the autocomplete xml, and extracting the attribute data :

    if ($_SERVER['QUERY_STRING'] == '') die('enter a query like http://host/filename.php?query');
    //the raw query string is the keyword
    $kw = urldecode($_SERVER['QUERY_STRING']);
    $contentstring = @file_get_contents("http://g00gle.c0m/complete/search?output=toolbar&q=" . urlencode($kw));
    if ($contentstring === false) die('no response');
    $content = simplexml_load_string($contentstring);
    if ($content === false) die('response is not valid xml');
    foreach ($content->CompleteSuggestion as $c) {
        $term = (string) $c->suggestion->attributes()->data;
        //note : traffic data is sometimes missing
        $traffic = isset($c->num_queries) ? (string) $c->num_queries->attributes()->int : '';
        echo $term . " " . $traffic . "\n";
    }
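To try the parsing step without hitting the network, here is a minimal sketch that runs the same loop over a hand-written sample. The function name parse_suggestions and the sample string are assumptions for illustration; the sample copies the CompleteSuggestion structure the loop expects, and the root element name toplevel is assumed. A suggestion without traffic data comes back as null.

```php
<?php
// Parse a toolbar-style suggest response into [term => traffic] pairs.
// Returns an empty array when the input is not valid XML.
function parse_suggestions(string $xml): array
{
    $doc = @simplexml_load_string($xml);
    if ($doc === false) {
        return [];
    }
    $out = [];
    foreach ($doc->CompleteSuggestion as $c) {
        $term = (string) $c->suggestion['data'];
        // traffic data is sometimes missing, so check before reading the attribute
        $traffic = isset($c->num_queries) ? (int) $c->num_queries['int'] : null;
        $out[$term] = $traffic;
    }
    return $out;
}

// Hand-written sample; the second suggestion has no num_queries element.
$sample = '<toplevel>
  <CompleteSuggestion><suggestion data="senior quotes"/><num_queries int="30000000"/></CompleteSuggestion>
  <CompleteSuggestion><suggestion data="senior skip day lyrics"/></CompleteSuggestion>
</toplevel>';

foreach (parse_suggestions($sample) as $term => $traffic) {
    echo $term, ' ', $traffic === null ? 'n/a' : $traffic, "\n";
}
```

Keeping the parsing in its own function means the fetch can fail or change without touching the extraction code.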

I made a quick PHP script that outputs the terms as a list of new queries, so you can walk through the suggestions :

The source is up for download as a text file over here (rename it to suggestit.php and it should run on any server with PHP 5.* and SimpleXML).
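The downloadable script is not reproduced here, but the idea of printing each suggestion as a link to a new query can be sketched as follows. This is an illustration under assumptions, not the original source: the function name suggestion_links is made up, and it assumes the script reads its keyword from the raw query string as in the example above.

```php
<?php
// Turn a list of suggested terms into links that re-run the script with that term.
// Assumes the script takes the keyword as the raw query string (suggestit.php?keyword).
function suggestion_links(array $terms, string $script = 'suggestit.php'): string
{
    $html = "<ul>\n";
    foreach ($terms as $term) {
        $href = $script . '?' . rawurlencode($term);
        $html .= '<li><a href="' . htmlspecialchars($href) . '">'
               . htmlspecialchars($term) . "</a></li>\n";
    }
    return $html . "</ul>\n";
}

echo suggestion_links(['google suggest api', 'google suggest scraper']);
```

Escaping both the link target and the link text matters here, because the terms come straight from a remote response.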

Google Panda Latent Semantic Indexing Test


Latent Semantic Indexing

Queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria.


LSI Test

O my friend, Panda is something that has to be surpassed. In {speculation|guess|supposition|surmise|surmisal|possibility|hypothesis} and keeping silence shall the friend be a master: you should not wish to see everything. (Nietzsche, Also Sprach Zarathustra)


Id the_term the_type the_value
156875 c0njecture (noun) speculation
156876 ———- (noun) hypothesis (generic term)
156877 ———- (noun) possibility
156878 ———- (noun) theory (generic term)
156879 ———- (noun) guess
156880 ———- (noun) supposition
156881 ———- (noun) surmise
156882 ———- (noun) surmisal
156883 ———- (noun) speculation
156884 ———- (noun) hypothesis
156885 ———- (noun) opinion (generic term)
156886 ———- (noun) view (generic term)
156887 ———- (noun) reasoning (generic term)
156888 ———- (noun) logical thinking (generic term)
156889 ———- (noun) abstract thought (generic term)
156890 ———- (verb) speculate
156891 ———- (verb) theorize
156892 ———- (verb) theorise
156893 ———- (verb) hypothesize
156894 ———- (verb) hypothesise
156895 ———- (verb) hypothecate
156896 ———- (verb) suppose
156897 ———- (verb) expect (generic term)
156898 ———- (verb) anticipate (generic term)

 (source : semanthesaurus)


Let’s see if the Panda gets it.



ga api sample : get pageviews

I was going to put this online : how to get the pageviews out of the Google Analytics API, using SimpleXML and PHP. Google uses three namespaces in the output file, which makes it less easily accessible, so here’s a quick sample of how to get your site’s pageviews out of it :

    //ids           = site identifier (from the site data feed)
    //metrics     = what i want to see
    //start-date
    //end-date
    $feedUri = "";
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $feedUri);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 3);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    //for authtoken : see previous post
    $headers = array();
    $headers[] = "Authorization: GoogleLogin auth=" . $Authtoken;
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    //note : these two lines switch off ssl certificate checks
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_VERBOSE, 1);
    //get the string containing the xml file
    $gA = curl_exec($curl);
    curl_close($curl);
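The feed URI itself is left empty in the listing above. As a hedged sketch of how the four parameters from the comments could be put together into a query string, the following uses http_build_query. The base URL, the ids value and the date format are placeholders chosen for illustration, not the real feed address, and build_feed_uri is a made-up helper.

```php
<?php
// Build a data-feed request URI from the four parameters listed in the comments.
// $base is a placeholder; substitute the real feed address.
function build_feed_uri(string $base, string $ids, string $metrics, string $start, string $end): string
{
    $params = [
        'ids'        => $ids,      // site identifier
        'metrics'    => $metrics,  // what to report
        'start-date' => $start,
        'end-date'   => $end,
    ];
    // http_build_query url-encodes the values for us
    return $base . '?' . http_build_query($params);
}

// All argument values below are made-up examples.
echo build_feed_uri('https://example.com/feeds/data', 'ga:12345', 'ga:pageviews', '2011-01-01', '2011-01-31'), "\n";
```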

The feed has three namespaces (atom, opensearch and dxp/analytics). A simple way in is through the ENTRY tags (from the Atom namespace) : each entry contains one DXP: line, and that line holds the answer to the question.

<dxp:metric confidenceInterval='0.0' name='ga:pageviews' type='integer' value='755'/>

    //load the string into a simple xml object
    $feed = simplexml_load_string($gA);
    //take the atom namespace (the argument is its namespace uri)
    $children = $feed->children('');
    //take the entry tags
    $parts = $children->entry;
    foreach ($parts as $entry) {
        //from the entry tag,
        //access the dxp namespace (again by its namespace uri)
        $dxp = $entry->children('');
        //METRIC contains the answer to the question
        //grab from the tag METRIC the attribute VALUE
        echo (string) $dxp->metric->attributes()->value;
    }

The important part is the (string) typecast. Normally SimpleXML returns a SimpleXML object; when you force a string type, it gives the actual value attribute of the ga:pageviews metric as a numeric string.
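As a self-contained sketch of the same idea, the following selects the dxp elements by prefix, using children('dxp', true), rather than by namespace URI. The sample feed is hand-written for illustration : the dxp:metric line is the one shown above, while the surrounding feed and entry structure, the dxp namespace URI in the sample and the function name pageviews_from_feed are assumptions.

```php
<?php
// Collect the metric value from every entry in a feed string.
// The dxp elements are selected by their prefix instead of their namespace URI.
function pageviews_from_feed(string $xml): array
{
    $feed = simplexml_load_string($xml);
    $values = [];
    // entry lives in the feed's default (Atom) namespace, so plain property access finds it
    foreach ($feed->entry as $entry) {
        // second argument true means 'dxp' is a prefix, not a namespace URI
        $dxp = $entry->children('dxp', true);
        // the (string) cast turns the SimpleXML attribute object into plain text
        $values[] = (string) $dxp->metric->attributes()->value;
    }
    return $values;
}

// Hand-written sample feed; structure and the dxp namespace URI are assumed.
$sample = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dxp="http://schemas.google.com/analytics/2009">
  <entry>
    <title>ga:pageviews</title>
    <dxp:metric confidenceInterval="0.0" name="ga:pageviews" type="integer" value="755"/>
  </entry>
</feed>
XML;

var_dump(pageviews_from_feed($sample));
```

Selecting by prefix only works as long as the feed keeps using the dxp prefix, so the namespace URI is the more robust choice when it is known.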