¤ is evil !

I ran into a funny problem last week with a url from the api of a payment gateway that contained …&currencyCode=…..

I adopted an orphaned codebase and am reprogramming it as WordPress plugin and it contains some really weird bugs. The two previous coders that worked on the codebase opted for an iframe solution through an intermediate page.

Problem is SimpleXml recognizing &curren and converting to ¤ (unicode U+000A4) with ¤cycode= as result, if you tried to relocate the page to that url the gateway rejects the url.

str_replace() does not recognize ¤, the solution I found is first running htmlentities() that replaces ¤ with &curren but also converts & to & which the gateway also rejects, so after using htmlentities() you use str_replace() to replace & with &, and then it works fine.

I am open to suggestions on that one, for now, conclusion is that SimpleXML does not handle &curren in a url correctly, you might run into the same problem with &current_user,  &current_time, et cetera.

google suggest scraper (php & simplexml)

Today’s goal is a basic php Google Suggest scraper because I wanted traffic data and keywords for free.

Before we start :

google scraping is bad !

Good People use the Google Adwords API : 25 cents for 1000 units, 15++ units for keyword suggestion so they pay 4 or 5 dollar for 1000 keyword suggestions (if they can find a good programmer which also costs a few dollars). Or they opt for SemRush (also my preference), KeywordSpy, Spyfu, and other services like 7Search PPC programs to get keyword and traffic data and data on their competitors but these also charge about 80 dollars per month for a limited account up to a few hundred per month for seo companies. Good people pay plenty.

We tiny grey webmice of marketing however just want a few estimates, at low or better no cost : like this :

data num queries
google suggest 57800000
google suggestion box 5390000
google suggest api 5030000
google suggestion tool 3670000
google suggest a site 72700000
google suggested users 57000000
google suggestions funny 37400000
google suggest scraper 62800
google suggestions not working 87100000
google suggested user list 254000000

Suggestion autocomplete is AJAX, it outputs XML :

< ?xml version="1.0"? >
   <toplevel>
     <CompleteSuggestion>
       <suggestion data="senior quotes"/>
       <num_queries int="30000000"/>
     </CompleteSuggestion>
     <CompleteSuggestion>
       <suggestion data="senior skip day lyrics"/>
       <num_queries int="441000"/>
     </CompleteSuggestion>
   </toplevel>

Using SimpleXML, the PHP routine is as simple as querying g00gle.c0m/complete/search?, grabbing the autocomplete xml, and extracting the attribute data :

  1.         if ($_SERVER['QUERY_STRING']=='') die('enter a query like http://host/filename.php?query');
  2.  $contentstring = @file_get_contents("http://g00gle.c0m/complete/search?output=toolbar&amp;q=".urlencode($kw));  
  3.    $content = simplexml_load_string($contentstring );
  4.  
  5.         foreach($content-&gt;CompleteSuggestion as $c) {
  6.             $term = (string) $c-&gt;suggestion-&gt;attributes()-&gt;data;
  7.             //note : traffic data is sometimes missing  
  8.             $traffic = (string) $c-&gt;num_queries-&gt;attributes()-&gt;int;
  9.             echo $term. " ".$traffic . "
  10. " ;
  11.  }

I made a quick php script that outputs the terms as a list of new queries so you can walk through the suggestions :

The source is as text file up for download overhere (rename it to suggestit.php and it should run on any server with php5.* and simplexml).

bing api with php and simplexml

About scraping results off of Bing : Bing use a set of about eight cookies. You can grab 200 results with php curl, as 20 pages of 10, but after the first 200 the Bing server checks for the cookie and for lack of one returns a blank page. I can fidget with the curl cookiejar, but Bing also offer a straighforward API.

Using the Bing API to list search results is easier.

Bing TOS : not for seo rank checks

In the last paragraph of the api guide, Bing give a quick recap of their TOS, you can do max 7 queries per second, and using the results for SEO rank checks is explicitly prohibited.

These following snippets (text source) are hence explicitly not to be used for bing search engine result page (‘serp’) rank checks.

bing api with simplexml

So here is one for web results using php simplexml. The web api (which uses namespaces) allows for retrieving max 1000 results per term at max 50 results per query, you can specify the number of results and the offset, where to start grabbing results.

  1. $Appid="A_VERY_LONG_STRING";
  2. $Query = "seo rank check";
  3. $Numres = 50; //max 50
  4. $Offset = 1;    //up to 1000
  5.  
  6. $url = 'http://api.search.live.net/xml.aspx?
  7. Appid='.$Appid.'
  8. &query='.$Query.'
  9. &sources=web
  10. &web.count='.$Numres.'
  11. &web.offset='.$Offset;
  12.  
  13. $feed = simplexml_load_file($url);
  14. //use the web: namespace
  15.  $children =  $feed->children('http://schemas.microsoft.com/LiveSearch/2008/04/XML/web');
  16.       foreach ($children->Web->Results->WebResult as $d) {
  17.                 echo $d->Title.'<br />';
  18.                 echo $d->Description.'<br />';
  19.                 echo $d->Url.'<br />';
  20.                 echo $d->DisplayUrl.'<br />';
  21.    }

..and one for the pictures using php simplexml :

  1. $Appid="A_VERY_LONG_STRING";
  2. $Query = "alkmaar";
  3. $Numres = 10;
  4. $Offset = 1;
  5.  
  6. $url = 'http://api.search.live.net/xml.aspx?';
  7. $url .= 'Appid='.$Appid;
  8. $url .= '&query='.$Query;
  9. $url .= '&sources=image';
  10. $url .= '&image.count='.$Numres;
  11. $url .= '&image.offset='.$Offset;
  12.  
  13. $feed = simplexml_load_file($url);
  14.  
  15. //use the mms: namespace      
  16.   $children =  $feed->children('http://schemas.microsoft.com/LiveSearch/2008/04/XML/multimedia');
  17.  
  18.     echo('<ul ID="resultList">');
  19.  
  20.     foreach ($children->Image->Results->ImageResult as $d) {
  21.                 echo('<li class="resultlistitem"><a href="' . $d->DisplayUrl . '">' . $d->Title . '</a><br />');
  22.                 echo('<img src="' . $d-/>Thumbnail->Url. '" /><br />
  23.                      '.$d->Thumbnail->ContentType.'<br />
  24.                     '.$d->Thumbnail->Height.'<br />
  25.                     '.$d->Thumbnail->Width.'<br />
  26.                     '.$d->Thumbnail->FileSize.'<br />
  27.                     </li>');
  28.        }
  29.     echo("</ul>");

I actually like that api, I am going to use that.

bing api with json

Bing seem to prefer you use json, less bandwidth usage. After their example in the api basics guide :

  1.  
  2. $Numres = 10;
  3. $Offset = 1;
  4. $Query='alkmaar';
  5.  
  6. $url = 'http://api.search.live.net/json.aspx?';
  7. $url .= 'Appid='.$Appid;
  8. $url .= '&query='.$Query;
  9. $url .= '&sources=image';
  10. $url .= '&image.count='.$Numres;
  11. $url .= '&image.offset='.$Offset;
  12.  
  13.  
  14. $response = file_get_contents($url);
  15. $jsonobj = json_decode($response);
  16. echo('<ul ID="resultList">');
  17. foreach($jsonobj->SearchResponse->Image->Results as $value)
  18. {
  19.     echo('<li class="resultlistitem"><a href="' . $value->Url . '">');
  20.     echo('<img src="' . $value-/>Thumbnail->Url. '"></a></li>');
  21. }
  22. echo("</ul>");

Of course there is the old RSS-option, which doesnt require an appid but also falls under the api 2.0 tos, and a soap option.

other sources :
There is a bing api php class made over at routecafe, and a jquery bing plugin using json over at Einar Otto Stangvik’s blog.