juust ~ php oddities

bing api with php and simplexml

juust | 17/09/2009

About scraping results off Bing: Bing use a set of about eight cookies. You can grab 200 results with php curl, as 20 pages of 10, but after the first 200 the Bing server checks for the cookie and, for lack of one, returns a blank page. I could fidget with the curl cookiejar, but Bing also offer a straightforward API.

Using the Bing API to list search results is easier.

Bing TOS : not for seo rank checks

In the last paragraph of the api guide, Bing give a quick recap of their TOS: you can do max 7 queries per second, and using the results for SEO rank checks is explicitly prohibited.
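The 7 queries per second cap is easy to respect with a small throttle. This is only a sketch: the class name QueryThrottle and its methods are my own invention, not something the Bing API provides.

```php
<?php
// Hypothetical helper (not part of the Bing API): keeps consecutive
// calls at least 1/$maxPerSecond seconds apart.
class QueryThrottle
{
    private $minInterval;     // seconds between two calls
    private $lastCall = null; // timestamp of the previous call

    public function __construct($maxPerSecond = 7)
    {
        $this->minInterval = 1.0 / $maxPerSecond;
    }

    // Returns how many seconds to wait, given the current time.
    public function delayFor($now)
    {
        if ($this->lastCall === null) {
            return 0.0;
        }
        $elapsed = $now - $this->lastCall;
        return max(0.0, $this->minInterval - $elapsed);
    }

    // Sleeps if needed, then records the call time.
    public function wait()
    {
        $delay = $this->delayFor(microtime(true));
        if ($delay > 0) {
            usleep((int) ceil($delay * 1000000));
        }
        $this->lastCall = microtime(true);
    }
}
```

Calling $throttle->wait() right before every request is enough; the first call returns at once, later calls sleep only as long as needed.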

The following snippets (text source) are hence explicitly not to be used for Bing search engine result page (‘serp’) rank checks.

bing api with simplexml

So here is one for web results using php simplexml. The web api (which uses namespaces) allows retrieving max 1000 results per term at max 50 results per query; you can specify the number of results and the offset (where to start grabbing results).

  $Appid = "A_VERY_LONG_STRING";
  $Query = "seo rank check";
  $Numres = 50;   //max 50
  $Offset = 1;    //up to 1000

  //build the url piece by piece, a multi-line string would put line breaks in it
  $url = 'http://api.search.live.net/xml.aspx?';
  $url .= 'Appid='.$Appid;
  $url .= '&query='.urlencode($Query);   //the query has spaces in it
  $url .= '&sources=web';
  $url .= '&web.count='.$Numres;
  $url .= '&web.offset='.$Offset;

  $feed = simplexml_load_file($url);
  if ($feed === false) {
      die('no usable response');
  }

  //use the web: namespace
  $children = $feed->children('http://schemas.microsoft.com/LiveSearch/2008/04/XML/web');
  foreach ($children->Web->Results->WebResult as $d) {
      echo $d->Title.'<br />';
      echo $d->Description.'<br />';
      echo $d->Url.'<br />';
      echo $d->DisplayUrl.'<br />';
  }
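With 50 results per query and 1000 per term, fetching a full list means stepping the offset. A sketch under assumptions: both function names are made up, and whether the api counts offsets from 0 or from 1 is something to check in the api guide (the $first parameter is there to adjust for that).

```php
<?php
// Hypothetical helper: builds the request url from the same
// parameters the snippet above uses, url-encoding each value.
function bing_web_url($appid, $query, $count, $offset)
{
    $params = array(
        'Appid'      => $appid,
        'query'      => $query,
        'sources'    => 'web',
        'web.count'  => $count,
        'web.offset' => $offset,
    );
    // pass '&' explicitly so a custom arg_separator.output cannot interfere
    return 'http://api.search.live.net/xml.aspx?' . http_build_query($params, '', '&');
}

// Hypothetical helper: lists the offsets needed to collect $wanted
// results, capped at the 1000 results per term mentioned above.
// $first = 0 is an assumption about where the api starts counting.
function page_offsets($wanted, $perPage = 50, $first = 0)
{
    $wanted = min($wanted, 1000);
    $offsets = array();
    for ($done = 0; $done < $wanted; $done += $perPage) {
        $offsets[] = $first + $done;
    }
    return $offsets;
}
```

Looping over page_offsets(1000) and feeding each offset to bing_web_url() gives the 20 urls needed for one term.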

..and one for the pictures using php simplexml :

  $Appid = "A_VERY_LONG_STRING";
  $Query = "alkmaar";
  $Numres = 10;
  $Offset = 1;

  $url = 'http://api.search.live.net/xml.aspx?';
  $url .= 'Appid='.$Appid;
  $url .= '&query='.urlencode($Query);
  $url .= '&sources=image';
  $url .= '&image.count='.$Numres;
  $url .= '&image.offset='.$Offset;

  $feed = simplexml_load_file($url);

  //use the mms: namespace
  $children = $feed->children('http://schemas.microsoft.com/LiveSearch/2008/04/XML/multimedia');

  echo('<ul ID="resultList">');

  foreach ($children->Image->Results->ImageResult as $d) {
      echo('<li class="resultlistitem"><a href="' . $d->DisplayUrl . '">' . $d->Title . '</a><br />');
      echo('<img src="' . $d->Thumbnail->Url . '" /><br />
           '.$d->Thumbnail->ContentType.'<br />
           '.$d->Thumbnail->Height.'<br />
           '.$d->Thumbnail->Width.'<br />
           '.$d->Thumbnail->FileSize.'<br />
           </li>');
  }
  echo("</ul>");

I actually like that api, I am going to use that.

bing api with json

Bing seem to prefer that you use json, as it takes less bandwidth. Following their example in the api basics guide :

  $Appid = "A_VERY_LONG_STRING";
  $Numres = 10;
  $Offset = 1;
  $Query = 'alkmaar';

  $url = 'http://api.search.live.net/json.aspx?';
  $url .= 'Appid='.$Appid;
  $url .= '&query='.urlencode($Query);
  $url .= '&sources=image';
  $url .= '&image.count='.$Numres;
  $url .= '&image.offset='.$Offset;

  $response = file_get_contents($url);
  $jsonobj = json_decode($response);
  echo('<ul ID="resultList">');
  foreach ($jsonobj->SearchResponse->Image->Results as $value)
  {
      echo('<li class="resultlistitem"><a href="' . $value->Url . '">');
      echo('<img src="' . $value->Thumbnail->Url . '"></a></li>');
  }
  echo("</ul>");
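The titles and urls in the results come from third-party pages, so escaping them before echoing is prudent. A sketch only: the sample JSON is invented and merely mirrors the fields the snippet above reads, and the function name render_image_list is mine.

```php
<?php
// Hypothetical helper: renders image results as an html list,
// escaping every value that ends up in the markup.
function render_image_list($json)
{
    $obj = json_decode($json);
    if (!isset($obj->SearchResponse->Image->Results)) {
        return '<ul id="resultList"></ul>';   // bad json or no results
    }
    $html = '<ul id="resultList">';
    foreach ($obj->SearchResponse->Image->Results as $r) {
        $href = htmlspecialchars($r->Url, ENT_QUOTES, 'UTF-8');
        $src  = htmlspecialchars($r->Thumbnail->Url, ENT_QUOTES, 'UTF-8');
        $html .= '<li class="resultlistitem"><a href="' . $href . '">'
               . '<img src="' . $src . '" /></a></li>';
    }
    return $html . '</ul>';
}

// Invented sample, shaped after the fields used above.
$sample = '{"SearchResponse":{"Image":{"Results":[
  {"Url":"http://example.com/a?x=1&y=2",
   "Thumbnail":{"Url":"http://example.com/t.jpg"}}]}}}';
echo render_image_list($sample);
```

Returning the markup as a string rather than echoing inside the loop also makes the function easy to check without a live api call.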

Of course there is the old RSS option, which doesn’t require an appid but also falls under the api 2.0 TOS, and a SOAP option.

other sources :
There is a bing api php class made over at routecafe, and a jquery bing plugin using json over at Einar Otto Stangvik’s blog.

Categories
php, seo
Tags
bing, php, seo

about the trackback thing

juust | 24/04/2009

The question about the trends script with trackbacks was whether a few hundred backlinks were worth the trouble, and they weren’t. I wrote a second routine to grab the most common significant words from excerpts, and do a second search to grab better results and up to five trackbacks per page.
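Grabbing the most common significant words from an excerpt can be sketched roughly like this. It is not the routine from the trends script: the function name, the stop list and the minimum word length are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the $limit most frequent words in
// $text, ignoring short words and a small hand-picked stop list.
// str_word_count() only handles plain ascii letters reliably.
function significant_words($text, $limit = 5, $minLength = 4)
{
    $stop = array('that', 'this', 'with', 'from', 'have', 'were', 'they');
    $words = str_word_count(strtolower($text), 1);
    $counts = array();
    foreach ($words as $w) {
        if (strlen($w) < $minLength || in_array($w, $stop)) {
            continue;
        }
        $counts[$w] = isset($counts[$w]) ? $counts[$w] + 1 : 1;
    }
    arsort($counts);   // most frequent first
    return array_slice(array_keys($counts), 0, $limit);
}
```

The resulting words can then be joined into the phrase for the second search.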

So I put that online, it grabbed 4000 backlinks in an hour and overloaded the host server.

Baidu, radian6 and Google had stepped up indexing after I added sitewide tags, and that didn’t show up in analytics. The site got the trackback validations plus the crawlers, and the server went haywire. It is a shared host, and the resources are too limited to run that kind of site on. I put it on hold till I find a solution for the hosting.

Google of course penalised the site with PR0 and dropped the domain from the serp on its main keywords, but in Yahoo it ranks about 20 out of 360 million result pages and in MSN it ranks no 1. I was thinking about adding a translator plugin and see if I can get some traffic from Baidu.

 

Categories
seo, trends
Tags
seo, trends

RedHat Seo : scraper auto-blogging

juust | 26/12/2008

Just give us your endpoint and we’ll take it from there, sparky!

I was going to make one of these tools to scrape Google and conjure a full blog out of nowhere, as a Christmas special: RedHat Seo. The rough sketch has arrived, far from perfect, but it does produce a blog and it doesn’t even look too shabby. I scraped a small batch of posts off of blogs, keeping the links intact and adding a tribute link. I hope they will pardon me for it.

structure

I use three main classes,

  BlogMaker     the application
  Target        the blogs you aim for
  WPContent     the scraped goodies

…and two support classes

  SerpResult    scraped urls
  Custom_RPC    a simple rpc-poster

Target blogs have three text files,

  file              contents                         maintenance
  blog categories   category you post under          manual
  blog tags         tags you list on the blog        manual
  blog urls         urls already used for the blog   system
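The url file is the only one the system maintains itself. How it might work can be sketched as below; this is a stand-in, not the code from the source: the class name UrlRegistry, its methods and the one-url-per-line layout are assumptions.

```php
<?php
// Hypothetical stand-in for the url file handling.
// Assumes a plain text file with one url per line.
class UrlRegistry
{
    private $file;
    private $seen = array();

    public function __construct($file)
    {
        $this->file = $file;
        if (is_readable($file)) {
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            foreach ($lines as $line) {
                $this->seen[trim($line)] = true;
            }
        }
    }

    // True if the url was used before.
    public function has($url)
    {
        return isset($this->seen[trim($url)]);
    }

    // Remembers the url and appends it to the file.
    public function register($url)
    {
        $url = trim($url);
        if ($url === '' || $this->has($url)) {
            return;
        }
        $this->seen[$url] = true;
        file_put_contents($this->file, $url . "\n", FILE_APPEND | LOCK_EX);
    }
}
```

Keeping the urls as array keys makes the lookup cheap even when the file grows to a few thousand lines.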

routine

The BlogMaker class grabs a result list (up to 1000 urls per phrase) from Google, extracts the urls and stores them in SerpResult, scrapes the urls and extracts the entry divs, stores the div entries in the WPContent class (which has some basic functions to sanitize the text), and uses the target blog definitions to post them to the blogs with xml-rpc.
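The step that extracts the entry divs could look something like the sketch below. It uses DOMDocument and XPath, which may well differ from what the full source does; the function name extract_divs and the sample page are made up, and the class name is assumed to be a simple word without quotes.

```php
<?php
// Hypothetical helper: returns the inner html of every div whose
// class attribute contains the given class name.
function extract_divs($html, $class)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world html is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';
    $found = array();
    foreach ($xpath->query($query) as $div) {
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $found[] = trim($inner);
    }
    return $found;
}

// Invented sample page.
$page = '<html><body><div class="post entry"><p>Hello</p></div>'
      . '<div class="sidebar">skip</div></body></html>';
$entries = extract_divs($page, 'entry');
if (count($entries) < 1) {
    $entries = extract_divs($page, 'post');   // fall back from "entry" to "post"
}
```

A DOM parser copes with nested divs, which is where plain string matching on the closing tag tends to cut the content short.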

usage

My highlighter tends to mess up text with div markers in it, so copying off the blog may not work;
the full text source (about 500 lines) is over here. Underneath I’ll list the main program loop :

  //make main instance
  $Blog = new BlogMaker("keyword");

  //define a target blog, you can define multiple blogs and refer with code
  //then add rpc-url, password and user
  //and for every target blog three text-files
  $T = $Blog->AddTarget(
      'blogcode',
      'http://my.blog.com/xmlrpc.php',
      'password',
      'user',
      'keyword.categories.txt',
      'keyword.tags.txt',
      'keyword.urls.txt'
  );

  //read the tags, cats and url text files stored on the server
  //all retrieved urls are tested, if the target blog already has that
  //scraped url, it is discarded.
  $T->CSV_GetTags();
  $T->List_GetCats();
  $T->ReadURL();

  //grab the google result list
  //use params (pages, keywords) to specify search
  $Blog->GoogleResults();

  $a = 0;
  foreach ($Blog->Results as $BlogUrl) {
      $a++;
      echo $BlogUrl->url;

      //see if the url isnt used yet
      if ($T->checkURL(trim($BlogUrl->url)) != true) {
          echo '…checking ';
          flush();

          //if not used, get the source
          $BlogUrl->scrape();

          //check for divs marked "entry", if they arent there, check "post"
          //some blogs use other indications for the content
          //but entry and post cover 40%
          $entries = $BlogUrl->get_entries();
          if (count($entries) < 1) {
              echo 'no entries…';
              flush();
              $entries = $BlogUrl->get_posts();
              if (count($entries) < 1) {
                  echo 'no posts either…';
                  //if no entry-post div, mark url as done
                  $T->RegisterURL($BlogUrl->url);
              }
          }

          //in the get_entries/get_post function the fragments are stored
          //as wpcontent
          $ct = 0;
          foreach ($BlogUrl->WpContentPieces as $WpContent) {
              $ct++;

              if ($WpContent->judge(2000, 200, 5)) {
                  $WpContent->tribute();  //add tribute link
                  $T->settags($WpContent->divcontent); //add tags
                  $T->postCustomRPC($WpContent->title, $WpContent->divcontent, 1); //1=publish, 0=draft
                  $T->RegisterURL($WpContent->url);  //register use of url
                  sleep(20);  //20 seconds break, for sitemapping
              }
          }
      }
  }

notes

  • xml-rpc needs to be activated explicitly on the wordpress dashboard under settings/writing.
  • categories must be present in the blog
  • url file must be writeable by the server (777)

It seems wordpress builds the sitemap as a background process: the standard google xml sitemap plugin will attempt to build it in the cache (which takes anywhere between 2 and 10 seconds), and apart from building a sitemap the posts also get pinged around. Giving the install 10 to 20 seconds between posts allows all the hooked-in functions to complete.

period

That’s about all,
consider it gpl, I added some comments in the source but I will not develop this any further. A mysql backed blogfarm tool (euphemistically called ‘publishing tool’) is more interesting, besides, I am off to the wharves to do some painting.

if you use it, send some feedback,
merry christmas dogheads

Categories
google, seo, seo tips and tricks, tool, wordpress, xml-rpc
Tags
google, scrape, seo, seo tips and tricks, tool, wordpress, xml-rpc