juust ~ php oddities


google trends III

juust | 25/12/2008

How to get the URLs and snippets from the Google Trends details page. The news articles on the details page are loaded with an ‘Ajax’ call; they are not sent to the browser in the html source, so there is no easy way to scrape them.

The blog articles are pretty straightforward. First the ugly fast way:

  $mytitle = 'manuel benitez';
  $mydate = ''; // e.g. 2008-12-24
  // download the details page for this trend
  $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date='.urlencode($mydate).'&sa=X');
  // cut out everything between the results box and the 'more results' block
  $start = strpos($html, '<div class="gsc-resultsbox-visible">');
  $end = strpos($html, '<div class="gsc-trailing-more-results">');
  $content = substr($html, $start, $end - $start);
  echo $content;
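
One weak spot in the fast way: strpos() returns false when a marker is not on the page (for instance after Google changes its markup), and substr() then cuts the wrong part without complaining. A minimal sketch of a guarded cut follows. The helper name cut_between and the sample html string are made up for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper: returns the part of $html from $startMarker up to
// (not including) $endMarker, or null when either marker is missing.
function cut_between($html, $startMarker, $endMarker) {
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    // only look for the end marker after the start marker
    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// made-up sample standing in for the downloaded page
$sample = '<p>head</p>'
        . '<div class="gsc-resultsbox-visible"><div class="gs-title">hit</div></div>'
        . '<div class="gsc-trailing-more-results">more</div>';

$cut = cut_between(
    $sample,
    '<div class="gsc-resultsbox-visible">',
    '<div class="gsc-trailing-more-results">'
);

echo $cut === null ? 'markers not found' : $cut;
```

Returning null instead of an empty string lets the caller tell "markers missing" apart from "markers found, nothing in between".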

That returns the blog snippets, ugly. The other way is regular pattern matching: you can grab the divs that each content item has, marked with

  • div class="gs-title"
  • div class="gs-relativePublishedDate"
  • div class="gs-snippet"
  • div class="gs-visibleUrl"

from the html source and organize them as a “Content” array, after which you can list the content items with your own markup or store them in a database.
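
To see what the pattern match hands back, here is a small sketch on a made-up html fragment. The sample markup and the helper name grab_divs are assumptions for illustration only; the real Google page may look different. preg_match_all() puts the full matches in index 0 of its result, one entry per div, and the items are then paired up by position.

```php
<?php
// Hypothetical helper: collect every <div class="$class">...</div> from $html.
// The match is non-greedy, so it stops at the first closing div
// (it will not cope with divs nested inside the one we grab).
function grab_divs($html, $class) {
    $pattern = '!<div class="' . preg_quote($class, '!') . '">.*?</div>!si';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// made-up fragment with two content items
$sample = '<div class="gs-title"><a href="http://example.com/a" target="_blank">First</a></div>'
        . '<div class="gs-snippet">first snippet</div>'
        . '<div class="gs-title"><a href="http://example.com/b" target="_blank">Second</a></div>'
        . '<div class="gs-snippet">second snippet</div>';

$titles   = grab_divs($sample, 'gs-title');
$snippets = grab_divs($sample, 'gs-snippet');

// pair them up by position: item 0 gets title 0 and snippet 0, and so on
$items = array();
foreach ($titles as $i => $title) {
    $items[] = array(
        'title'   => strip_tags($title),
        'snippet' => isset($snippets[$i]) ? strip_tags($snippets[$i]) : '',
    );
}

print_r($items);
```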

  // I assume $mytitle is taken from the $_GET array.

  // class 'Content' with its members
  class Content {
   public $id;
   public $title;
   public $pubdate;
   public $snippet;
   public $url;

   public function __construct($id) {
    $this->id = $id;
   }
  }

  // grab the source from the google page
  $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date=&sa=X');

  // cut out the part I want (fall back to the whole page if a marker is missing)
  $start = strpos($html, '<div class="gsc-resultsbox-visible">');
  $end = strpos($html, '<div class="gsc-trailing-more-results">');
  $content = ($start !== false && $end !== false) ? substr($html, $start, $end - $start) : $html;

  // grab the divs that contain title, publish date, snippet and url with regular pattern match
  preg_match_all('!<div class="gs-title">.*?</div>!si', $content, $titles);
  preg_match_all('!<div class="gs-relativePublishedDate">.*?</div>!si', $content, $pubDates);
  preg_match_all('!<div class="gs-snippet">.*?</div>!si', $content, $snippets);
  preg_match_all('!<div class="gs-visibleUrl">.*?</div>!si', $content, $urls);

  $Contents = array();

  // organize them under Content

  $count = 0;
  foreach ($titles[0] as $title) {
   // make a new instance of Content
   $Contents[] = new Content($count);
   // add title
   $Contents[$count]->title = $title;
   $count++;
  }

  $count = 0;
  foreach ($pubDates[0] as $pubDate) {
   // add publishing date (contains some linebreak, remove it with strip_tags)
   $Contents[$count]->pubdate = strip_tags($pubDate);
   $count++;
  }

  $count = 0;
  foreach ($snippets[0] as $snippet) {
   // add snippet
   $Contents[$count]->snippet = $snippet;
   $count++;
  }

  $count = 0;
  foreach ($urls[0] as $url) {
   // add display url
   $Contents[$count]->url = $url;
   $count++;
  }

  // leave $count as is, the number of content-items with a 0-base array
  // add rel=nofollow to links to prevent pagerank assignment to blogs
  for ($ct = 0; $ct < $count; $ct++) {
   $Contents[$ct]->url = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->url);
   $Contents[$ct]->title = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->title);
  }

  // it's complete, list all content-items with some markup
  for ($ct = 0; $ct < $count; $ct++) {
   echo '<h3>'.$Contents[$ct]->title.'</h3>';
   echo '<p><strong>'.$Contents[$ct]->pubdate.'</strong>:<em>'.$Contents[$ct]->snippet.'</em></p>';
   echo $Contents[$ct]->url.'<br />';
  }

It ain’t perfect, but it works. The highlighter I use gets a bit confused by the preg_match_all statements containing unclosed divs, so copying the code off the blog may not work; a text file with the source code is on trends.trismegistos.net. I added that snippet to trendinfo.php, works fine.
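
Because the patterns stop at the first closing div, nested markup can trip them up. One alternative, which is not what trendinfo.php uses, is PHP's DOM extension: load the html into a DOMDocument and select the divs with DOMXPath. The sketch below runs on a made-up fragment (the real page markup may differ) and sets rel=nofollow on the title links as an attribute instead of with a string replace.

```php
<?php
// Alternative sketch: DOMDocument + DOMXPath instead of regular expressions.
// Made-up fragment, standing in for the part cut out of the Google page.
$sample = '<div class="gsc-resultsbox-visible">'
        . '<div class="gs-title"><a href="http://example.com/a" target="_blank">First post</a></div>'
        . '<div class="gs-snippet">Some <b>snippet</b> text</div>'
        . '</div>';

// real-world html is rarely valid, so collect parser warnings quietly
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// title links: add rel=nofollow and keep href, text and the rewritten markup
$links = array();
foreach ($xpath->query('//div[@class="gs-title"]/a') as $a) {
    $a->setAttribute('rel', 'nofollow');
    $links[] = array(
        'href' => $a->getAttribute('href'),
        'text' => $a->textContent,
        'html' => $doc->saveHTML($a),
    );
}

// snippets: textContent drops the inner tags and keeps the text
$snippets = array();
foreach ($xpath->query('//div[@class="gs-snippet"]') as $div) {
    $snippets[] = $div->textContent;
}

print_r($links);
print_r($snippets);
```

The XPath test @class="gs-title" only matches when the class attribute is exactly that string; a div carrying several classes would need a contains() test instead.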


Categories: google
Tags: google, scrape, trends

4 Responses to “google trends III”

  1. SEO underWorld » Blog Archive » 9 Epic SEO Scripts says:
    28/12/2008 at 6:58 pm

    [...] Juust’s Google Trends Scraper – Juust is a new addition to my feed reader, but since he’s been there he has published a ton of good google trends scraping code. [...]

  2. Herman says:
    17/10/2009 at 4:34 am

    Hi mate,

    Meet you again here :)

    This code work perfectly now but maybe you can add some features like keyword search result form for searching keyword trends we need and then display result like this:

    hxxp:example.com/keyword+results.html

    or if in keyword typed “google trends today”

    will become:

    hxxp:example.com/google+trends+today.html

    hoped you can make search form code if do you have time and share it for us…

    My Regards

  3. jorge says:
    27/11/2009 at 9:16 am

    Great script and thanks for sharing it. Not really sure how efficient this script is when ran on a large amount of keywords, but its a start.

  4. storage stockport says:
    11/03/2010 at 5:18 am

    Excellent ideas here, have emailed my mum so expect a big reply!!
