juust ~ php oddities


google trends III

juust | 25/12/2008

How to get the URLs and snippets from the Google Trends details page. The news articles on the details page are loaded with an ‘Ajax’ call; they are not sent to the browser in the HTML source, so there is no easy way to scrape them.

The blog articles are pretty straightforward. First the ugly fast way :

  $mytitle = 'manuel benitez';
  $mydate = ''; // e.g. 2008-12-24, leave blank for the current day
  $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date='.urlencode($mydate).'&sa=X');
  // cut out everything between the results box and the trailing "more results" div
  $start = strpos($html, '<div class="gsc-resultsbox-visible">');
  $end = strpos($html, '<div class="gsc-trailing-more-results">');
  if ($start !== false && $end !== false) {
    $content = substr($html, $start, $end - $start);
    echo $content;
  }

That returns the blog snippets, ugly. The other way is regular expression matching : you can grab the divs that each content item has, marked with

  • div class="gs-title"
  • div class="gs-relativePublishedDate"
  • div class="gs-snippet"
  • div class="gs-visibleUrl"

from the HTML source and organize them as an array of ‘Content’ objects, after which you can list the content items with your own markup or store them in a database.

  // I assume $mytitle is taken from the $_GET array.

  // class 'Content' with its members
  class Content {
    public $id;
    public $title;
    public $pubdate;
    public $snippet;
    public $url;

    public function __construct($id) {
      $this->id = $id;
    }
  }

  // grab the source from the google page
  $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date=&sa=X');

  // cut out the part I want
  $start = strpos($html, '<div class="gsc-resultsbox-visible">');
  $end = strpos($html, '<div class="gsc-trailing-more-results">');
  $content = substr($html, $start, $end - $start);

  // grab the divs that contain title, publish date, snippet and url
  // with a regular expression match on the part I cut out
  preg_match_all('!<div class="gs-title">.*?</div>!si', $content, $titles);
  preg_match_all('!<div class="gs-relativePublishedDate">.*?</div>!si', $content, $pubDates);
  preg_match_all('!<div class="gs-snippet">.*?</div>!si', $content, $snippets);
  preg_match_all('!<div class="gs-visibleUrl">.*?</div>!si', $content, $urls);

  $Contents = array();

  // organize them under Content
  $count = 0;
  foreach ($titles[0] as $title) {
    // make a new instance of Content and add the title
    $Contents[] = new Content($count);
    $Contents[$count]->title = $title;
    $count++;
  }

  $count = 0;
  foreach ($pubDates[0] as $pubDate) {
    // add publishing date (contains some linebreak, remove it with strip_tags)
    $Contents[$count]->pubdate = strip_tags($pubDate);
    $count++;
  }

  $count = 0;
  foreach ($snippets[0] as $snippet) {
    // add snippet
    $Contents[$count]->snippet = $snippet;
    $count++;
  }

  $count = 0;
  foreach ($urls[0] as $url) {
    // add display url
    $Contents[$count]->url = $url;
    $count++;
  }

  // leave $count as is, the number of content items in the 0-based array
  // add rel=nofollow to links to prevent pagerank assignment to blogs
  for ($ct = 0; $ct < $count; $ct++) {
    $Contents[$ct]->url = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->url);
    $Contents[$ct]->title = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->title);
  }

  // it's complete, list all content items with some markup
  for ($ct = 0; $ct < $count; $ct++) {
    echo '<h3>'.$Contents[$ct]->title.'</h3>';
    echo '<p><strong>'.$Contents[$ct]->pubdate.'</strong>:<em>'.$Contents[$ct]->snippet.'</em></p>';
    echo $Contents[$ct]->url.'<br />';
  }

It ain’t perfect, but it works. The highlighter I use gets a bit confused by the preg_match_all statements containing unclosed divs, so copying the code from the blog may not work; a text file with the source code is on trends.trismegistos.net. I added that snippet to trendinfo.php, works fine.
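
For comparison, a minimal sketch of the same extraction with PHP's DOMDocument and DOMXPath instead of regular expressions. It is a swapped-in technique, not the code used on the site. The sample fragment, the example.com link and the function names texts_by_class and extract_items are made up for illustration, and the real Google markup may differ. This version only keeps the text of each div, not the links inside it.

```php
<?php
// Hypothetical sample fragment; the class names come from the post, the content is made up.
$sample = '<div class="gsc-resultsbox-visible">'
    . '<div class="gs-title"><a href="http://example.com/post" target="_blank">Some post</a></div>'
    . '<div class="gs-relativePublishedDate">2 hours ago</div>'
    . '<div class="gs-snippet">A short snippet</div>'
    . '<div class="gs-visibleUrl">example.com</div>'
    . '</div>';

// Return the trimmed text of every div whose class attribute equals $class.
function texts_by_class(DOMXPath $xpath, $class) {
    $out = array();
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Parse the html and combine the four div types into one array per content item.
function extract_items($html) {
    libxml_use_internal_errors(true);   // keep parser warnings about sloppy html quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    $titles   = texts_by_class($xpath, 'gs-title');
    $pubdates = texts_by_class($xpath, 'gs-relativePublishedDate');
    $snippets = texts_by_class($xpath, 'gs-snippet');
    $urls     = texts_by_class($xpath, 'gs-visibleUrl');

    $items = array();
    foreach ($titles as $i => $title) {
        $items[] = array(
            'title'   => $title,
            'pubdate' => isset($pubdates[$i]) ? $pubdates[$i] : '',
            'snippet' => isset($snippets[$i]) ? $snippets[$i] : '',
            'url'     => isset($urls[$i]) ? $urls[$i] : '',
        );
    }
    return $items;
}

$items = extract_items($sample);
```

Because the parser closes the divs itself, this approach does not depend on every closing tag being matched by a pattern.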

Categories
google
Tags
google, scrape, trends

google trends II

juust | 22/12/2008

I wanted to reply to a question elsewhere on the site, but a ‘comment’ box isn’t fit for it so I’ll put the reply here. The question was about creating ‘search engine friendly’ descriptive URLs based on keywords from the Google Trends atom feed, with pages listing a graph of the trend.

I hacked a quick example together on a subdomain over at trends.trismegistos.net, just to be sure it works.

You can get a site to list http://domain.com/trend_title.html type URLs by using mod_rewrite, an Apache module.

In the server directory of the application you can use an .htaccess file to set rules for file access in that folder and the folders below it. When the server gets a request from a browser or another server, it applies any rewriting rules you define in .htaccess to that request.

I tried this one :

  <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*).html /trendinfo.php?title=$1
  </IfModule>

RewriteEngine On
switches the rewrite engine on.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

tell the Apache server that the rule that follows only applies to requests that do not match an existing file (-f) or directory (-d). If the requested filename exists on the server's file system, the server dishes out that file, otherwise it will try to apply the RewriteRule. Applying the rule generates a new internal request; if that returns anything, the server dishes that out, otherwise it returns an HTTP 404 ‘file not found’.

The actual url rewrite rule is :
RewriteRule ^(.*).html /trendinfo.php?title=$1
which means :

  • if any filename is requested that satisfies the mask ^(.*).html then
  • take everything before .html
  • add that as variable $1 to trendinfo.php?title=$1
  • see if it sticks

If the browser requests http://domain.com/bob+bowersox.html, the server will establish that it is not a file or directory on the server, and test the available rules. When it notices that the requested file ends with .html, it applies the rewrite rule and tries to access http://domain.com/trendinfo.php?title=bob+bowersox.

A browsing user does not notice a thing.
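
To see what the mask does to a request, here is a small PHP sketch that mimics the effect of the rule on a path. Apache does the real work; the function name rewrite_to_trendinfo is made up for illustration. Note that the pattern here is a bit stricter than the rule above: it escapes the dot and anchors the end, so only names that really end in .html match.

```php
<?php
// Mimic the rewrite rule: return the internal target for a requested path,
// or null when the rule would not apply.
function rewrite_to_trendinfo($path) {
    // in .htaccess the pattern is matched against the path without the leading slash
    $path = ltrim($path, '/');
    if (preg_match('/^(.*)\.html$/', $path, $m)) {
        // $m[1] plays the role of $1 in the RewriteRule
        return '/trendinfo.php?title=' . $m[1];
    }
    return null;
}

$target = rewrite_to_trendinfo('/bob+bowersox.html');
$untouched = rewrite_to_trendinfo('/style.css');
```

The sketch skips the two RewriteCond checks, so it says nothing about whether the requested file already exists on disk.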

In trendinfo.php I wrote some code to handle the ‘new’ request :

  if (!isset($_REQUEST['title'])) {
    // if there is no $1, added as title, send a 404 "file not found" message
    header('HTTP/1.0 404 Not Found');
    echo 'the emptiness…';
  } else {
    // get the title from the request
    $mytitle = $_REQUEST['title'];
    // put the google trends graph url together
    $graphurl = 'http://www.google.com/trends/viz?hl=&q=';
    $graphurl .= urlencode($mytitle);
    $graphurl .= '&date=';   // leave date blank to get the current graph
    $graphurl .= '&graph=hot_img&sa=X';
    // escape the url before it goes into the html attribute
    echo '<img class="hotGraph" width="280" height="190" src="'.htmlspecialchars($graphurl, ENT_QUOTES, 'UTF-8').'" />';
  }

…that outputs the Google trend graph on the url http://domain.com/bob+bowersox.html
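
The graph url can also be put together with http_build_query, which takes care of the encoding. A minimal sketch, assuming the same viz parameters as above; the function name trend_graph_url is made up for illustration and the url format is the one from this post, not a documented API.

```php
<?php
// Build the trends graph url for a title; an empty date asks for the current graph.
function trend_graph_url($title, $date = '') {
    $params = array(
        'hl'    => '',
        'q'     => $title,
        'date'  => $date,
        'graph' => 'hot_img',
        'sa'    => 'X',
    );
    // spaces in the title become '+', the separator is a plain '&'
    return 'http://www.google.com/trends/viz?' . http_build_query($params, '', '&');
}

$url = trend_graph_url('bob bowersox');
// escape the '&' characters when the url goes into an html attribute
$img = '<img class="hotGraph" width="280" height="190" src="'
     . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" />';
```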

I zipped the trends.trismegistos.net program files, but that might be a bit over the top. The download file contains a class that relies on a MySQL table being filled every hour with new trends: cron.php, run as a cron job, parses and stores the atom feed of Google Trends, and index.php lists it as a cross-table spanning the past 24 hours.

You can also put this in index.php :

  $feed = simplexml_load_file('http://www.google.com/trends/hottrends/atom/hourly');
  $children = $feed->children('http://www.w3.org/2005/Atom');
  $parts = $children->entry;
  foreach ($parts as $entry) {
    $details = $entry->children('http://www.w3.org/2005/Atom');
    // the entry content holds html, load it into a DOM document
    $dom = new DOMDocument();
    $html = (string) $details->content;
    @$dom->loadHTML($html);
    // list every anchor as a link to its own descriptive url
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $anchor) {
      $urltext = $anchor->nodeValue;
      echo '<a href="'.urlencode($urltext).'.html" target="_blank">'.htmlspecialchars($urltext).'</a> ';
    }
  }
  unset($dom, $anchors, $parts, $feed);
That lists the current 100 google trends with a link. If you use the .htaccess rewrite rules, the server reroutes all the links to trendinfo.php with descriptive urls.
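
To try the parsing without hitting Google, the same steps can be run on a string. A minimal sketch: the sample feed, the example.com links and the function name trend_links are made up for illustration, and the real feed's markup may differ.

```php
<?php
// Hypothetical sample feed with one entry whose content is an escaped html list.
$atom = '<feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Hot Trends</title>'
      . '<content type="html">&lt;ol&gt;'
      . '&lt;li&gt;&lt;a href="http://example.com/1"&gt;bob bowersox&lt;/a&gt;&lt;/li&gt;'
      . '&lt;li&gt;&lt;a href="http://example.com/2"&gt;manuel benitez&lt;/a&gt;&lt;/li&gt;'
      . '&lt;/ol&gt;</content></entry></feed>';

// Return one array per trend: the descriptive .html link and the link text.
function trend_links($atomXml) {
    $ns = 'http://www.w3.org/2005/Atom';
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $feed = simplexml_load_string($atomXml);
    if ($feed === false) {
        return array();
    }
    $links = array();
    foreach ($feed->children($ns)->entry as $entry) {
        // casting to string gives the content with the entities decoded
        $html = (string) $entry->children($ns)->content;
        if ($html === '') {
            continue;
        }
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $text = trim($anchor->nodeValue);
            $links[] = array('href' => urlencode($text) . '.html', 'text' => $text);
        }
    }
    libxml_clear_errors();
    return $links;
}

$links = trend_links($atom);
```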

I hope that helps.

Categories
google, php, seo
Tags
google, php, seo, trends

google trends crosstab

juust | 22/11/2008

I was playing with Google Trends using another blog. Of the seven or eight posts I tried, five made it to the details page, which gave me some traffic. I am too lazy to follow 100 entries during the day, so I want a simple report on how trends developed over the past day, like this :

[image: google trends watch]

Now how do I get that done ?

First I make a cronjob that runs a routine to pull in the Google Trends xml-feed every hour and store it in a database. I add a string for the period (year-month-day-hour).

That gives me a table like this one :

DATE        SEARCHID  POS
2008120101  baboon    78
2008120101  monkey    13
2008120102  baboon    98
2008120102  monkey    5
2008120103  monkey    3

I want an output

        2008120101  2008120102  2008120103
baboon  78          98
monkey  13          5           3

The way to do that is a crosstable query, also called a pivot table.
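
Before moving to SQL, the idea can be tried on the sample rows in plain PHP. A minimal sketch of the pivot, done in memory rather than in the database; the function name pivot_rows and the array layout are made up for illustration.

```php
<?php
// The sample rows from the table above.
$rows = array(
    array('date' => '2008120101', 'searchid' => 'baboon', 'pos' => 78),
    array('date' => '2008120101', 'searchid' => 'monkey', 'pos' => 13),
    array('date' => '2008120102', 'searchid' => 'baboon', 'pos' => 98),
    array('date' => '2008120102', 'searchid' => 'monkey', 'pos' => 5),
    array('date' => '2008120103', 'searchid' => 'monkey', 'pos' => 3),
);

// Turn (date, searchid, pos) rows into one line per phrase with one value per period.
function pivot_rows(array $rows) {
    $periods = array();
    $byPhrase = array();
    foreach ($rows as $r) {
        $periods[$r['date']] = true;
        $byPhrase[$r['searchid']][$r['date']] = $r['pos'];
    }
    // numeric-looking array keys come back as integers, so cast them to strings again
    $periods = array_map('strval', array_keys($periods));
    sort($periods);

    $table = array();
    foreach ($byPhrase as $phrase => $positions) {
        $line = array();
        foreach ($periods as $p) {
            // null marks a period in which the phrase was not listed
            $line[] = isset($positions[$p]) ? $positions[$p] : null;
        }
        $table[$phrase] = $line;
    }
    return array('periods' => $periods, 'rows' => $table);
}

$pivot = pivot_rows($rows);
```

With the sample data, baboon gets 78, 98 and an empty third period, and monkey gets 13, 5 and 3, which is the output table above.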

MySQL pivot table

I have to make a routine to turn that base table into a table with the dates as columns and, per phrase, the position in the correct column. I want a complete row, so I take the highest period from the table and the lowest, and from those interpolate the other periods.

  function make_periods($begin, $end) {
    // periods are strings like 2008120101 (year, month, day, hour)
    $BeginTime = mktime(substr($begin, 8, 2), 0, 0, substr($begin, 4, 2), substr($begin, 6, 2), substr($begin, 0, 4));
    $EndTime = mktime(substr($end, 8, 2), 0, 0, substr($end, 4, 2), substr($end, 6, 2), substr($end, 0, 4));
    // divide the difference by 60 minutes * 60 seconds
    $periods = ($EndTime - $BeginTime) / 3600;
    // make a row of hour-periods with "+N hour"
    // (date() instead of strftime(), which is deprecated since PHP 8.1)
    $myperiods = array();
    for ($i = 0; $i < ($periods + 1); $i++) {
      $myperiods[] = date('YmdH', strtotime("+$i hour", $BeginTime));
    }
    // return an array with all periods
    // that are to be the column headers in the crosstable
    return $myperiods;
  }
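
The same list of periods can be built with DateTimeImmutable. A minimal sketch under one loud assumption: the period strings are treated as UTC, which sidesteps daylight saving jumps, while the routine above works in the server's local time. The function name hour_periods is made up for illustration.

```php
<?php
// Return every hour period from $begin to $end inclusive, as YmdH strings.
// ASSUMPTION: periods are interpreted as UTC.
function hour_periods($begin, $end) {
    $utc = new DateTimeZone('UTC');
    // the leading '!' resets minutes and seconds to zero instead of "now"
    $cur  = DateTimeImmutable::createFromFormat('!YmdH', $begin, $utc);
    $last = DateTimeImmutable::createFromFormat('!YmdH', $end, $utc);
    if ($cur === false || $last === false) {
        return array();   // one of the strings is not a valid period
    }
    $periods = array();
    while ($cur <= $last) {
        $periods[] = $cur->format('YmdH');
        $cur = $cur->modify('+1 hour');
    }
    return $periods;
}

$periods = hour_periods('2008120123', '2008120201');
```

The example crosses midnight on purpose, to show that the day rolls over along with the hour.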

At some point I have to clean up the table, and I want the data to fit in one page, so I will use 24 periods :

  $plength = 24;
  if (count($dates) > $plength) {
    // keep only the last 24 periods
    $usedates = array_slice($dates, -$plength);
  } else {
    // fewer than 24 periods available, use them all
    $usedates = $dates;
  }

... and delete all older records :

  // make the periods
  $dates = make_periods($begin, $end);

  // use the $dates array, see if I have more than 24 periods
  if (count($dates) > 24) {
    // get period number-last-minus-24
    // delete all records before that period
    // (the mysql_* functions were removed in PHP 7, use mysqli or PDO there)
    mysql_query("DELETE FROM `serp_trends` WHERE `date` < '".$dates[count($dates) - 24]."'", $link);
  }

...that keeps the table small enough for quick querying. I have 24 periods to make columns, how do I get the position as value in each column ? I read the 'wizard' article on mysql pivot tables and that sounded difficult.

I don't get it, but I got it running :

  • group the records by phrase ("searchid")
  • use the array of periods as columns
  • see if the phrase has a position in a specific period,
    • if so, take the position as value (it only has one per period)
    • otherwise take 0 as value
  • sum the values
  • name the column "T"+period

  $usedates = $dates;
  $CROSTAB = "SELECT `searchid`, AVG(`pos`) AS AP ";
  for ($j = 0; $j < count($usedates); $j++) {
    $CROSTAB .= ", SUM(IF(date='".$usedates[$j]."', pos, 0)) AS T".$usedates[$j];
  }
  $CROSTAB .= " FROM `serp_trends` GROUP BY `searchid` ORDER BY AP ASC";

I take the average position from the grouped records, order it ascending and output that as a table: first the two columns average position and phrase (searchid), then the periods as column names.
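
Since the period strings end up in the query as column names, it does no harm to check them first. A minimal sketch of the query builder as a function that skips anything that is not a ten-digit period; the function name build_crosstab_sql is made up for illustration, the table and column names are the ones used above.

```php
<?php
// Build the crosstab query for a list of YmdH period strings.
function build_crosstab_sql(array $periods) {
    $sql = 'SELECT `searchid`, AVG(`pos`) AS AP';
    foreach ($periods as $p) {
        // only accept exactly ten digits, so nothing else can leak into the query
        if (!preg_match('/^\d{10}\z/', (string) $p)) {
            continue;
        }
        // one column per period: the position if the row is in that period, else 0
        $sql .= ", SUM(IF(`date`='$p', `pos`, 0)) AS T$p";
    }
    return $sql . ' FROM `serp_trends` GROUP BY `searchid` ORDER BY AP ASC';
}

$sql = build_crosstab_sql(array('2008120101', 'bad value', '2008120102'));
```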

  $link = connect_data();

  // (the mysql_* functions were removed in PHP 7, use mysqli or PDO there)
  $result = mysql_query($CROSTAB, $link) or die(mysql_error());

  // add the zebra class marker for mootools
  $CTB = '<table class="zebra" cellpadding="0" cellspacing="0"><tbody>';

  // make the header row, showing only the hour of each period
  $CTB .= '<tr><td>avg</td><td>search</td>';
  for ($j = 0; $j < count($usedates); $j++) {
    $CTB .= '<td>'.substr($usedates[$j], -2).'</td>';
  }
  $CTB .= '</tr>';

  // output the crosstable query result
  // which has the same format
  // average position, searchid, period(01)-period(24).
  while ($row = mysql_fetch_assoc($result)) {
    $CTB .= '<tr><td>'.number_format($row['AP'], 0).'</td><td>'.$row['searchid'].'</td>';
    for ($j = 0; $j < count($usedates); $j++) {
      $v = $row['T'.$usedates[$j]];
      if ($v == '0') {
        // no position in this period, leave the cell empty
        $CTB .= '<td></td>';
      } else {
        $CTB .= '<td>'.$v.'</td>';
      }
    }
    $CTB .= '</tr>';
  }
  $CTB .= '</tbody></table>';

When I echo $CTB, the crosstable, this comes out :

[image: google trends watch]

Seeing rows per phrase means I can check a lot easier which trends are running, rising, dropping, or even returning. And that was what I wanted.

I zipped the files for download.

Categories
google, php
Tags
google, mysql, php