juust ~ php oddities

RedHat Seo : scraper auto-blogging

juust | 26/12/2008

Just give us your endpoint and we’ll take it from there, sparky!

I was going to make one of these tools that scrape Google and conjure a full blog out of nowhere, as a Christmas special: RedHat Seo. The rough sketch has arrived. It is far from perfect, but it does produce a blog, and the result doesn’t even look too shabby. I scraped a small batch of posts off blogs, keeping the links intact and adding a tribute link to each. I hope the authors will pardon me for it.

structure

I use three main classes:

  • BlogMaker: the application
  • Target: the blogs you aim for
  • WPContent: the scraped goodies

…and two support classes:

  • SerpResult: scraped urls
  • Custom_RPC: a simple rpc-poster

Each target blog has three text files:

file             contents                         maintenance
blog categories  categories you post under        manual
blog tags        tags you list on the blog        manual
blog urls        urls already used for the blog   system
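
The url file is the only one the system maintains itself. A minimal sketch of how such a file could be read and appended to, assuming one url per line; the class name UrlLog and its methods are hypothetical and are not the author's Target class:

```php
<?php
// Hypothetical helper, not the author's Target class: keeps one url per line
// in a plain text file, mirroring the "blog urls" file described above.
class UrlLog
{
    private $file;
    private $seen = array();

    public function __construct($file)
    {
        $this->file = $file;
        if (is_file($file)) {
            // FILE_IGNORE_NEW_LINES strips the trailing newline of each entry
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            foreach ($lines as $line) {
                $this->seen[trim($line)] = true;
            }
        }
    }

    // true when the url was registered before
    public function has($url)
    {
        return isset($this->seen[trim($url)]);
    }

    // append the url to the file unless it is empty or already known
    public function register($url)
    {
        $url = trim($url);
        if ($url === '' || $this->has($url)) {
            return false;
        }
        $this->seen[$url] = true;
        file_put_contents($this->file, $url . "\n", FILE_APPEND | LOCK_EX);
        return true;
    }
}

// usage, with a temporary file standing in for keyword.urls.txt
$path = tempnam(sys_get_temp_dir(), 'urls');
$log = new UrlLog($path);
$log->register('http://example.com/post-1');
echo $log->has('http://example.com/post-1') ? "used\n" : "new\n";
```

Keeping the urls as array keys makes the lookup cheap even when the file grows to a few thousand lines.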

routine

The BlogMaker class grabs a result list (up to 1000 urls per phrase) from Google, extracts the urls and stores them as SerpResult objects. It then scrapes each url, extracts the entry divs and stores them as WPContent objects (a class with some basic functions to sanitize the text). Finally it uses the Target definitions to post the content to the blogs over xml-rpc.
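
The "extract the entry divs" step could be sketched as follows. This is an illustration using DOMDocument and DOMXPath, not the author's get_entries() implementation; the function name extract_divs_by_class and the sample html are assumptions:

```php
<?php
// Hypothetical function, not the author's get_entries(): returns the text
// of every div whose class attribute contains the given class name.
function extract_divs_by_class($html, $class)
{
    $dom = new DOMDocument();
    // real-world blog html is rarely valid, so silence the parser warnings
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // match the class as a whole word inside the class attribute
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $found = array();
    foreach ($xpath->query($query) as $div) {
        $found[] = trim($div->textContent);
    }
    return $found;
}

// made-up page source with two content divs and one sidebar div
$sample = '<html><body><div class="entry">first post</div>'
        . '<div class="sidebar">links</div>'
        . '<div class="post entry">second post</div></body></html>';

$entries = extract_divs_by_class($sample, 'entry');
// fall back to "post" when no "entry" divs are found, as the routine does
if (count($entries) < 1) {
    $entries = extract_divs_by_class($sample, 'post');
}
print_r($entries);
```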

usage

My highlighter tends to mess up text with div markers in it, so copying the code off the blog may not work;
the full text source (about 500 lines) is over here. Below is the main program loop:

    //make main instance
    $Blog = new BlogMaker("keyword");

    //define a target blog; you can define multiple blogs and refer to each by its code
    //then add rpc-url, password and user
    //and for every target blog three text files
    $T = $Blog->AddTarget(
        'blogcode',
        'http://my.blog.com/xmlrpc.php',
        'password',
        'user',
        'keyword.categories.txt',
        'keyword.tags.txt',
        'keyword.urls.txt'
    );

    //read the tags, cats and url text files stored on the server
    //all retrieved urls are tested; if the target blog already has that
    //scraped url, it is discarded
    $T->CSV_GetTags();
    $T->List_GetCats();
    $T->ReadURL();

    //grab the google result list
    //use params (pages, keywords) to specify search
    $Blog->GoogleResults();

    $a = 0;
    foreach ($Blog->Results as $BlogUrl) {
        $a++;
        echo $BlogUrl->url;

        //see if the url isn't used yet
        if ($T->checkURL(trim($BlogUrl->url)) != true) {
            echo '…checking ';
            flush();

            //if not used, get the source
            $BlogUrl->scrape();

            //check for divs marked "entry"; if they aren't there, check "post"
            //some blogs use other indications for the content
            //but entry and post cover 40%
            $entries = $BlogUrl->get_entries();
            if (count($entries) < 1) {
                echo 'no entries…';
                flush();
                $entries = $BlogUrl->get_posts();
                if (count($entries) < 1) {
                    echo 'no posts either…';
                    //if there is no entry or post div, mark the url as done
                    $T->RegisterURL($BlogUrl->url);
                }
            }

            //the get_entries/get_posts functions store the fragments
            //as WpContent objects
            $ct = 0;
            foreach ($BlogUrl->WpContentPieces as $WpContent) {
                $ct++;

                if ($WpContent->judge(2000, 200, 5)) {
                    $WpContent->tribute();                  //add tribute link
                    $T->settags($WpContent->divcontent);    //add tags
                    $T->postCustomRPC($WpContent->title, $WpContent->divcontent, 1); //1=publish, 0=draft
                    $T->RegisterURL($WpContent->url);       //register use of url
                    sleep(20);                              //20 seconds break, for sitemapping
                }
            }
        }
    }

notes

  • xml-rpc needs to be activated explicitly on the wordpress dashboard under settings/writing.
  • categories must be present in the blog
  • url file must be writeable by the server (777)
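
For readers who have not seen an xml-rpc post before, here is a rough sketch of the request body for a metaWeblog.newPost call, which WordPress accepts on xmlrpc.php. It only builds the XML and sends nothing; the function name build_new_post_request is hypothetical and this is not the author's Custom_RPC class. The blog id of 1 is an assumption that fits a single-blog install.

```php
<?php
// Hypothetical builder, not the author's Custom_RPC class: returns the XML
// body of a metaWeblog.newPost call without sending anything.
function build_new_post_request($user, $password, $title, $body, array $categories, $publish)
{
    // escape text so it is safe inside XML elements
    $esc = function ($s) {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };
    $cats = '';
    foreach ($categories as $cat) {
        $cats .= '<value><string>' . $esc($cat) . '</string></value>';
    }
    // parameter order: blog id, username, password, post struct, publish flag
    return '<?xml version="1.0" encoding="UTF-8"?' . '>'
        . '<methodCall><methodName>metaWeblog.newPost</methodName><params>'
        . '<param><value><string>1</string></value></param>'
        . '<param><value><string>' . $esc($user) . '</string></value></param>'
        . '<param><value><string>' . $esc($password) . '</string></value></param>'
        . '<param><value><struct>'
        . '<member><name>title</name><value><string>' . $esc($title) . '</string></value></member>'
        . '<member><name>description</name><value><string>' . $esc($body) . '</string></value></member>'
        . '<member><name>categories</name><value><array><data>' . $cats . '</data></array></value></member>'
        . '</struct></value></param>'
        . '<param><value><boolean>' . ($publish ? '1' : '0') . '</boolean></value></param>'
        . '</params></methodCall>';
}

$xml = build_new_post_request('user', 'password', 'Hello & welcome', '<p>body</p>', array('seo'), true);
echo $xml, "\n";
```

The body would be sent as an HTTP POST with content type text/xml to the rpc-url of the target blog. The category names in the struct have to exist on the blog already, which is why the second note above matters.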

It seems WordPress builds the sitemap as a background process: the standard Google XML sitemap plugin will attempt to build it in the cache (which takes anywhere between 2 and 10 seconds), and apart from building a sitemap the posts also get pinged around. Giving the install 10 to 20 seconds between posts allows all the hooked-in functions to complete.

period

That’s about all,
consider it gpl, I added some comments in the source but I will not develop this any further. A mysql backed blogfarm tool (euphemistically called ‘publishing tool’) is more interesting, besides, I am off to the wharves to do some painting.

if you use it, send some feedback,
merry christmas dogheads

Categories
google, seo, seo tips and tricks, tool, wordpress, xml-rpc
Tags
google, scrape, seo, seo tips and tricks, tool, wordpress, xml-rpc

google trends III

juust | 25/12/2008

How to get the urls and snippets from the Google Trends details page. The news articles on the details page are loaded with an ‘Ajax’ call; they are not sent to the browser in the html source, so there is no easy way to scrape them.

The blog articles are pretty straightforward. First the ugly, fast way:

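    $mytitle = 'manuel benitez';
    $mydate = ''; //e.g. 2008-12-24, leave empty for the current trends
    $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date='.urlencode($mydate).'&sa=X');
    //cut out everything between the results box and the "more results" div
    $start = strpos($html, '<div class="gsc-resultsbox-visible">');
    $end = strpos($html, '<div class="gsc-trailing-more-results">');
    //only echo when both markers were found
    if ($start !== false && $end !== false && $end > $start) {
        $content = substr($html, $start, $end - $start);
        echo $content;
    }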

That returns the blog snippets as one ugly chunk of html. The other way is regular pattern matching: you can grab the divs that each content item has, marked with

  • div class="gs-title"
  • div class="gs-relativePublishedDate"
  • div class="gs-snippet"
  • div class="gs-visibleUrl"

from the html source and organize them as an array of “Content” objects, after which you can list the content items with your own markup or store them in a database.

    //I assume $mytitle is taken from the $_GET array.

    //class 'Content' with its members
    class Content {
        public $id;
        public $title;
        public $pubdate;
        public $snippet;
        public $url;

        public function __construct($id) {
            $this->id = $id;
        }
    }

    //grab the source from the google page
    $html = file_get_contents('http://www.google.com/trends/hottrends?q='.urlencode($mytitle).'&date=&sa=X');

    //cut out the part I want
    $start = strpos($html, '<div class="gsc-resultsbox-visible">');
    $end = strpos($html, '<div class="gsc-trailing-more-results">');
    $content = substr($html, $start, $end - $start);

    //grab the divs that contain title, publish date, snippet and url with regular pattern match
    //(search only the part that was cut out above)
    preg_match_all('!<div class="gs-title">.*?</div>!si', $content, $titles);
    preg_match_all('!<div class="gs-relativePublishedDate">.*?</div>!si', $content, $pubDates);
    preg_match_all('!<div class="gs-snippet">.*?</div>!si', $content, $snippets);
    preg_match_all('!<div class="gs-visibleUrl">.*?</div>!si', $content, $urls);

    $Contents = array();

    //organize them under Content
    $count = 0;
    foreach ($titles[0] as $title) {
        //make a new instance of Content and add the title
        $Contents[] = new Content($count);
        $Contents[$count]->title = $title;
        $count++;
    }

    $count = 0;
    foreach ($pubDates[0] as $pubDate) {
        //add publishing date (contains some linebreak, remove it with strip_tags)
        $Contents[$count]->pubdate = strip_tags($pubDate);
        $count++;
    }

    $count = 0;
    foreach ($snippets[0] as $snippet) {
        //add snippet
        $Contents[$count]->snippet = $snippet;
        $count++;
    }

    $count = 0;
    foreach ($urls[0] as $url) {
        //add display url
        $Contents[$count]->url = $url;
        $count++;
    }

    //leave $count as is, the number of content items in a 0-based array
    //add rel=nofollow to links to prevent pagerank assignment to blogs
    for ($ct = 0; $ct < $count; $ct++) {
        $Contents[$ct]->url = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->url);
        $Contents[$ct]->title = preg_replace('/ target/', ' rel="nofollow" target', $Contents[$ct]->title);
    }

    //it's complete, list all content items with some markup
    for ($ct = 0; $ct < $count; $ct++) {
        echo '<h3>'.$Contents[$ct]->title.'</h3>';
        echo '<p><strong>'.$Contents[$ct]->pubdate.'</strong>:<em>'.$Contents[$ct]->snippet.'</em></p>';
        echo $Contents[$ct]->url.'<br />';
    }

It ain’t perfect, but it works. The highlighter I use gets a bit confused by the preg_match_all statements containing unclosed divs, so copying the code off the blog may not work; a text file with the source code is on trends.trismegistos.net. I added that snippet to trendinfo.php and it works fine.
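
As an alternative to the pattern matching, the same four lists can be collected with DOMDocument and DOMXPath and combined row by row. This is only a sketch: the sample markup is an assumption modelled on the class names listed above (the live page may nest things differently), and the helper name texts_for_class is hypothetical.

```php
<?php
// Sample markup is an assumption modelled on the class names listed above;
// the live page may nest or name things differently.
$sample = '<div class="gsc-resultsbox-visible">'
    . '<div class="gs-title"><a href="http://example.com/a" target="_blank">First</a></div>'
    . '<div class="gs-relativePublishedDate">2 hours ago</div>'
    . '<div class="gs-snippet">Snippet one</div>'
    . '<div class="gs-visibleUrl">example.com</div>'
    . '<div class="gs-title"><a href="http://example.com/b" target="_blank">Second</a></div>'
    . '<div class="gs-relativePublishedDate">5 hours ago</div>'
    . '<div class="gs-snippet">Snippet two</div>'
    . '<div class="gs-visibleUrl">example.com</div>'
    . '</div>';

// Hypothetical helper: text of all divs with exactly this class attribute
function texts_for_class(DOMXPath $xpath, $class)
{
    $out = array();
    foreach ($xpath->query('//div[@class="' . $class . '"]') as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

$dom = new DOMDocument();
@$dom->loadHTML($sample);   // silence warnings about the html fragment
$xpath = new DOMXPath($dom);

$titles   = texts_for_class($xpath, 'gs-title');
$dates    = texts_for_class($xpath, 'gs-relativePublishedDate');
$snippets = texts_for_class($xpath, 'gs-snippet');
$urls     = texts_for_class($xpath, 'gs-visibleUrl');

// combine the four lists row by row; stop at the shortest list
$rows = array();
$n = min(count($titles), count($dates), count($snippets), count($urls));
for ($i = 0; $i < $n; $i++) {
    $rows[] = array(
        'title'   => $titles[$i],
        'pubdate' => $dates[$i],
        'snippet' => $snippets[$i],
        'url'     => $urls[$i],
    );
}
print_r($rows);
```

Stopping at the shortest list avoids writing to a row that was never created when one of the four divs is missing for an item. Note that this version keeps only the text, so the links inside the title divs are dropped.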

Categories
google
Tags
google, scrape, trends

google trends II

juust | 22/12/2008

I wanted to reply to a question elsewhere on the site, but a ‘comment’ box isn’t fit for it, so I’ll put the reply here. The question was about creating ‘search engine friendly’ descriptive urls based on keywords from the Google Trends atom feed, with each page showing a graph of the trend.

I hacked a quick example together on a subdomain over at trends.trismegistos.net, just to be sure it works.

You can get a site to list http://domain.com/trend_title.html type urls by using mod_rewrite, an Apache module.

In the server directory of the application you can use an .htaccess file to set rules for file access in that folder and its subfolders. When the server gets a request from a browser or another server, it applies any rewriting rules you define in .htaccess to that request.

I tried this one :

    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        RewriteRule ^(.*).html /trendinfo.php?title=$1
    </IfModule>

RewriteEngine On
switches the rewrite mechanism on.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

tell the Apache server that the rewrite rule only applies to requests that do not match an existing file (-f) or directory (-d). If the requested filename exists on the server, the server dishes out that file; otherwise it tries to apply the RewriteRule. Applying the rule generates a new internal request: if that returns anything, the server dishes that out, otherwise it returns an HTTP 404 ‘file not found’.

The actual url rewrite rule is :
RewriteRule ^(.*).html /trendinfo.php?title=$1
which means :

  • if any filename is requested that satisfies the mask ^(.*).html then
  • take everything before .html
  • add that as variable $1 to trendinfo.php?title=$1
  • see if it sticks

If the browser requests http://domain.com/bob+bowersox.html, the server establishes that it is not a file or directory on the server and tests the available rules. When it notices that the requested file ends with .html, it applies the rewrite rule and internally serves http://domain.com/trendinfo.php?title=bob+bowersox instead.
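
To see what the mask captures without touching a server, the rule can be imitated in PHP with preg_match. This is a sketch of the matching only, not of how Apache processes requests; the function name rewrite_target is hypothetical.

```php
<?php
// Hypothetical helper: applies the same mask as the RewriteRule to a request
// path and returns the internal target, or null when the mask does not match.
function rewrite_target($path)
{
    // rules in .htaccess see the path without the leading slash
    $path = ltrim($path, '/');
    // same pattern as in .htaccess; the unescaped dot matches any character
    if (preg_match('!^(.*).html!', $path, $m)) {
        return '/trendinfo.php?title=' . $m[1];
    }
    return null;
}

echo rewrite_target('/bob+bowersox.html'), "\n";   // /trendinfo.php?title=bob+bowersox
var_dump(rewrite_target('/style.css'));             // NULL
```

Because the dot before html is not escaped and the pattern has no end anchor, a path like page-xhtml also matches. Writing the mask as ^(.*)\.html$ would restrict it to names that really end in .html.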

A browsing user does not notice a thing.

In trendinfo.php I wrote some code to handle the ‘new’ request :

    if (!isset($_REQUEST['title'])) {
        //if there is no $1, added as title, fake a "file not found" message
        echo 'the emptiness…';
    } else {
        //get the title from the request
        $mytitle = $_REQUEST['title'];
        //put the google trends graph url together
        $graphurl = 'http://www.google.com/trends/viz?hl=&q=';
        $graphurl .= urlencode($mytitle);
        $graphurl .= '&date=';                 //leave date blank to get the current graph
        $graphurl .= '&graph=hot_img&sa=X';
        //escape the finished url for use inside an html attribute
        $src = htmlspecialchars($graphurl, ENT_QUOTES, 'UTF-8');
        echo '<img class="hotGraph" width="280" height="190" src="'.$src.'" />';
    }

…that outputs the Google trend graph on the url http://domain.com/bob+bowersox.html
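
The graph url can also be assembled with http_build_query, which takes care of the encoding of each parameter. A small sketch with the same parameters as the trendinfo.php code; the function name trend_graph_url is hypothetical:

```php
<?php
// Hypothetical helper: same parameters as the trendinfo.php code,
// assembled with http_build_query instead of string concatenation.
function trend_graph_url($title, $date = '')
{
    $params = array(
        'hl'    => '',
        'q'     => $title,
        'date'  => $date,       // empty date asks for the current graph
        'graph' => 'hot_img',
        'sa'    => 'X',
    );
    // pass '&' explicitly so the result does not depend on php.ini settings
    return 'http://www.google.com/trends/viz?' . http_build_query($params, '', '&');
}

echo trend_graph_url('bob bowersox'), "\n";
```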

I zipped the trends.trismegistos.net program files, but that might be a bit over the top. The download contains a class that relies on a mysql table being filled every hour with new trends: cron.php, run as a cron job on the apache server, parses and stores the Google Trends atom feed, and index.php lists it as a cross-table spanning the past 24 hours.

You can also put this in index.php :

    $feed = simplexml_load_file('http://www.google.com/trends/hottrends/atom/hourly');
    $children = $feed->children('http://www.w3.org/2005/Atom');
    $parts = $children->entry;
    foreach ($parts as $entry) {
        $details = $entry->children('http://www.w3.org/2005/Atom');
        //the entry content is an html fragment with one link per trend
        $dom = new DOMDocument();
        $html = (string) $details->content;
        @$dom->loadHTML($html);
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $anchor) {
            $urltext = $anchor->nodeValue;
            //link to the descriptive url that .htaccess reroutes to trendinfo.php
            echo '<a href="'.urlencode($urltext).'.html" target="_blank">'.htmlspecialchars($urltext).'</a> ';
        }
    }
    unset($dom, $anchors, $parts, $feed);

That lists the current 100 google trends with a link. If you use the .htaccess rewrite rules, the server reroutes all the links to trendinfo.php with descriptive urls.

I hope that helps.

Categories
google, php, seo
Tags
google, php, seo, trends
