RedHat Seo : scraper auto-blogging
juust | December 26, 2008
Just give us your endpoint and we’ll take it from there, sparky!
I was going to make one of these tools that scrape Google and conjure a full blog out of nowhere as a Christmas special: RedHat Seo. The rough sketch has arrived. It is far from perfect, but it does produce a blog that doesn’t even look too shabby. I scraped a small batch of posts off blogs, keeping the links intact and adding a tribute link to each. I hope the authors will pardon me for it.
structure
I use three main classes,
| BlogMaker | the application |
| Target | the blogs you aim for |
| WPContent | the scraped goodies |
…and two support classes
| SerpResult | scraped urls |
| Custom_RPC | a simple rpc-poster |
Each target blog has three text files,
| file | contents | maintenance |
| blog categories | category you post under | manual |
| blog tags | tags you list on the blog | manual |
| blog urls | urls already used for the blog | system |
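The url file is the only one the system maintains itself. Here is a minimal sketch of how such a file could be read and appended to, assuming one url per line. The `UrlLedger` class and its method names are hypothetical and are not the `Target` methods from the actual source.

```php
<?php
// Hypothetical helper (not part of the original source): keeps a
// one-url-per-line text file of urls already used for a target blog.
final class UrlLedger
{
    /** @var array<string, bool> urls already registered, used as a set */
    private $seen = [];
    /** @var string path to the url text file */
    private $path;

    public function __construct(string $path)
    {
        $this->path = $path;
        if (is_file($path)) {
            $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            foreach ($lines ?: [] as $line) {
                $this->seen[trim($line)] = true;
            }
        }
    }

    // True if the url was registered in this run or in an earlier one.
    public function has(string $url): bool
    {
        return isset($this->seen[trim($url)]);
    }

    // Append the url to the file unless it is empty or already known.
    public function register(string $url): void
    {
        $url = trim($url);
        if ($url === '' || isset($this->seen[$url])) {
            return;
        }
        // LOCK_EX keeps overlapping runs from interleaving their lines.
        file_put_contents($this->path, $url . PHP_EOL, FILE_APPEND | LOCK_EX);
        $this->seen[$url] = true;
    }
}
```

Holding the urls as array keys keeps the "already used?" check cheap even when a phrase returns close to 1000 results.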
routine
The BlogMaker class grabs a result list (up to 1000 urls per phrase) from Google, extracts the urls and stores them as SerpResult objects. It then scrapes each url, extracts the entry divs and stores them in WPContent objects (that class has some basic functions to sanitize the text). Finally it uses the Target definitions to post the content to the blogs with xml-rpc.
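The "extract the entry divs" step can be sketched with PHP's DOM extension. This is not the code from the full source. The function name `extractDivsByClass` is made up for illustration, and it assumes "entry" and "post" are CSS class names on the div, which the original does not spell out.

```php
<?php
/**
 * Return the inner HTML of every <div> whose class list contains $className.
 * Hypothetical helper; assumes the marker is a class name, not an id.
 */
function extractDivsByClass(string $html, string $className): array
{
    // Only accept plain class names so the XPath string stays well-formed.
    if (trim($html) === '' || !preg_match('/^[A-Za-z0-9_-]+$/', $className)) {
        return [];
    }

    $doc = new DOMDocument();
    // Scraped pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside a space-separated class list.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $className
    );

    $fragments = [];
    foreach ($xpath->query($query) as $div) {
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $fragments[] = trim($inner);
    }
    return $fragments;
}

// Try "entry" first and fall back to "post", as the routine describes.
$page = '<div class="post hentry"><p>Some <a href="http://example.com/">text</a></p></div>';
$entries = extractDivsByClass($page, 'entry');
if (count($entries) < 1) {
    $entries = extractDivsByClass($page, 'post');
}
```

Links inside the div survive because the fragment is serialized back to HTML rather than reduced to text. Note that `loadHTML` guesses the encoding when the page does not declare one, so non-ASCII text may need extra care.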
usage
My highlighter tends to mess up text with div markers in it, so copying off the blog may not work.
The full text source (about 500 lines) is over here. Underneath I’ll list the main program loop:
//make main instance
$Blog = new BlogMaker("keyword");

//define a target blog, you can define multiple blogs and refer with code
//then add rpc-url, password and user
//and for every target blog three text-files
$T=$Blog->AddTarget(
    'blogcode',
    'http://my.blog.com/xmlrpc.php',
    'password',
    'user',
    'keyword.categories.txt',
    'keyword.tags.txt',
    'keyword.urls.txt'
);

//read the tags, cats and url text files stored on the server
//all retrieved urls are tested, if the target blog already has that
//scraped url, it is discarded.
$T->CSV_GetTags();
$T->List_GetCats();
$T->ReadURL();

//grab the google result list
//use params (pages, keywords) to specify search
$Blog->GoogleResults();

$a=0;
foreach($Blog->Results as $BlogUrl) {
    $a++;
    echo $BlogUrl->url;

    //see if the url isn't used yet
    if($T->checkURL(trim($BlogUrl->url))!=true) {
        echo '…checking ';
        flush();

        //if not used, get the source
        $BlogUrl->scrape();

        //check for divs marked "entry", if they aren't there, check "post"
        //some blogs use other indications for the content
        //but entry and post cover 40%
        $entries = $BlogUrl->get_entries();
        if(count($entries)<1) {
            echo 'no entries…';
            flush();
            $entries = $BlogUrl->get_posts();
            if(count($entries)<1) {
                echo 'no posts either…';
                //if no entry-post div, mark url as done
                $T->RegisterURL($BlogUrl->url);
            }
        }

        $ct=0;
        foreach($BlogUrl->WpContentPieces as $WpContent) {
            //in the get_entries/get_post function the fragments are stored
            //as wpcontent
            $ct++;

            if($WpContent->judge(2000, 200, 5)) {
                $WpContent->tribute(); //add tribute link
                $T->settags($WpContent->divcontent); //add tags
                $T->postCustomRPC($WpContent->title, $WpContent->divcontent, 1); //1=publish, 0=draft
                $T->RegisterURL($WpContent->url); //register use of url
                usleep(20000000); //20 seconds break, for sitemapping
            }
        }
    }
}
notes
- xml-rpc needs to be activated explicitly on the WordPress dashboard under settings/writing.
- the categories must already be present in the blog
- the url file must be writable by the server (777)
It seems WordPress builds the sitemap as a background process. The standard Google XML sitemap plugin will attempt to build it in the cache (this takes anywhere between 2 and 10 seconds), and apart from building a sitemap the posts also get pinged around. Giving the install 10 to 20 seconds between posts allows all the hooked-in functions to complete.
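For anyone who wants to see what an rpc-poster has to send, here is a rough sketch of a `metaWeblog.newPost` call built by hand. It is not the Custom_RPC class from the source: the function names `xmlText`, `buildNewPostRequest` and `sendXmlRpc` are made up, and the blog id `1` is an assumption for a single-blog install. Mind that the API takes the user before the password, the reverse of the AddTarget argument order above.

```php
<?php
/** Escape text for use inside an XML element. */
function xmlText(string $value): string
{
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

/** Build the XML body of a metaWeblog.newPost call (hypothetical helper). */
function buildNewPostRequest(
    string $user,
    string $password,
    string $title,
    string $html,
    array $categories,
    bool $publish
): string {
    $categoryValues = '';
    foreach ($categories as $category) {
        $categoryValues .= '<value><string>' . xmlText((string) $category) . '</string></value>';
    }
    $member = static function (string $name, string $valueXml): string {
        return '<member><name>' . $name . '</name><value>' . $valueXml . '</value></member>';
    };
    $struct = '<struct>'
        . $member('title', '<string>' . xmlText($title) . '</string>')
        . $member('description', '<string>' . xmlText($html) . '</string>')
        . $member('categories', '<array><data>' . $categoryValues . '</data></array>')
        . '</struct>';
    $params = [
        '<string>1</string>',                                  // blog id (assumed)
        '<string>' . xmlText($user) . '</string>',
        '<string>' . xmlText($password) . '</string>',
        $struct,
        '<boolean>' . ($publish ? '1' : '0') . '</boolean>',   // 1=publish, 0=draft
    ];
    $xml = '<?xml version="1.0" encoding="UTF-8"?' . '>'
        . '<methodCall><methodName>metaWeblog.newPost</methodName><params>';
    foreach ($params as $param) {
        $xml .= '<param><value>' . $param . '</value></param>';
    }
    return $xml . '</params></methodCall>';
}

/** POST the request to an xmlrpc.php endpoint; returns the raw response or null. */
function sendXmlRpc(string $endpoint, string $xml): ?string
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: text/xml\r\n",
        'content' => $xml,
        'timeout' => 30,
    ]]);
    $response = @file_get_contents($endpoint, false, $context);
    return $response === false ? null : $response;
}
```

In a loop you would call `sendXmlRpc()` once per post and then `sleep(20)` before the next one, which is the same pause as the `usleep(20000000)` in the main loop.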
period
That’s about all,
consider it GPL. I added some comments in the source, but I will not develop this any further. A MySQL-backed blogfarm tool (euphemistically called ‘publishing tool’) is more interesting. Besides, I am off to the wharves to do some painting.
if you use it, send some feedback,
merry christmas dogheads






