Blacknorati
For our next exercise, dark ones, I present :
Blacknorati
Give us your XML-RPC endpoint, password and username, choose your tags or your favorite author, and we’ll take it from there, Sparky!
An XML-RPC client wired to a streaming blog pipe, 400,000 posts a day. I could build an aggregator with that, Technorati II, or what I like more, wire it to a blog-farm and reproduce the entire stream, Blacknorati !
I reckon one should program a ‘frame’ first, front to back, and once you have a working concept you go tune the components. So what does the frame look like ?
- 1 take a streaming blog pipe
- 2 parse out the feed fragments with php
- 3 store em in a mysql database
- 4 run entries through a synonym table
- 5 re-post with xml-rpc
That doesn’t seem too hard ?
1 take a blog pipe
I did some work on the sixapart atom stream a few months ago : a constantly updating stream of blog posts in Atom format. You can pick up the details on the page about the atom stream format on the sixapart website. Sixapart use a basic Atom format with an added time notification and a header.
2 parse out the feed fragments
I got the pipe, now I need a pipe reader. With my meager mental facilities I fantasized a new stream client together, this time with the SimpleXml parser.
What does it do ?
- log onto the pipe
- strip the initial header
- store the output in a string
- on /feed (end tag) parse out time notifications
- add a standard xml-header
- hand the string to the simplexml parser.
- (store in database)
- continue
Toggling PHP’s stream_set_blocking() around the read keeps the reader 1:1 with the pipe. Don’t ask me how it works, it just does; simply accepting that makes it easier to handle the stream ;)
    // open the update stream; the connection stays open and keeps sending data
    $fnr = fopen('http://updates.sixapart.com/atom-stream.xml', "r");
    $mytime = time();
    $inStream = 0;   // set to 1 once the initial header has been stripped
    $string = '';    // raw stream data collected so far

    while(1) {
        $buffer='';
        // read whatever is available right now without waiting
        stream_set_blocking($fnr, FALSE);
        $buffer = fread($fnr, 1600);
        stream_set_blocking($fnr, TRUE);

        if(trim($buffer)=='') {
            // nothing on the pipe yet
        } else {
            if($inStream<1) {
                // the header ends with the close of a comment; skip past it
                $tpos = strpos($buffer,'-->');
                if($tpos!==false) {
                    $buffer=substr($buffer, $tpos+3);
                    $inStream=1;
                }
            }

            if($inStream==1) {
                $string = $string.$buffer;
                flush();
                if(strpos($string,'</feed>')>0) {
                    // cut one complete feed fragment off the front of the collected data
                    $mystring=substr($string,0,strpos($string,'</feed>'));
                    $remstring=substr($string, strpos($string,'</feed>')+7);
                    $string=$remstring;
                    // remove the time notifications
                    $tstring = preg_replace('/<time>(.*?)<\/time>/', '', $mystring);
                    // add a standard xml header and put the end tag back
                    $xstring="<?xml version='1.0' encoding='utf-8' ?>".$tstring.'</feed>';

                    $xml = simplexml_load_string($xstring);
                    if(!$xml) { } else {
                        echo '<strong>has xml</strong>';
                        foreach($xml->entry as $e){
                            echo $e->title;
                            echo $e->link['href'];
                            echo $e->published;
                            echo $e->updated;
                            //echo $e->content;
                        }
                        flush();
                        unset($xml);
                        unset($mystring);
                        unset($xstring);
                        unset($tstring);
                    }
                }
            }
        }
    }
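The cut-at-the-end-tag step from the reader above can also be pulled out into a little function, which makes it easy to try without a live pipe. This is only a sketch: split_feed_fragments is a name I made up, the sample data is invented to look like the stream described above, and it assumes every update arrives as one feed element.

```php
<?php
// Hypothetical helper: pull every complete <feed>...</feed> fragment out of
// $buffer and return them together with the unfinished remainder.
function split_feed_fragments(string $buffer): array
{
    $fragments = [];
    $endTag = '</feed>';

    while (($end = strpos($buffer, $endTag)) !== false) {
        $chunk  = substr($buffer, 0, $end + strlen($endTag));
        $buffer = substr($buffer, $end + strlen($endTag));

        // Drop whatever sits before the opening tag (time notifications, whitespace).
        $start = strpos($chunk, '<feed');
        if ($start === false) {
            continue; // stray end tag, nothing usable
        }
        $fragments[] = substr($chunk, $start);
    }

    return [$fragments, $buffer];
}

// Invented sample: one complete fragment followed by half of the next one.
$sample = '<time>1199145600</time>'
    . '<feed xmlns="http://www.w3.org/2005/Atom">'
    . '<entry><title>Hello</title><link href="http://example.com/1"/></entry>'
    . '</feed>'
    . '<time>1199145601</time><feed xmlns="http://www.w3.org/2005/Atom"><entry><title>Half';

list($fragments, $rest) = split_feed_fragments($sample);

foreach ($fragments as $fragment) {
    $xml = simplexml_load_string($fragment);
    if ($xml === false) {
        continue; // skip fragments that are not well-formed
    }
    foreach ($xml->entry as $e) {
        echo $e->title, ' ', $e->link['href'], "\n";
    }
}
```

The remainder in $rest is what you would glue the next read onto, same as $string in the loop.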
(tune, tune, tune….) okay that seems to work… so now we’re gonna add a database where we store author-url, title, content :
3 store em in a database
We make one mysql table for the feed data
    CREATE TABLE `feed` (
      `id` bigint(11) NOT NULL auto_increment,
      `date` varchar(12) NOT NULL,
      `url` varchar(120) NOT NULL,
      `entries` varchar(3) NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM
I’m not using that yet, that’s for later on…
and one mysql table for the feed entries
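    CREATE TABLE `entry` (
      -- AUTO_INCREMENT, because the insert further down never supplies an id
      `id` BIGINT( 11 ) NOT NULL AUTO_INCREMENT ,
      `title` VARCHAR( 150 ) NOT NULL ,
      `feedid` BIGINT( 11 ) NOT NULL ,
      `url` VARCHAR( 150 ) NOT NULL ,
      `content` BLOB NOT NULL ,
      -- not filled yet, so it needs a default
      `tags` VARCHAR( 150 ) NOT NULL DEFAULT '' ,
      -- Atom timestamps are 20 characters or more, 12 would cut them off
      `datep` VARCHAR( 30 ) NOT NULL ,
      `dateu` VARCHAR( 30 ) NOT NULL ,
      PRIMARY KEY ( `id` )
    ) ENGINE = MYISAM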
Then we cut and paste a simple connect function
    function connect_data() {
        $DB_USER = "";
        $DB_PASSWORD = "";
        $DB_HOST = "";
        $DB_DATA = "";
        // the old mysql_* functions were removed in PHP 7, so this uses mysqli
        $link = mysqli_connect($DB_HOST, $DB_USER, $DB_PASSWORD, $DB_DATA);
        if (!$link) {
            die('connect error '.mysqli_connect_error());
        }
        return $link;
    }
and make a query and store the stuff from the simplexml $xml object in the database :
    $entrylink=connect_data();
    foreach($xml->entry as $e){
        // htmlentities() is not SQL escaping, so every value is escaped for the query as well
        $myqry="INSERT INTO entry (`feedid`, `title`,`url`,`datep`,`dateu`,`content`) VALUES (
        '0',
        '".mysqli_real_escape_string($entrylink, htmlentities($e->title, ENT_QUOTES, 'UTF-8'))."',
        '".mysqli_real_escape_string($entrylink, htmlentities($e->link['href'], ENT_QUOTES, 'UTF-8'))."',
        '".mysqli_real_escape_string($entrylink, htmlentities($e->published, ENT_QUOTES, 'UTF-8'))."',
        '".mysqli_real_escape_string($entrylink, htmlentities($e->updated, ENT_QUOTES, 'UTF-8'))."',
        '".mysqli_real_escape_string($entrylink, htmlentities($e->content, ENT_QUOTES, 'UTF-8'))."')";
        $myresult=mysqli_query($entrylink, $myqry) or die ('entry error '.mysqli_error($entrylink));
    }
    mysqli_close($entrylink);
Wow dude, ten posts a second (don’t forget to put a timer in the routine, this one’s got no brakes)
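Gluing the INSERT together from strings is fragile: one odd quote or backslash in a title and the query falls over. A prepared statement avoids that. The sketch below is not what the routine above uses; it assumes PDO with the SQLite driver is available and uses an in-memory SQLite table as a stand-in for the MySQL `entry` table, with an invented Atom fragment as input.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the MySQL `entry` table.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE entry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    feedid INTEGER NOT NULL,
    title TEXT NOT NULL,
    url TEXT NOT NULL,
    datep TEXT NOT NULL,
    dateu TEXT NOT NULL,
    content TEXT NOT NULL
)');

// Invented fragment in the shape the stream reader hands over.
$xml = simplexml_load_string(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry>'
    . '<title>It\'s a "test" \\</title>'
    . '<link href="http://example.com/1"/>'
    . '<published>2008-01-01T00:00:00Z</published>'
    . '<updated>2008-01-01T00:05:00Z</updated>'
    . '<content>Body text</content>'
    . '</entry></feed>'
);

// Placeholders keep the values out of the SQL text, so quotes and
// backslashes in a post cannot break the statement.
$stmt = $pdo->prepare(
    'INSERT INTO entry (feedid, title, url, datep, dateu, content)
     VALUES (0, :title, :url, :datep, :dateu, :content)'
);

foreach ($xml->entry as $e) {
    $stmt->execute([
        ':title'   => (string) $e->title,
        ':url'     => (string) $e->link['href'],
        ':datep'   => (string) $e->published,
        ':dateu'   => (string) $e->updated,
        ':content' => (string) $e->content,
    ]);
}

echo (int) $pdo->query('SELECT COUNT(*) FROM entry')->fetchColumn(), " row(s) stored\n";
```

With MySQL only the connection string and the CREATE TABLE would change; the prepare/execute part stays the same.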
4 run it through a synonym table
Then we get a synonym database, and make a routine to edit a post (3 out of 100 words) and generate unique new content.
Nice : mister John Watson made a synonym database with an API, 10,000 requests a day for free, hallelujah. There is WordWeb and FreeThesaurus.net if you are into scraping, but I’ll just take the one with the API.
(Not interesting for now; first I wanna make a post. If I cannot post, why bother with the rest ?)
5 bombs away! long range xml-rpc payload
Let’s pick a remote blog and fire the post with XML-RPC (if that is installed).
(code from nickycakes.com)
    function wpPostXMLRPC($title,$body,$rpcurl,$username,$password,$categories=array(1)){
        $categories = implode(",", $categories);
        $XML = "<title>$title</title>"."<category>$categories</category>".$body;
        $params = array('','',$username,$password,$XML,1);
        $request = xmlrpc_encode_request('blogger.newPost',$params);

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
        curl_setopt($ch, CURLOPT_URL, $rpcurl);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);
        curl_exec($ch);
        curl_close($ch);
    }
My ‘target’ : pagerank.livejournal.com.
Whoops, oh dear, no PHP xmlrpc extension installed, what now ? Darn, let’s do it ‘donkey’-style and just code an XML payload the old-fashioned way, short and sweet :
    $ch = curl_init('http://www.livejournal.com/interface/blogger');
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_POSTFIELDS, getmyxml());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);
    $return = curl_exec($ch);
    curl_close($ch);
    print_r($return);

    function getmyxml() {
        // XML-RPC element names are case-sensitive: methodCall, methodName
        $myxml='<?xml version="1.0" encoding="UTF-8"?>';
        $myxml.='<methodCall>';
        $myxml.='<methodName>blogger.newPost</methodName>';
        $myxml.='<params>';
        $myxml.='<param>';
        $myxml.='<value><string>blogname</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><int></int></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>blogname</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>password</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>This is not a lovesong</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><int>1</int></value>';
        $myxml.='</param>';
        $myxml.='</params>';
        $myxml.='</methodCall>';
        return $myxml;
    }
LiveJournal Is Your Friend! One line posted successfully on the LiveJournal bloggy.
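Hand-writing the payload is fine for one fixed line of text, but the first & or < in a real post would break the XML. Here is a rough sketch of a builder that escapes every value. build_new_post_xml is a made-up name, the values are placeholders, and the parameter order (appkey, blogid, username, password, content, publish) and the boolean publish flag are taken from the Blogger API as I understand it, so check that against whatever endpoint you talk to.

```php
<?php
// Hypothetical helper: build a blogger.newPost request body.
// Assumed parameter order: appkey, blogid, username, password, content, publish.
function build_new_post_xml(string $appkey, string $blogid, string $user,
                            string $pass, string $content, bool $publish): string
{
    // Escape &, <, > and quotes so a value cannot break the XML.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>';
    $xml .= '<methodCall><methodName>blogger.newPost</methodName><params>';
    foreach ([$appkey, $blogid, $user, $pass, $content] as $value) {
        $xml .= '<param><value><string>' . $esc($value) . '</string></value></param>';
    }
    $xml .= '<param><value><boolean>' . ($publish ? '1' : '0') . '</boolean></value></param>';
    $xml .= '</params></methodCall>';

    return $xml;
}

// Placeholder credentials and a line of text with characters that need escaping.
$payload = build_new_post_xml('', 'the_blogid', 'the_bloguser', 'the_password',
                              'Tom & Jerry <b>rule</b>', true);
echo $payload, "\n";
```

The string that comes out can go straight into CURLOPT_POSTFIELDS in place of the hand-written one.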
Now let us wire it to the database :)
    function connect_data() {
        $DB_USER = "";
        $DB_PASSWORD = "";
        $DB_HOST = "";
        $DB_DATA = "";
        // the old mysql_* functions were removed in PHP 7, so this uses mysqli
        $link = mysqli_connect($DB_HOST, $DB_USER, $DB_PASSWORD, $DB_DATA);
        if (!$link) {
            die('connect error '.mysqli_connect_error());
        }
        return $link;
    }

    $mylink = connect_data();
    $myrst = "SELECT * FROM entry";
    $mytbl = mysqli_query($mylink, $myrst);

    while($row=mysqli_fetch_assoc($mytbl)) {
        $ch = curl_init('http://www.livejournal.com/interface/blogger');
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
        curl_setopt($ch, CURLOPT_POSTFIELDS, getmyxml($row['content']));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 1);

        $ret = curl_exec($ch);
        curl_close($ch);
        print_r($ret);
        flush();
    }

    function getmyxml($content) {
        // the stored content is htmlentities()-encoded: decode it, then escape it for XML
        $content = htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');

        // XML-RPC element names are case-sensitive: methodCall, methodName
        $myxml='<?xml version="1.0" encoding="UTF-8"?>';
        $myxml.='<methodCall>';
        $myxml.='<methodName>blogger.newPost</methodName>';
        $myxml.='<params>';
        $myxml.='<param>';
        $myxml.='<value><string>the_blogid</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><int></int></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>the_bloguser</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>the_password</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><string>'.$content.'</string></value>';
        $myxml.='</param>';
        $myxml.='<param>';
        $myxml.='<value><int>1</int></value>';
        $myxml.='</param>';
        $myxml.='</params>';
        $myxml.='</methodCall>';
        return $myxml;
    }
Wow, dude, massive online blog reproduction, we got a live one…
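One thing the scripts above don’t do is look at what comes back; they only print it. An XML-RPC server reports errors as a fault structure inside the response, so a small check tells you whether a post actually landed. This is a sketch only: xmlrpc_fault_message is a name I invented, and both sample responses are made up for the demo, not captured from LiveJournal.

```php
<?php
// Hypothetical helper: return the faultString of an XML-RPC response,
// or null when the response is not a fault.
function xmlrpc_fault_message(string $response): ?string
{
    $xml = @simplexml_load_string($response);
    if ($xml === false) {
        return 'unparseable response';
    }
    if (!isset($xml->fault)) {
        return null;
    }
    foreach ($xml->fault->value->struct->member as $member) {
        if ((string) $member->name === 'faultString') {
            $value = $member->value;
            // The <string> wrapper is optional in XML-RPC.
            return isset($value->string) ? (string) $value->string : trim((string) $value);
        }
    }
    return 'unknown fault';
}

// Made-up responses: one success, one fault.
$ok = '<?xml version="1.0"?><methodResponse><params><param>'
    . '<value><string>12345</string></value>'
    . '</param></params></methodResponse>';

$bad = '<?xml version="1.0"?><methodResponse><fault><value><struct>'
    . '<member><name>faultCode</name><value><int>101</int></value></member>'
    . '<member><name>faultString</name><value><string>Invalid password</string></value></member>'
    . '</struct></value></fault></methodResponse>';

var_dump(xmlrpc_fault_message($ok));
var_dump(xmlrpc_fault_message($bad));
```

In the posting loop you would run this on the curl result and log or skip the row when it returns a message.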
That ends today’s discourse on how to build a basic massive XML-RPC real-time carefree blog-farming tool. Over the next few weeks I will tune the system, and one day you shall witness the Rise of Blacknorati.
So what’s on the menu ?
* synonyms on tags
* excluding non-english posts
* acquiring new blogs
* adding some nice anchors here and there
* replace blogger.newPost with metaWeblog.newPost
* etcetera.
I got the xml-rpc remote post snippet from nickycakes
I got the basic ideas on pipes from Chris Chabot (don’t ask me how?)
I stumbled on Joseph Scott’s blog; he seems to be one of the developers of the WordPress XML-RPC module.






