blogger auto-poster

I needed to get my new link directory’s pages indexed and crawled, and Google needs some stimulation.

So I take a Blogger subdomain and a 700-category PHP link directory, and make a table PLD_TAGCLOUD(CAT_ID, POSTED, LEVEL, TAG, FULLPATH):

CREATE TABLE `PLD_TAGCLOUD` (
  `ID` BIGINT(11) NOT NULL AUTO_INCREMENT,
  `CAT_ID` INT NOT NULL,
  `POSTED` INT NOT NULL DEFAULT 0,
  `LEVEL` INT NOT NULL,
  `TAG` VARCHAR(250) NOT NULL,
  `FULLPATH` VARCHAR(250) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE = MYISAM

I fill the table with a recursive traversal of the category table, where TAG is the category’s TITLE field, and I build the FULLPATH url from the domain root plus the path generated while walking the tree:


function connect() {
	$DB_USER = "";
	$DB_PASSWORD = "";
	$DB_HOST = "";
	$DB_DATA = "";
	$link = mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
	if (!$link) {
		echo $error;
		exit;
	}
	mysql_select_db($DB_DATA, $link) or die(mysql_error());
	return $link;
}

$link = connect();
$del="DELETE FROM `PLD_TAGCLOUD`";
$mydel = mysql_query($del, $link) or die(mysql_error());
@mysql_close($link);

$root = 'http://links.trismegistos.net';
read(0, $root, 1);

// walk PLD_CATEGORY depth-first and store one tag-cloud row per category
function read($rootid, $pathid, $thislevel) {
	$link = connect();
	$myqry = "SELECT * FROM `PLD_CATEGORY` WHERE `PARENT_ID`='".$rootid."'";
	$myres = mysql_query($myqry, $link) or die(mysql_error());
	if (mysql_num_rows($myres) < 1) return;
	while ($row = mysql_fetch_assoc($myres)) {
		// extend the parent path with this category's URL slug
		$thispath = $pathid."/".$row['TITLE_URL'];
		$link2 = connect();
		$add = "INSERT INTO `PLD_TAGCLOUD` (`CAT_ID`, `LEVEL`, `TAG`, `FULLPATH`) VALUES ('".$row['ID']."', '".$thislevel."', '".htmlentities($row['TITLE'], ENT_QUOTES)."', '".$thispath."/')";
		$addit = mysql_query($add, $link2) or die(mysql_error());
		@mysql_close($link2);
		// recurse into the subcategories
		read($row['ID'], $thispath, $thislevel + 1);
	}
	@mysql_close($link);
}

note: I also use the LEVEL field to store the depth of a category page (0 for the root, 1 for the main categories; mine goes down to 4)
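The LEVEL field isn’t used by the poster itself, but one thing it enables is rendering the tag cloud as an indented tree without re-walking PLD_CATEGORY. A minimal sketch (the `indent_tag` helper is hypothetical, not part of the directory):

```php
<?php
// Turn a (LEVEL, TAG) pair into an indented HTML line.
// Root categories (LEVEL = 1) get no indent; each deeper level adds one step.
function indent_tag($level, $tag) {
	return str_repeat('&nbsp;&nbsp;', $level - 1) . $tag;
}
```

Looping over the PLD_TAGCLOUD rows ordered by FULLPATH and passing each through this helper reproduces the category hierarchy visually.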

Then we make a simple routine that grabs the first record with POSTED=0,
grabs its url,
grabs its title,
grabs 3 posts off Google Blog Search for that title and adds them to an email,
adds a link to the category page url,
and calls mail(email, subject, message-body, headers)

and of course the coup de grâce: the cron job. 700 posts at 4 per hour, so in about 175 hours my entire site is listed on a nice juicy blog. Just for the hell of it I put the blog-search links on ‘follow’, so my poor victims get a link as well.
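The 4-per-hour schedule boils down to a single crontab line; the script path here is hypothetical, adjust it to wherever the poster lives:

```
# every 15 minutes = 4 posts per hour
*/15 * * * * php /path/to/autopost.php
```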


function connect() {
	$DB_USER = "";
	$DB_PASSWORD = "";
	$DB_HOST = "";
	$DB_DATA = "";
	$link = mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD) or $error = mysql_error();
	if (!$link) {
		echo $error;
		exit;
	}
	mysql_select_db($DB_DATA, $link) or die(mysql_error());
	return $link;
}

$link = connect();
// grab the first unposted record
$myqry = "SELECT * FROM `PLD_TAGCLOUD` WHERE `POSTED`='0' ORDER BY ID DESC LIMIT 1";
$myres = mysql_query($myqry, $link) or die(mysql_error());
if (mysql_num_rows($myres) < 1) exit;
$row = mysql_fetch_assoc($myres);
$myurl = $row['FULLPATH'];
$mykey = urlencode($row['TAG']);
// mark it as posted so the next cron run takes the next record
$link2 = connect();
$add = "UPDATE `PLD_TAGCLOUD` SET `POSTED`='1' WHERE `ID`='".$row['ID']."'";
$addit = mysql_query($add, $link2) or die(mysql_error());
@mysql_close($link2);
@mysql_close($link);


$xmlSource="http://blogsearch.google.com/blogsearch_feeds?hl=en&c2coff=1&lr=&safe=active&as_drrb=q&as_qdr=d&q=".$mykey."&ie=utf-8&num=3&output=rss";
$title="";
$link="";
$description="";
$author="";
$pubDate="";
$currentElement="";
$nieuwsitems = array();

function startElement($parser,$name,$attr){
	if(strcmp($name,"item")==0){
	$GLOBALS['title']="";
	$GLOBALS['link']="";
	$GLOBALS['description']="";
	$GLOBALS['author']="";
	$GLOBALS['pubDate']="";
	}
	$GLOBALS['currentElement']=$name;	
	if(strcmp($name,"link")==0){ $GLOBALS['href']=$attr["href"]; }

}

function endElement($parser, $name) {
	$elements = array('title', 'link', 'description', 'author', 'pubDate');
	if (strcmp($name, "item") == 0) {
		// close of an <item>: store the collected fields, then reset them
		foreach ($elements as $element) {
			$temp[$element] = $GLOBALS[$element];
			$GLOBALS[$element] = "";
		}
		$GLOBALS['nieuwsitems'][] = $temp;
	}
}

function characterData($parser, $data) {
	$elements = array ('title', 'link', 'description','author','pubDate');
	foreach ($elements as $element) {
		if ($GLOBALS["currentElement"] == $element) {
			$GLOBALS[$element] .= $data;
		}
	}
}

function parseFile(){
	global $xmlSource,$nieuwsitems;
	$xml_parser=xml_parser_create();
	xml_set_element_handler($xml_parser,"startElement","endElement");
	xml_set_character_data_handler($xml_parser,"characterData");
	xml_parser_set_option($xml_parser,XML_OPTION_CASE_FOLDING,false);
	if(!($fp=fopen($xmlSource,"r"))){
		die("Cannot open $xmlSource");
	}
	while(($data=fread($fp,4096))){
		if(!xml_parse($xml_parser,$data,feof($fp))){
			die(sprintf("XML error at line %d column %d ", 
			xml_get_current_line_number($xml_parser), 
			xml_get_current_column_number($xml_parser)));
		}
	}
	xml_parser_free($xml_parser);
	return $nieuwsitems;
}

$result = parseFile();

$strResult = '';
foreach($result as $arr){
	$strResult .= '< hr />';
	$strResult .= '< h4>'.$arr["title"].'< /h4>'.$arr["description"].'< br />"'.$arr["title"].'" ('.parse_url($arr["link"], PHP_URL_HOST).')< br />< br />';
}

$strResult .= '< br /> < a href="'.$myurl.'" title="'.$mykey.'">trismegistos links : '.$mykey.'< /a>< br />'; 

$email='juustout.linkdirectory@blogger.com';
$subject = $mykey;
mail($email,$subject,$strResult, "MIME-Version: 1.0\n"."Content-type: text/html; charset=iso-8859-1");

echo $strResult;

(Note: the html markup in the last lines is written as "< br />" with a space; if you cut and paste it, remove the space or you get a mess. Also note the extra header in the PHP mail() function: it makes it possible to post HTML-marked-up text, otherwise you get flat text posted and your site looks like ****.)

Spidering

Someone asked about the ‘pagerank spider’. I put the code online as is; it isn’t finished, and if I wanted to finish it I would make a few changes.

the main remaining issues are

  1. memory usage
  2. how to handle the www.-prefix
  3. indexed pages at Google
  4. HTTP codes

1. A big class uses a lot of memory; a MySQL-backed version has an extra dependency, takes longer to develop, and is slower. I needed a fast spider for quick feedback on a small site.

Check out phpDig: they have a mature open-source(?) spider with a MySQL backend, plus a user group and forum.

2. Google has a section where you can choose whether all domain pages are represented as juust.org or www.juust.org. It hints that this influences page ranking, but gives no straightforward ‘rule’. I have no idea what the actual impact is.
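Whatever the ranking impact, a spider at least has to treat the two variants as one host internally. A minimal sketch of normalizing away the www.-prefix before storing or comparing URLs (stripping rather than adding "www." is an arbitrary choice here; the point is to pick one and apply it consistently):

```php
<?php
// Collapse http://www.example.org/... and http://example.org/... onto one
// canonical form so the spider doesn't count the same page twice.
function canonical_url($url) {
	$parts = parse_url($url);
	if ($parts === false || !isset($parts['host'])) return $url;
	$host   = preg_replace('/^www\./i', '', $parts['host']);
	$scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
	$path   = isset($parts['path'])   ? $parts['path']   : '/';
	$query  = isset($parts['query'])  ? '?'.$parts['query'] : '';
	return $scheme.'://'.$host.$path.$query;
}
```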

3. Google indexes and caches pages while spidering other sites that link to yours. If the link was valid at the time, the page it points to gets indexed and cached. Especially with files you dumped, query-result pages, or search pages: you cannot remove the cached page, but it is counted against your site.

Putting search pages on ‘noindex’ is smart, especially if you use one of those funky search-box gadgets in your template that can list any result. If someone queries your site for (nasty+term) and posts the query as a link to your search page, then once the link is followed, a page from your site loaded with (nasty+term) gets indexed, and you cannot erase it from the cache; then you have a problem. Put the file on robots="noindex", try to confine the search to your own domain, or use a profanity filter.
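A minimal sketch of the noindex setup, assuming the search script lives at /search.php (the path is hypothetical):

```
<!-- in the <head> of the search-results template:
     keep the page out of the index, but let link juice flow -->
<meta name="robots" content="noindex,follow" />
```

A belt-and-braces alternative is a robots.txt `Disallow: /search.php` line, which stops well-behaved crawlers from requesting the script at all.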

4. HTTP codes: I checked them out for a link-validator routine two weeks ago. I might add that MySQL backend after all and make a sturdier version, but not in the next few weeks.
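The status-check half of such a link validator is small. A sketch under the assumption that fetching only the headers is enough (both function names are mine, not from the spider code):

```php
<?php
// Extract the numeric HTTP status code from a status line
// like "HTTP/1.1 301 Moved Permanently"; 0 means unparsable.
function parse_status($statusLine) {
	if (preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $statusLine, $m)) {
		return (int)$m[1];
	}
	return 0;
}

// Network half: get_headers() fetches only the response headers,
// element 0 is the status line. Returns 0 when the host is unreachable.
function check_link($url) {
	$headers = @get_headers($url);
	return $headers ? parse_status($headers[0]) : 0;
}
```

Anything outside the 200–399 range (or a 0) would flag the directory entry as dead or broken.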

———
Some background info
searchtools.com /robots /robot-checklist

phpDig