blogger auto-poster

I needed to get my new link directory's pages indexed and crawled, and Google needs some stimulation.

So I take a Blogger subdomain,
and a 700-category PHP link directory,
and make a table PLD_TAGCLOUD(CAT_ID, POSTED, LEVEL, TAG, FULLPATH):

CREATE TABLE `PLD_TAGCLOUD` (
`ID` BIGINT( 11 ) NOT NULL AUTO_INCREMENT ,
`CAT_ID` INT NOT NULL ,
`POSTED` INT NOT NULL DEFAULT 0 ,
`LEVEL` INT NOT NULL ,
`TAG` VARCHAR( 250 ) NOT NULL ,
`FULLPATH` VARCHAR( 250 ) NOT NULL ,
PRIMARY KEY ( `ID` )
) ENGINE = MYISAM

I fill the table with a recursive tree traversal of the category table, where TAG is the 'title' field,
and I build the FULLPATH url from the domain root plus the path generated by traversing the tree:


function connect() {
	$DB_USER = "";
	$DB_PASSWORD = "";
	$DB_HOST = "";
	$DB_DATA = "";
	// open the connection and bail out with the error if it fails
	$link = mysql_connect($DB_HOST, $DB_USER, $DB_PASSWORD);
	if (!$link) {
		echo mysql_error();
		exit;
	}
	mysql_select_db($DB_DATA, $link) or die(mysql_error());
	return $link;
}

$link = connect();
// start from a clean slate
mysql_query("DELETE FROM `PLD_TAGCLOUD`", $link) or die(mysql_error());

$root = 'http://links.trismegistos.net';
read($link, 0, $root, 1);
mysql_close($link);

function read($link, $rootid, $pathid, $thislevel) {
	$myqry = "SELECT * FROM `PLD_CATEGORY` WHERE `PARENT_ID`='".$rootid."'";
	$myres = mysql_query($myqry, $link) or die(mysql_error());
	if (mysql_num_rows($myres) < 1) return;
	while ($row = mysql_fetch_assoc($myres)) {
		$thispath = $pathid."/".$row['TITLE_URL'];
		// store the category id, its depth, title and full url; POSTED defaults to 0
		$add = "INSERT INTO `PLD_TAGCLOUD` (`CAT_ID`, `LEVEL`, `TAG`, `FULLPATH`) VALUES ('".$row['ID']."', '".$thislevel."', '".htmlentities($row['TITLE'], ENT_QUOTES)."', '".$thispath."/')";
		mysql_query($add, $link) or die(mysql_error());
		// recurse into the children of this category
		read($link, $row['ID'], $thispath, $thislevel + 1);
	}
}

Note: I also use a LEVEL field to store the depth of a category page (0 for the root, 1 for the main categories; mine goes down to 4).

Then we make a simple routine to grab the first record with posted=0,
grab the url,
grab the title,
grab 3 posts off Google Blog Search for the TITLE, add them to an email,
add a link to the category page url,
and mail(email, subject, message-body, headers).

And of course the coup de grâce, the cronjob: 700 posts, 4 per hour, so in about 175 hours my entire site is listed on a nice juicy blog. Just for the hell of it I put the links from the blogsearch on 'follow', so my poor victims get a link as well.
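The cron entry would look something like this; the script path and the 15-minute schedule are assumptions, anything that yields 4 runs per hour will do:

# run the poster every 15 minutes = 4 posts per hour (path is an assumption)
*/15 * * * * php /path/to/blogger_autopost.php > /dev/null 2>&1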


// connect() is the same database helper as in the first script

$link = connect();
// grab one category that has not been posted yet
$myqry = "SELECT * FROM `PLD_TAGCLOUD` WHERE `POSTED`='0' ORDER BY `ID` DESC LIMIT 1";
$myres = mysql_query($myqry, $link) or die(mysql_error());
if (mysql_num_rows($myres) < 1) exit; // nothing left to post
$row = mysql_fetch_assoc($myres);
$myurl = $row['FULLPATH'];
$mytag = $row['TAG'];
$mykey = urlencode($mytag);
// flag it as posted so the next cron run picks the next record
mysql_query("UPDATE `PLD_TAGCLOUD` SET `POSTED`='1' WHERE `ID`='".$row['ID']."'", $link) or die(mysql_error());
mysql_close($link);


$xmlSource="http://blogsearch.google.com/blogsearch_feeds?hl=en&c2coff=1&lr=&safe=active&as_drrb=q&as_qdr=d&q=".$mykey."&ie=utf-8&num=3&output=rss";
$title="";
$link="";
$description="";
$author="";
$pubDate="";
$currentElement="";
$nieuwsitems = array();

function startElement($parser, $name, $attr) {
	// a new item starts: reset the field buffers
	if (strcmp($name, "item") == 0) {
		$GLOBALS['title'] = "";
		$GLOBALS['link'] = "";
		$GLOBALS['description'] = "";
		$GLOBALS['author'] = "";
		$GLOBALS['pubDate'] = "";
	}
	$GLOBALS['currentElement'] = $name;
	if (strcmp($name, "link") == 0 && isset($attr["href"])) { $GLOBALS['href'] = $attr["href"]; }
}

function endElement($parser, $name) {
	$elements = array('title', 'link', 'description', 'author', 'pubDate');
	// an item ends: copy the collected fields into the result array and reset them
	if (strcmp($name, "item") == 0) {
		$temp = array();
		foreach ($elements as $element) {
			$temp[$element] = $GLOBALS[$element];
			$GLOBALS[$element] = "";
		}
		$GLOBALS['nieuwsitems'][] = $temp;
	}
}

function characterData($parser, $data) {
	$elements = array ('title', 'link', 'description','author','pubDate');
	foreach ($elements as $element) {
		if ($GLOBALS["currentElement"] == $element) {
			$GLOBALS[$element] .= $data;
		}
	}
}

function parseFile(){
	global $xmlSource,$nieuwsitems;
	$xml_parser=xml_parser_create();
	xml_set_element_handler($xml_parser,"startElement","endElement");
	xml_set_character_data_handler($xml_parser,"characterData");
	xml_parser_set_option($xml_parser,XML_OPTION_CASE_FOLDING,false);
	if(!($fp=fopen($xmlSource,"r"))){
		die("Cannot open $xmlSource");
	}
	while(($data=fread($fp,4096))){
		if(!xml_parse($xml_parser,$data,feof($fp))){
			die(sprintf("XML error at line %d column %d ", 
			xml_get_current_line_number($xml_parser), 
			xml_get_current_column_number($xml_parser)));
		}
	}
	xml_parser_free($xml_parser);
	return $nieuwsitems;
}

$result = parseFile();

$strResult = '';
foreach($result as $arr){
	$strResult .= '<hr />';
	$strResult .= '<h4>'.$arr["title"].'</h4>'.$arr["description"].'<br />"'.$arr["title"].'" ('.parse_url($arr["link"], PHP_URL_HOST).')<br /><br />';
}

$strResult .= '<br /><a href="'.$myurl.'" title="'.$mytag.'">trismegistos links : '.$mytag.'</a><br />';

$email = 'juustout.linkdirectory@blogger.com';
$subject = $mytag;
mail($email, $subject, $strResult, "MIME-Version: 1.0\n"."Content-type: text/html; charset=iso-8859-1");

echo $strResult;

(Note the extra headers in the PHP mail() function; they make it possible to post HTML-marked-up text, otherwise you get flat text posted and your site looks like ****.)

Spidering

Someone asked about the 'pagerank spider'. I put the code online as is; it isn't finished, and if I wanted to finish it I would make a few changes.

The main remaining issues are:

  1. memory usage
  2. how to handle the www-prefix
  3. indexed pages at Google
  4. HTTP codes

1. A big class uses a lot of memory; a MySQL-backed version has an extra dependency, takes longer to develop, and is slower. I needed a fast spider for quick feedback on a small site.

Check out phpDig: they have a mature open-source(?) spider with a MySQL backend, plus a user group and forum.

2. Google has a section where you can choose to have all domain pages represented as either juust.org or www.juust.org. It hints at that having an influence on page ranking, but gives no actual straightforward 'rule'. I have no idea what the actual impact is.
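Either way, it cannot hurt to pick one host and stick to it. A minimal sketch of the usual fix, assuming you settle on the bare domain (the host name is just an example):

<?php
// 301-redirect the www host to the bare domain so every page
// is counted under a single prefix (host name is an example)
if ($_SERVER['HTTP_HOST'] == 'www.juust.org') {
	header('HTTP/1.1 301 Moved Permanently');
	header('Location: http://juust.org'.$_SERVER['REQUEST_URI']);
	exit;
}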

3. Google indexes and caches pages while spidering other sites that link to yours: if the link was valid at the time, the page it points to is indexed and cached. Especially with files you dumped, query-result pages, or search pages, you cannot remove the cached page, but it is counted as part of your site.

Putting search pages on 'noindex' is smart, especially if you use one of those funky search-box gadgets in your template that can list any result. If someone queries your site for (nasty+term) and puts the query up as a link to your search page, then once that link is followed, a page from your site loaded with (nasty+term) gets indexed, and you cannot erase it from the cache; then you have a problem. Put the file on robots 'noindex', try to confine the search to your own domain, or use a profanity filter.
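A minimal sketch of the noindex part, assuming a PHP search page; both the X-Robots-Tag header and the meta tag are understood by Google:

<?php
// keep search result pages out of the index;
// the header must be sent before any output
header('X-Robots-Tag: noindex, nofollow');

// or, equivalently, print a meta tag inside the page's <head>
echo '<meta name="robots" content="noindex,nofollow" />';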

4. HTTP codes: I checked them out for a link-validator routine two weeks ago. I might be adding that MySQL backend after all and making a sturdier version, but not for the next few weeks.
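For the validator part, something like this sketch is enough to grab a status code (a minimal version; the real routine would need timeouts and redirect handling):

<?php
// fetch only the headers and pull the HTTP code out of the status line
function http_status($url) {
	$headers = @get_headers($url);          // e.g. array("HTTP/1.1 200 OK", ...)
	if ($headers === false) return 0;       // DNS failure, timeout, etc.
	return (int) substr($headers[0], 9, 3); // 200, 301, 404, ...
}

// 200 is fine, 301/302 means moved, 404 is a dead link
echo http_status('http://links.trismegistos.net/');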

———
Some background info:
searchtools.com/robots/robot-checklist

phpDig

monkey business

I cached my monkeys, so it doesn't take so long to retrieve the pictures every time, and added a size switch. Where can you retrieve the width of the sidebar a widget instance is in?

Anyway, when I figured the cache thing out, I reckoned: hey, let's cache the SERP too. So I cached the SERP; now it keeps a 'current' file and an archive. It half works.
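The cache itself is nothing fancy; a minimal sketch of the idea, with assumed names (the real widget code differs):

<?php
// serve the stored copy while it is fresh, otherwise fetch
// a new one and refresh the 'current' file
function cached_fetch($url, $cachefile, $maxage = 3600) {
	if (file_exists($cachefile) && (time() - filemtime($cachefile)) < $maxage) {
		return file_get_contents($cachefile);
	}
	$data = @file_get_contents($url);         // needs allow_url_fopen
	if ($data !== false) {
		file_put_contents($cachefile, $data);
	}
	return $data;
}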

At least my pages don't take an aeon to load anymore.

Tomorrow: more monkey business.


tag links

SeoQuake indicates my second blog has 52 internal links and 14 outbound, 66 links in total, but the blog is not ranking.

I ran a spider on the site that counts all the anchors (duplicate links included); it reports 120 anchors.
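The count itself is simple; a minimal sketch of the idea (the URL is a placeholder, and the real spider crawls every page rather than just one):

<?php
// count every <a href> on a page, duplicates included
$html = file_get_contents('http://example.blogspot.com/');
preg_match_all('/<a\s[^>]*href/i', $html, $matches);
echo count($matches[0])." anchors";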

Google possibly excludes anything with 100+ anchors.

I just counted the links manually and found about 50, so I wondered: where are the other 70? Then I counted the tag cloud and realized: damn, the tags. The cause is the tags per post (the main culprit) times the number of posts listed on the front page. If I use 10 posts per page with 4 tags each, I easily have (10*4) 40 tag links, and if Google counts them all as links and ditches my page, I lose the page linking the two blogs. The whole second blog becomes a closed ring with no way out, so the entire blog is excluded from page ranking.

I turned the posts per page down to a maximum of 3 (Settings, Reading), and removed the 'recent posts' widget from the sidebar. Now it's down to about 80 links.

If that is all true, then tomorrow or by Sunday both blogs should start ranking again.

A quick spidering shows the main blog is on 2.20, pages on 1.80, posts on 0.60, and the second blog on 0.60. Only the RSS feed still has the feed URL on 'follow', but that's a minor detail.

That should score quite nicely.