google suggest scraper (php & simplexml)

Today’s goal is a basic php Google Suggest scraper because I wanted traffic data and keywords for free.

Before we start :

google scraping is bad !

Good people use the Google AdWords API: 25 cents per 1000 units, and a keyword suggestion costs 15+ units, so 1000 suggestions run roughly 15,000 units, about 4 or 5 dollars (if they can find a good programmer, which also costs a few dollars). Or they opt for SemRush (also my preference), KeywordSpy, SpyFu, and other services like 7Search PPC programs to get keyword and traffic data and data on their competitors, but these charge about 80 dollars per month for a limited account, up to a few hundred per month for seo companies. Good people pay plenty.

We tiny grey webmice of marketing, however, just want a few estimates at low or, better, no cost. Like this :

data                              num queries
google suggest                    57800000
google suggestion box             5390000
google suggest api                5030000
google suggestion tool            3670000
google suggest a site             72700000
google suggested users            57000000
google suggestions funny          37400000
google suggest scraper            62800
google suggestions not working    87100000
google suggested user list        254000000

Suggestion autocomplete is an AJAX service; it outputs XML :

<?xml version="1.0"?>
   <toplevel>
     <CompleteSuggestion>
       <suggestion data="senior quotes"/>
       <num_queries int="30000000"/>
     </CompleteSuggestion>
     <CompleteSuggestion>
       <suggestion data="senior skip day lyrics"/>
       <num_queries int="441000"/>
     </CompleteSuggestion>
   </toplevel>

Using SimpleXML, the PHP routine is as simple as querying g00gle.c0m/complete/search?, grabbing the autocomplete xml, and extracting the attribute data :

 
        if ($_SERVER['QUERY_STRING']=='') die('enter a query like http://host/filename.php?query');
        $kw = urldecode($_SERVER['QUERY_STRING']);

        //fetch the autocomplete xml (de-obfuscate the domain first)
        $contentstring = @file_get_contents("http://g00gle.c0m/complete/search?output=toolbar&q=".urlencode($kw));
        $content = simplexml_load_string($contentstring);

        foreach($content->CompleteSuggestion as $c) {
            $term = (string) $c->suggestion->attributes()->data;
            //note : traffic data is sometimes missing
            $traffic = (string) $c->num_queries->attributes()->int;
            echo $term." ".$traffic."\n";
        }

I made a quick php script that outputs the terms as a list of new queries so you can walk through the suggestions :

The source is up for download as a text file over here (rename it to suggestit.php and it should run on any server with PHP 5.x and SimpleXML).
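The walk-through part is just a matter of echoing every suggestion as a link back to the script itself, so each click fires a new suggest query for that term. A minimal sketch of that loop (the downloadable file may differ in the details):

<?php
//suggestit.php, sketched : every suggestion becomes a link back to this file,
//so clicking a term runs a new suggest query for it
if ($_SERVER['QUERY_STRING']=='') die('enter a query like http://host/suggestit.php?query');
$kw = urldecode($_SERVER['QUERY_STRING']);

$xml = @file_get_contents("http://g00gle.c0m/complete/search?output=toolbar&q=".urlencode($kw));
$content = simplexml_load_string($xml);

foreach($content->CompleteSuggestion as $c) {
    $term = (string) $c->suggestion->attributes()->data;
    $traffic = (string) $c->num_queries->attributes()->int; //may be empty
    echo '<a href="'.basename(__FILE__).'?'.urlencode($term).'">'.htmlspecialchars($term).'</a> '.$traffic.'<br/>';
}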

proxies !

I got a site banned at Google, so I got pissed and took a script from the blackbox @ digerati marketing to scrape proxy addresses, wired a database and cURL into it, and now it scrapes proxies, picks one at random, prunes dead proxies, and returns data.

It's basic and uses anonymous (level 2) proxies, but it works.


/* (mysql table)
CREATE TABLE IF NOT EXISTS `serp_proxies` (
  `id` int(11) NOT NULL auto_increment,
  `ip` text NOT NULL,
  `port` text NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
*/

//initialize database class, replace with own code
include('init.php');

//main class
$p=new MyProxies;

//do I have proxies in the database ?
//if not, get some and store them
if($p->GetCount() < 1) {
	$p->GetSomeAir(1);
	$p->store2database();
}

//pick one
$p->RandomProxy();

//get the page
$p->ThisProxy->DoRequest('http://www.domain.com/robots.txt');

//error handling
if($p->ThisProxy->ProxyError > 0) {
//7 		no connect
//28 		timed out
//52 		empty reply
//if it is dead, doesn't allow connections : prune it
	if($p->ThisProxy->ProxyError==7) $p->DeleteProxy($p->ThisProxy->proxy_ip);
	if($p->ThisProxy->ProxyError==52) $p->DeleteProxy($p->ThisProxy->proxy_ip);
}
//you could loop back until you get a 0-error proxy, but that ain't the point

//give me the content
echo $p->ThisProxy->Content;


Class MyProxies {

	var $Proxies = array();
	var $ThisProxy;
	var $MyCount;
	

//picks a random proxy from the database
	function RandomProxy() {

		global $serpdb;	
		$offset_result =  $serpdb->query("SELECT FLOOR(RAND() * COUNT(*)) AS `offset` FROM `serp_proxies`");
		$offset_row = mysql_fetch_object($offset_result);
		$offset = $offset_row->offset;
		$result = $serpdb->query("SELECT * FROM `serp_proxies` LIMIT $offset, 1" );
		while($row=mysql_fetch_assoc($result)) {
//make instance of Proxy, with proxy_host ip and port
			$this->ThisProxy = new Proxy($row['ip'].':'.$row['port']);
			$this->ThisProxy->proxy_ip = $row['ip'];
			$this->ThisProxy->proxy_port = $row['port'];
			break;
		}
	}
	
//visit the famous russian site 
	function GetSomeAir($pages) {
			for($index=0; $index< $pages; $index++)
			{
				$pageno = sprintf("%02d",$index+1); 
				$page_url = "http://www.samair.ru/proxy/proxy-" . $pageno . ".htm";
				$page_html = @file_get_contents($page_url);

//get rid of the crap and extract the proxies :
//the listing is an html table of ip:port entries, so a regex on the
//ip:port pattern is the simplest way to pull them out
				preg_match_all('/(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})/', $page_html, $matches, PREG_SET_ORDER);
				foreach($matches as $m) {
					$this->Proxies[] = array($m[1], $m[2]);
				}
			}
	}

//store the retrieved proxies (stored in this->Proxies) in the database
	function store2database() {
		global $serpdb;
		foreach($this->Proxies as $p) { 
			$result = $serpdb->query("SELECT * FROM serp_proxies WHERE ip='".$p[0]."'");
			if(mysql_num_rows($result)<1) $serpdb->query("INSERT INTO serp_proxies (`ip`, `port`) VALUES ('".$p[0]."', '".$p[1]."')");
		}
		$serpdb->query("DELETE FROM serp_proxies WHERE `ip`=''");
	}


	function DeleteProxy($ip) {
		global $serpdb;
		$serpdb->query("DELETE FROM serp_proxies WHERE `ip`='".$ip."'");			
	}
	
	
	function GetCount() 
	{
//use this to check how many proxies there are in the database
		global $serpdb;
		$this->MyCount = mysql_num_rows($serpdb->query("SELECT * FROM `serp_proxies`"));
		return $this->MyCount; 
	}
	
	
}

Class Proxy {

	var $proxy_ip;
	var $proxy_port;
	
	var $proxy_host;
	var $proxy_auth; 
	var $ch;
	var $Content;
	var $USERAGENT = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
	var $ProxyError = 0;
	var $ProxyErrorMsg = '';
	var $TimeOut=3;
	var $IncludeHeaders = 0;
	
	function Proxy($host, $username='', $pwd='') {
//initialize class, set host 
         $this->proxy_host = $host;
         if (strlen($username) > 0 || strlen($pwd) > 0) {
            $this->proxy_auth = $username.":".$pwd;
         }
      }

	function CURL_PROXY($cc) {
			if (strlen($this->proxy_host) > 0) {
				curl_setopt($cc, CURLOPT_PROXY, $this->proxy_host);
				if (strlen($this->proxy_auth) > 0)
					curl_setopt($cc, CURLOPT_PROXYUSERPWD, $this->proxy_auth);
			}
	}
	
	function DoRequest($url) {
		$this->ch = curl_init();
		curl_setopt($this->ch, CURLOPT_URL,$url);
		$this->CURL_PROXY($this->ch);
		curl_setopt($this->ch, CURLOPT_HEADER, $this->IncludeHeaders); // include response headers in the output
		
		curl_setopt($this->ch, CURLOPT_USERAGENT, $this->USERAGENT);
		curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($this->ch, CURLOPT_TIMEOUT, $this->TimeOut);
	    $this->Content = curl_exec($this->ch);

//if an error occurs, store the number and message
		if (curl_errno($this->ch))
			{ 
				$this->ProxyError =  curl_errno($this->ch);
				$this->ProxyErrorMsg =  curl_error($this->ch);
			}
	}

}

There is not much to say about it, just a rough outline. I would prefer elite (level 1) proxies, but for now it will have to do.
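If you do want to filter for elite proxies, one rough approach (not in the code above, just a sketch) is to fetch a page that echoes the request headers back through each proxy and throw away any proxy that leaks forwarding headers:

//sketch : $proxy is an instance of the Proxy class above,
//the judge url is a placeholder for any page that prints the request headers back
function IsElite($proxy) {
    $proxy->DoRequest('http://www.example.com/headers.php'); //hypothetical header-echo page
    if ($proxy->ProxyError > 0) return false;
    //anonymous (level 2) proxies still announce themselves via headers like these
    return !preg_match('/X-Forwarded-For|HTTP_VIA|HTTP_CLIENT_IP/i', $proxy->Content);
}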

RedHat Seo : scraper auto-blogging

Just give us your endpoint and we’ll take it from there, sparky!

I was going to make one of these tools that scrapes Google and conjures a full blog out of nowhere, as a Christmas special: RedHat Seo. The rough sketch has arrived; it is far from perfect, but it does produce a blog, and it doesn't even look too shabby. I scraped a small batch of posts off of blogs, keeping the links intact and adding a tribute link. I hope they will pardon me for it.

structure

I use three main classes,

BlogMaker : the application
Target : the blogs you aim for
WPContent : the scraped goodies

…and two support classes

SerpResult : scraped urls
Custom_RPC : a simple rpc-poster

Target blogs have three text files :

file              contents                         maintenance
blog categories   categories you post under        manual
blog tags         tags you list on the blog        manual
blog urls         urls already used for the blog   system
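To illustrate, the formats could look like this (my guesses, based on the CSV_GetTags / List_GetCats calls in the usage example below, not files shipped with the script):

keyword.categories.txt :  one category per line (e.g. "seo")
keyword.tags.txt       :  comma separated tags (e.g. "google,suggest,scraper,php")
keyword.urls.txt       :  one url per line, written back by the script as posts go up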

routine

The BlogMaker class grabs a result list (up to 1000 urls per phrase) from Google and extracts the urls into SerpResult objects. It then scrapes each url, extracts the entry divs, stores the fragments in the WPContent class (which has some basic functions to sanitize the text), and uses the Target definitions to post them to blogs with xml-rpc.
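The div extraction itself isn't listed here; a sketch of how get_entries()/get_posts() could work with DOMDocument and XPath (function name and details are mine):

//sketch : pull divs whose class contains "entry" (or "post" as fallback)
function extract_divs($html, $class='entry') {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                      //silence warnings on sloppy markup
    $xpath = new DOMXPath($doc);
    $pieces = array();
    foreach($xpath->query("//div[contains(@class,'$class')]") as $div) {
        $pieces[] = $doc->saveHTML($div);        //keeps the inner links intact (php 5.3.6+)
    }
    return $pieces;
}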

usage


//make main instance
$Blog = new BlogMaker("keyword");

//define a target blog, you can define multiple blogs and refer with code
//then add rpc-url, password and user
//and for every target blog three text-files 

$T=$Blog->AddTarget(
	'blogcode', 
	'http://my.blog.com/xmlrpc.php', 
	'password', 
	'user', 
	'keyword.categories.txt', 
	'keyword.tags.txt',
	'keyword.urls.txt'
	);

//read the tags, cats and url text files stored on the server	
//all retrieved urls are tested, if the target blog already has that
//scraped url, it is discarded.
$T->CSV_GetTags();
$T->List_GetCats();
$T->ReadURL();

//grab the google result list
//use params (pages, keywords) to specify search
$Blog->GoogleResults();

$a=0;
foreach($Blog->Results as $BlogUrl) { 
		$a++;
		echo $BlogUrl->url;
//see if the url isn't used yet

	if($T->checkURL(trim($BlogUrl->url))!=true) {
			echo '...checking ';
			flush();
//if not used, get the source
			$BlogUrl->scrape();
//check for divs marked "entry"; if they aren't there, check "post"
//some blogs use other indications for the content
//but entry and post cover 40%

			$entries = $BlogUrl->get_entries();
			if(count($entries)<1) {
				echo 'no entries...';
				flush();
				$entries = $BlogUrl->get_posts();
					if(count($entries)<1) {
						echo 'no posts either...';
//if no entry-post div, mark url as done

						$T->RegisterURL($BlogUrl->url);
					}
			}

			$ct=0;
			foreach($BlogUrl->WpContentPieces as $WpContent) {
//in the get_entries/get_post function the fragments are stored
//as wpcontent
				$ct++;
	
				if($WpContent->judge(2000, 200, 5)) {
					$WpContent->tribute();  //add tribute link
					$T->settags($WpContent->divcontent); //add tags
					$T->postCustomRPC($WpContent->title, $WpContent->divcontent, 1); //1=publish, 0=draft
					$T->RegisterURL($WpContent->url);  //register use of url
					usleep(20000000);  //20 seconds break, for sitemapping
				}
			}
		}
	}
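The judge(), tribute() and settags() methods belong to the WPContent and Target classes, which aren't listed here. My guess is that judge(2000, 200, 5) reads as maximum length, minimum length and maximum link count, and tribute() appends a source link; a sketch with those assumptions (field names taken from the calls above):

//sketch only, inside the WPContent class
function judge($maxlen, $minlen, $maxlinks) {
	$len = strlen(strip_tags($this->divcontent));
	if($len < $minlen || $len > $maxlen) return false;   //too thin or too bloated
	if(substr_count(strtolower($this->divcontent), '<a ') > $maxlinks) return false; //too many links
	return true;
}

function tribute() {
	$host = parse_url($this->url, PHP_URL_HOST);
	$this->divcontent .= '<p>via <a href="'.$this->url.'">'.$host.'</a></p>';
}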

notes

  • xml-rpc needs to be activated explicitly on the wordpress dashboard under settings/writing.
  • categories must be present in the blog
  • url file must be writeable by the server (777)

It seems wordpress builds the sitemap as a background process: the standard google xml sitemap plugin will attempt to rebuild it in the cache (which takes anywhere between 2 and 10 seconds), and apart from building a sitemap the posts also get pinged around. Giving the install 10 to 20 seconds between posts allows all the hooked-in functions to complete.
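For completeness: postCustomRPC presumably wraps a metaWeblog.newPost call. A bare-bones version with PHP's xmlrpc extension and cURL would look something like this (the real Custom_RPC class may differ):

//sketch : post a title + body to wordpress via xml-rpc (metaWeblog.newPost)
function rpc_new_post($rpcurl, $user, $pass, $title, $body, $publish=1) {
	$request = xmlrpc_encode_request('metaWeblog.newPost', array(
		1,                                           //blog id, usually 1 on a single wordpress install
		$user,
		$pass,
		array('title' => $title, 'description' => $body),
		(bool) $publish                              //true = publish, false = draft
	));
	$ch = curl_init($rpcurl);
	curl_setopt($ch, CURLOPT_POST, 1);
	curl_setopt($ch, CURLOPT_POSTFIELDS, $request);
	curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$response = curl_exec($ch);
	curl_close($ch);
	return xmlrpc_decode($response);                 //new post id on success
}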

period

That’s about all.
Consider it GPL; I added some comments in the source but I will not develop this any further. A mysql-backed blogfarm tool (euphemistically called a 'publishing tool') is more interesting, and besides, I am off to the wharves to do some painting.

if you use it, send some feedback,
merry christmas dogheads