synonymizer with api
juust | December 28, 2008
If you want to put some old content on the net and have it indexed as fresh, unique content, an automated synonymizer works wonders for SEO-friendly backlinks. I want one that makes my content unique without my having to type a single character.
Lucky for me, mister John Watson’s synonym database comes with a free API, good for 10,000 requests a day, and boy is it sweet!
API Requests are straightforward :
http://words.bighugelabs.com/api/2/[apikey]/[keyword]/xml
A number of return formats are supported, but XML is the easiest, whether you parse it with SimpleXML or with regular-expression pattern matching.
A request for the keyword black returns an XML file like this (slightly shortened):
<words>
<w p="adjective" r="syn">bleak</w>
<w p="adjective" r="syn">sinister</w>
<w p="adjective" r="sim">dark</w>
<w p="adjective" r="sim">angry</w>
<w p="noun" r="syn">blackness</w>
<w p="noun" r="syn">inkiness</w>
<w p="verb" r="syn">blacken</w>
<w p="verb" r="syn">melanize</w>
</words>
…which is easiest handled with preg_match_all :
function getsynonyms($keyword) {
    $pick = array();
    $apikey = 'get your own key';
    //urlencode keeps odd characters in the keyword from breaking the URL
    $xml = file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');

    if(!$xml) return $pick; //return empty array

    preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
    //preg_match_all('/<w p="adjective" r="sim">(.*?)<\/w>/', $xml, $adj_sims);
    //preg_match_all('/<w p="noun" r="syn">(.*?)<\/w>/', $xml, $noun_syns);
    //preg_match_all('/<w p="verb" r="syn">(.*?)<\/w>/', $xml, $verb_syns);

    //index 1 holds the captured words, index 0 the full matches including tags
    foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
    //same for verb/noun synonyms, I just want adjectives

    return $pick;
}
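The same XML can also be read with SimpleXML instead of a regular expression. What follows is only a minimal sketch, assuming the response has the shape of the sample for black shown earlier; parse_synonyms is a made-up helper name, not something the API or this script provides.

```php
<?php
// Sample response for "black", copied from the shortened XML in this post.
$xml = <<<XML
<words>
<w p="adjective" r="syn">bleak</w>
<w p="adjective" r="syn">sinister</w>
<w p="adjective" r="sim">dark</w>
<w p="adjective" r="sim">angry</w>
<w p="noun" r="syn">blackness</w>
<w p="noun" r="syn">inkiness</w>
<w p="verb" r="syn">blacken</w>
<w p="verb" r="syn">melanize</w>
</words>
XML;

// Hypothetical helper: pick the words with a given part of speech (p)
// and relation (r) out of an XML response.
function parse_synonyms($xml, $part = 'adjective', $relation = 'syn') {
    $pick = array();
    $doc = @simplexml_load_string($xml);
    if ($doc === false) return $pick; // unparsable response: return empty array
    foreach ($doc->w as $w) {
        if ((string)$w['p'] === $part && (string)$w['r'] === $relation) {
            $pick[] = (string)$w; // element text, without the tags
        }
    }
    return $pick;
}

print_r(parse_synonyms($xml));                // adjective synonyms
print_r(parse_synonyms($xml, 'verb', 'syn')); // verb synonyms
```

The upside over pattern matching is that the part of speech and relation become plain parameters, so nouns and verbs need no extra patterns.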
Practically applying it, I take a slab of stale old content and…
- strip tags
- do a regular-expression match on all alphanumeric sequences, dropping everything else
- trim the resulting array elements
- (merge all blog tags, categories, and a list of common words)
- exclude those common terms from the array of text elements
- exclude words of N characters or fewer
- set a percentage of words to be synonymized
- attempt to retrieve synonyms for the remaining terms
- replace these words in the original text, keeping count
- abort when I reach the target replacement percentage
- return (hopefully) a revived text
function synonymize($origtext) {

    //make a copy of the original text to dissect
    $content = $origtext;
    //$content = $this->body;

    $perc = 3;          //target percentage changed terms
    $minlength = 4;     //candidates must be longer than this
    $maxrequests = 80;  //max use of api-requests

    //dump tags
    $content = strip_tags($content);

    //dump non-alphanumeric string characters
    $content = preg_replace('/[^A-Za-z0-9\-]/', ' ', $content);

    //explode on blank space
    $wrds = explode(' ', strtolower($content));

    //trim off blank spaces just in case, skip the empty strings left by runs of spaces
    $words = array();
    for($w=0;$w<count($wrds);$w++) {
        if(trim($wrds[$w]) !== '') $words[] = trim($wrds[$w]);
    }

    //this should be all words
    $wordcount = count($words);

    //how many words do I want changed ?
    $toswitch = round($wordcount*$perc/100);

    //only use uniques
    $words_unique = array_unique($words);

    //sort alphabetically; this also renumbers the keys for the loop below
    sort($words_unique);

    //merge common with tags, categories, linked_tags
    $common = array("never", "about", "price");
    //note : setting the minlength to 4 excludes lots of common terms

    $words_select = array();
    for($i=0;$i<count($words_unique);$i++) {
        //if in common array, not selectable for synonymizing
        if(in_array($words_unique[$i], $common)) continue;
        //only terms bigger than minlength
        if(strlen($words_unique[$i])>$minlength) {
            //words_select contains candidates for synonyms
            $words_select[] = $words_unique[$i];
        }
    }

    //terms that can be changed
    $max = count($words_select);

    //no more requests than max
    if($max>$maxrequests) $max = $maxrequests;

    $total_switched = 0;
    for($i=0;$i<$max;$i++) {
        //give server some time
        usleep(100000);
        //retrieve synonyms
        $these_words = getsynonyms($words_select[$i]);
        if(count($these_words)>0) {
            $count = 0;
            //the replacements are done in the original text, on whole words only,
            //using the first synonym returned
            $origtext = preg_replace('/\b'.preg_quote($words_select[$i], '/').'\b/i', $these_words[0], $origtext, -1, $count);
            $total_switched += $count;
        }
        //have we reached the percentage ?
        if($total_switched>=$toswitch) break;
    }

    //okay!
    return $origtext;
}

function getsynonyms($keyword) {
    $pick = array();
    $apikey = 'get your own key at bighugelabs.com';
    $xml = @file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');
    if(!$xml) return $pick;
    preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
    //index 1 holds the captured words without the surrounding tags
    foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
    return $pick;
}
Nothing fancy, just a straightforward search-replace routine. A 1200-word text has about 150 candidates, and for 3% synonyms I need to replace 36 words, which it can do. If I were to use it for real, I would build a table of non-returning terms (words that come back without synonyms) and store the results for often-used terms. That would speed up the synonymizing, allow the use of preferences, and take a load off the API.
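Such a table could look something like the sketch below. It is an illustration only: the SynonymCache class, the JSON file it writes, and the $fetch callback are all assumptions of mine, and a stand-in function takes the place of the API call so the sketch runs without a key.

```php
<?php
// Hypothetical lookup table: remembers synonyms per term, including terms
// that returned nothing, so each term costs at most one API request.
class SynonymCache {
    private $file;            // where the table is stored as JSON (assumed format)
    private $fetch;           // callback that really looks a term up
    private $table = array(); // term => array of synonyms (empty = non-returning)
    public $apicalls = 0;     // how often the callback was needed

    public function __construct($file, $fetch) {
        $this->file = $file;
        $this->fetch = $fetch;
        if (is_file($file)) {
            $stored = json_decode(file_get_contents($file), true);
            if (is_array($stored)) $this->table = $stored;
        }
    }

    public function get($term) {
        $term = strtolower($term);
        if (!array_key_exists($term, $this->table)) {
            $this->apicalls++;
            $this->table[$term] = call_user_func($this->fetch, $term);
            file_put_contents($this->file, json_encode($this->table));
        }
        return $this->table[$term];
    }
}

// Stand-in for getsynonyms(), only here to make the sketch self-contained.
$fake = function ($term) {
    return $term === 'black' ? array('bleak', 'sinister') : array();
};

$file = tempnam(sys_get_temp_dir(), 'syn'); // fresh, empty table file
$cache = new SynonymCache($file, $fake);
$cache->get('black');
$cache->get('black'); // served from the table
$cache->get('xyzzy'); // no synonyms, remembered as an empty array
$cache->get('xyzzy'); // served from the table as well
echo $cache->apicalls, "\n"; // prints 2
```

Passing 'getsynonyms' as the second argument would hook it up to the real lookup. One caveat: a failed request also comes back as an empty array, so a real version would probably want to tell "no synonyms" apart from "no answer" before writing a term off.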






