Synonymizer with API

If you want to put some old content back on the net and have it indexed as fresh, unique content, an automated synonymizer works wonders for SEO-friendly backlinks. I want one that makes my content unique without my having to type a single character.

Lucky for me, mister John Watson’s synonym database comes with a free API good for 10,000 requests a day, and boy is it sweet!

API requests are straightforward:
http://words.bighugelabs.com/api/2/[apikey]/[keyword]/xml
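
To make the URL pattern concrete, here is a minimal sketch of a helper that fills in the placeholders. The function name buildSynonymUrl and the key YOUR_API_KEY are made up for illustration, and using rawurlencode() on the path segments is my own assumption, not something the API prescribes.

```php
<?php
// Hypothetical helper (not part of the API): fills in the documented
// pattern /api/2/[apikey]/[keyword]/[format].
function buildSynonymUrl($apikey, $keyword, $format = 'xml') {
    // rawurlencode() percent-encodes characters such as spaces and slashes
    // that would otherwise break the path (an assumption about safe usage)
    return 'http://words.bighugelabs.com/api/2/'
        . rawurlencode($apikey) . '/'
        . rawurlencode($keyword) . '/'
        . $format;
}

echo buildSynonymUrl('YOUR_API_KEY', 'black'), "\n";
// prints: http://words.bighugelabs.com/api/2/YOUR_API_KEY/black/xml
```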

A number of return formats are supported, but XML is the easiest to work with, whether you parse it with SimpleXML or match it with regular expressions.

A request for the keyword "black" returns an XML file like this (slightly shortened):
<words>
  <w p="adjective" r="syn">bleak</w>
  <w p="adjective" r="syn">sinister</w>
  <w p="adjective" r="sim">dark</w>
  <w p="adjective" r="sim">angry</w>
  <w p="noun" r="syn">blackness</w>
  <w p="noun" r="syn">inkiness</w>
  <w p="verb" r="syn">blacken</w>
  <w p="verb" r="syn">melanize</w>
</words>

…which is most easily handled with preg_match_all:

    function getsynonyms($keyword) {
        $pick = array();
        $apikey = 'get your own key';
        $xml = file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');

        if(!$xml) return $pick; //return empty array

        //capture the word between the <w> tags
        preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
        //preg_match_all('/<w p="adjective" r="sim">(.*?)<\/w>/', $xml, $adj_sims);
        //preg_match_all('/<w p="noun" r="syn">(.*?)<\/w>/', $xml, $noun_syns);
        //preg_match_all('/<w p="verb" r="syn">(.*?)<\/w>/', $xml, $verb_syns);

        //index 1 holds the captured words, index 0 the full match including the tags
        foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
        //same for verb/noun synonyms, I just want adjectives

        return $pick;
    }
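
SimpleXML, the other option mentioned earlier, can do the same job without depending on the exact attribute order in the markup. This is only a sketch: parseSynonyms is a made-up name, and the sample string is a cut-down copy of the response shown above rather than live API output.

```php
<?php
// Hypothetical helper: pick words from the XML by part of speech ("p")
// and relation ("r") using SimpleXML instead of a regular expression.
function parseSynonyms($xml, $part = 'adjective', $relation = 'syn') {
    $pick = array();
    // simplexml_load_string() returns false on malformed input;
    // @ hides the parser warnings in that case
    $doc = @simplexml_load_string($xml);
    if ($doc === false) return $pick;

    foreach ($doc->w as $w) {
        if ((string) $w['p'] === $part && (string) $w['r'] === $relation) {
            $pick[] = (string) $w;
        }
    }
    return $pick;
}

// Cut-down copy of the sample response for "black"
$sample = '<words>'
    . '<w p="adjective" r="syn">bleak</w>'
    . '<w p="adjective" r="syn">sinister</w>'
    . '<w p="adjective" r="sim">dark</w>'
    . '<w p="noun" r="syn">blackness</w>'
    . '</words>';

echo implode(', ', parseSynonyms($sample)), "\n";
// prints: bleak, sinister
```

Passing 'noun' or 'verb' as the second argument selects the other word types, which replaces the commented-out patterns in the regex version.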

Practically applying it, I take a slab of stale old content and…

  • strip tags
  • do a regular expression match on all alphanumeric sequences, dropping everything else
  • trim the resulting array elements
  • (merge all blog tags, categories, and a list of common words)
  • exclude those common terms from the array of words
  • exclude short words (N characters or fewer)
  • set the percentage of words to be synonymized
  • attempt to retrieve synonyms for the remaining terms
  • replace these words in the original text and keep count
  • stop when the target replacement percentage is reached
  • return a (hopefully) revived text

    function synonymize($origtext) {

        //make a copy of the original text to dissect
        $content = $origtext;
        //content = $this->body;

        $perc = 3;         //target percentage changed terms
        $minlength = 4;    //candidates must be longer than this
        $maxrequests = 80; //max use of api-requests

        $words = array();
        $words_select = array();
        $total_switched = 0;

        //dump tags
        $content = strip_tags($content);

        //dump non-alphanumeric string characters
        $content = preg_replace('/[^A-Za-z0-9\-]/', ' ', $content);

        //explode on blank space
        $wrds = explode(' ', strtolower($content));

        //trim off blank spaces just in case, skip the empty elements
        foreach($wrds as $wrd) {
            $wrd = trim($wrd);
            if($wrd !== '') $words[] = $wrd;
        }

        //this should be all words
        $wordcount = count($words);

        //how many words do I want changed ?
        $toswitch = round($wordcount*$perc/100);

        //only use uniques
        $words_unique = array_unique($words);

        //sort alphabetically, this also reindexes the array from 0
        sort($words_unique);

        //merge common with tags, categories, linked_tags
        $common = array("never", "about", "price");
        //note : setting the minlength to 4 excludes lots of common terms

        for($i=0;$i<count($words_unique);$i++) {
            //if in common array, not selectable for synonymizing
            if(in_array($words_unique[$i], $common)) continue;
            //only terms longer than minlength
            if(strlen($words_unique[$i])>$minlength) {
                //words_select contains candidates for synonyms
                $words_select[] = $words_unique[$i];
            }
        }

        //terms that can be changed
        $max = count($words_select);

        //no more requests than max
        if($max>$maxrequests) $max = $maxrequests;

        for($i=0;$i<$max;$i++) {
            //get synonyms, give server some time
            usleep(100000);
            //retrieve synonyms etc.
            $these_words = getsynonyms($words_select[$i]);
            if(count($these_words)>0) {
                $count = 0;
                //the replacements are done in the original text, whole words only
                $pattern = '/\b'.preg_quote($words_select[$i], '/').'\b/i';
                $origtext = preg_replace($pattern, $these_words[0], $origtext, -1, $count);
                $total_switched += $count;
            }
            //have we reached the percentage ?
            if($total_switched>=$toswitch) break;
        }
        //okay!
        return $origtext;
    }

    function getsynonyms($keyword) {
        $pick = array();
        $apikey = 'get your own key at bighugelabs.com';
        $xml = @file_get_contents('http://words.bighugelabs.com/api/2/'.$apikey.'/'.urlencode($keyword).'/xml');
        if(!$xml) return $pick;
        preg_match_all('/<w p="adjective" r="syn">(.*?)<\/w>/', $xml, $adj_syns);
        //index 1 holds the captured words without the tags
        foreach($adj_syns[1] as $adj_syn) $pick[] = $adj_syn;
        return $pick;
    }

Nothing fancy, just a straightforward search-and-replace routine. A 1200-word text has about 150 candidates, and for 3% synonyms I need to replace 36 words, which it can do. If I were to use it for real, I would build a table of terms that return no synonyms and store the results for often-used terms. That would speed up the synonymizing, allow the use of preferences, and take load off the API.
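
A rough sketch of that lookup table idea, under a few assumptions: cachedSynonyms is a made-up name, the cache is a plain PHP array standing in for a database table, and $fakeFetch is a stand-in for the real API call so the example runs offline.

```php
<?php
// Hypothetical wrapper: $fetch is any callable that takes a keyword and
// returns an array of synonyms (for example a getsynonyms() function).
// $cache is a plain array here; a database table would play the same role.
function cachedSynonyms($keyword, array &$cache, $fetch) {
    $key = strtolower($keyword);
    // array_key_exists() also hits on cached empty results, so terms
    // without synonyms never cost a second API request
    if (!array_key_exists($key, $cache)) {
        $cache[$key] = call_user_func($fetch, $key);
    }
    return $cache[$key];
}

// Stand-in for the API that counts how often it is called
$requests = 0;
$fakeFetch = function ($word) use (&$requests) {
    $requests++;
    return $word === 'black' ? array('bleak', 'sinister') : array();
};

$cache = array();
cachedSynonyms('black', $cache, $fakeFetch);   // first lookup, one request
cachedSynonyms('Black', $cache, $fakeFetch);   // served from the cache
cachedSynonyms('qwerty', $cache, $fakeFetch);  // no synonyms, cached as empty
cachedSynonyms('qwerty', $cache, $fakeFetch);  // served from the cache
echo $requests, "\n";
// prints: 2
```

Storing the empty results is the part that matters most: words without synonyms are exactly the ones that would otherwise burn a request on every run.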


One Comment

  1. Unique content is key whenever you re-use content. To be sure you always pass any human ‘spot’ check that might occur, I’ve found it helps to take part of the article out and/or reformat the paragraph structure. By that I mean spacing, line breaks, or any other formatting tags you may have used.

    I’ll do that once every 5-6 posts, typically with pretty good results.
