juust ~ php oddities


RedHat Seo : scraper auto-blogging

juust | 26/12/2008

Just give us your endpoint and we’ll take it from there, sparky!

I was going to make one of these tools that scrape Google and conjure a full blog out of nowhere, as a Christmas special: RedHat Seo. The rough sketch has arrived. It is far from perfect, but it does produce a blog and it doesn’t even look too shabby. I scraped a small batch of posts off of blogs, keeping the links intact and adding a tribute link. I hope they will pardon me for it.

structure

I use three main classes,

BlogMaker    the application
Target       the blogs you aim for
WPContent    the scraped goodies

…and two support classes

SerpResult   scraped urls
Custom_RPC   a simple rpc-poster

Every target blog has three text files,

file              contents                          maintenance
blog categories   category you post under           manual
blog tags         tags you list on the blog         manual
blog urls         urls already used for the blog    system
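
The url file is what keeps a scraped page from being posted twice. A minimal sketch of that bookkeeping, assuming one url per line; the UrlRegistry class and its method names are made up for illustration and are not the checkURL/RegisterURL code from the full source:

```php
<?php
// Hypothetical helper class, not the author's Target implementation.
class UrlRegistry
{
    private $file;
    private $seen = array();

    public function __construct($file)
    {
        $this->file = $file;
        if (is_readable($file)) {
            // One url per line; blank lines are skipped.
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            foreach ($lines as $line) {
                $this->seen[trim($line)] = true;
            }
        }
    }

    // True when the url was already used for this blog.
    public function has($url)
    {
        return isset($this->seen[trim($url)]);
    }

    // Remember the url in memory and append it to the text file.
    public function register($url)
    {
        $url = trim($url);
        if ($url === '' || $this->has($url)) {
            return false;
        }
        $this->seen[$url] = true;
        // Exclusive lock so two runs do not interleave their lines.
        return file_put_contents($this->file, $url . PHP_EOL, FILE_APPEND | LOCK_EX) !== false;
    }
}
```

Keeping the urls as array keys makes the lookup cheap even when the file grows to a few thousand lines.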

routine

The BlogMaker class grabs a result list (up to 1000 urls per phrase) from Google, extracts the urls and stores them in SerpResult, scrapes the urls and extracts the entry divs, stores the div entries in the WPContent class (which has some basic functions to sanitize the text), and uses the BlogTarget-definitions to post them up to the blogs with xml-rpc.
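
Extracting the entry divs is the fiddly part. A rough sketch of one way to do it with PHP's DOM extension; the function name extract_divs_by_class is hypothetical, and the full source may well use another approach (regular expressions, for instance):

```php
<?php
// Hypothetical helper: returns the inner HTML of every <div> that carries $class.
// $class is assumed to be a plain class name without quotes.
function extract_divs_by_class($html, $class)
{
    if (trim($html) === '') {
        return array();
    }

    $doc = new DOMDocument();
    // Scraped markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, so "entry" does not match "entry-meta".
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $fragments = array();
    foreach ($xpath->query($query) as $div) {
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $fragments[] = trim($inner);
    }
    return $fragments;
}
```

Calling it with 'entry' first and falling back to 'post' when the result is empty mirrors the order used in the main loop.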

usage

My highlighter tends to mess up text with div markers in it, so copying off the blog may not work.
The full text source (about 500 lines) is over here. Underneath I’ll list the main program loop :

//make main instance
$Blog = new BlogMaker("keyword");

//define a target blog, you can define multiple blogs and refer with code
//then add rpc-url, password and user
//and for every target blog three text-files
$T = $Blog->AddTarget(
    'blogcode',
    'http://my.blog.com/xmlrpc.php',
    'password',
    'user',
    'keyword.categories.txt',
    'keyword.tags.txt',
    'keyword.urls.txt'
);

//read the tags, cats and url text files stored on the server
//all retrieved urls are tested, if the target blog already has that
//scraped url, it is discarded.
$T->CSV_GetTags();
$T->List_GetCats();
$T->ReadURL();

//grab the google result list
//use params (pages, keywords) to specify search
$Blog->GoogleResults();

$a = 0;
foreach ($Blog->Results as $BlogUrl) {
    $a++;
    echo $BlogUrl->url;

    //see if the url isnt used yet
    if ($T->checkURL(trim($BlogUrl->url)) != true) {
        echo '…checking ';
        flush();

        //if not used, get the source
        $BlogUrl->scrape();

        //check for divs marked "entry", if they arent there, check "post"
        //some blogs use other indications for the content
        //but entry and post cover 40%
        $entries = $BlogUrl->get_entries();
        if (count($entries) < 1) {
            echo 'no entries…';
            flush();
            $entries = $BlogUrl->get_posts();
            if (count($entries) < 1) {
                echo 'no posts either…';
                //if no entry-post div, mark url as done
                $T->RegisterURL($BlogUrl->url);
            }
        }

        $ct = 0;
        //in the get_entries/get_post function the fragments are stored
        //as wpcontent
        foreach ($BlogUrl->WpContentPieces as $WpContent) {
            $ct++;

            if ($WpContent->judge(2000, 200, 5)) {
                $WpContent->tribute();  //add tribute link
                $T->settags($WpContent->divcontent); //add tags
                $T->postCustomRPC($WpContent->title, $WpContent->divcontent, 1); //1=publish, 0=draft
                $T->RegisterURL($WpContent->url);  //register use of url
                //20 seconds break, for sitemapping
                //(sleep instead of usleep: usleep over one second is not supported everywhere)
                sleep(20);
            }
        }
    }
}
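
For readers who want to see what a simple rpc-poster boils down to, here is a rough sketch of a metaWeblog.newPost call, which WordPress accepts on xmlrpc.php. Both function names are made up for illustration, the blog id of '1' is an assumption that fits a single-blog install, and this is not the Custom_RPC code from the full source:

```php
<?php
// Hypothetical helper: builds the XML body of a metaWeblog.newPost call.
function build_new_post_request($user, $password, $title, $content, $publish, $blogId = '1')
{
    // Escape text so it is safe inside an XML element.
    $e = function ($text) {
        return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };
    // One <member> of the post struct.
    $member = function ($name, $value) use ($e) {
        return '<member><name>' . $name . '</name>'
             . '<value><string>' . $e($value) . '</string></value></member>';
    };

    return '<?xml version="1.0" encoding="UTF-8"?>'
         . '<methodCall><methodName>metaWeblog.newPost</methodName><params>'
         . '<param><value><string>' . $e($blogId) . '</string></value></param>'
         . '<param><value><string>' . $e($user) . '</string></value></param>'
         . '<param><value><string>' . $e($password) . '</string></value></param>'
         . '<param><value><struct>'
         . $member('title', $title)
         . $member('description', $content)
         . '</struct></value></param>'
         . '<param><value><boolean>' . ($publish ? '1' : '0') . '</boolean></value></param>'
         . '</params></methodCall>';
}

// Hypothetical helper: posts the XML to the rpc-url; needs allow_url_fopen.
function send_xmlrpc($endpoint, $xml)
{
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: text/xml\r\n",
        'content' => $xml,
        'timeout' => 30,
    )));
    // Returns the raw XML response, or false when the request fails.
    return file_get_contents($endpoint, false, $context);
}
```

The scraped html goes into the description member as escaped text; the receiving blog unescapes it again when it parses the request.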

notes

  • xml-rpc needs to be activated explicitly on the wordpress dashboard under settings/writing.
  • categories must be present in the blog
  • url file must be writeable by the server (777)

It seems wordpress builds the sitemap as a background process. The standard google xml sitemap plugin will attempt to build it in the cache (which takes anywhere between 2 and 10 seconds), and apart from building a sitemap the posts also get pinged around. Giving the install 10 to 20 seconds between posts allows all the hooked-in functions to complete.

period

That’s about all,
consider it gpl, I added some comments in the source but I will not develop this any further. A mysql backed blogfarm tool (euphemistically called ‘publishing tool’) is more interesting, besides, I am off to the wharves to do some painting.

if you use it, send some feedback,
merry christmas dogheads

Categories
google, seo, seo tips and tricks, tool, wordpress, xml-rpc
Tags
google, scrape, seo, seo tips and tricks, tool, wordpress, xml-rpc

some thoughts on search engine marketing

juust | 17/11/2008

I want a quick way to check the competitors in a niche and estimate what I can spend on a search engine marketing campaign to make money in the segment, using the search engines (read: Google).

For a quick test I dive into the money segment: I got the top 70 search phrases of October off of 7Search and for each fetched the top 100 results, resulting in a mysql database with 7,000 urls on 3,500 domains.

Then I try three cross-sections to extract the top 25 domains in the segment :

  • (100 – position) / 100 per result
  • (100 – position) / 100 per result, +1 for first 10, +1 for first 6, +1 for first three
  • AOL experience based percentage of search volume

rating by rank

First I rate the sites’ urls based on (100 – position) / 100 per result: place 1 = 0.99, place 100 = 0. That yields this table :

domain score
www.amazon.com 51.05
www.youtube.com 34.82
en.wikipedia.org 34.04
wordpress.com 30.93
www.streetdirectory.com 29.62
money.cnn.com 24.5
www.wikihow.com 22.84
www.experienced-people.co.uk 22.29
ezinearticles.com 22.26
books.google.com 19.22
www.problogger.net 17.52
entrepreneurs.about.com 16.53
www.43things.com 14.75
moneycentral.msn.com 14.74
answers.yahoo.com 14.41

That makes Amazon top dog for all phrases containing ‘money’. I know the first pages get most traffic, and that first method doesn’t express that, so I change the routine and do my old trick.

rating by rank, bonus for front page

(100 – pos) / 100 per result, plus
for place 1-3: 3 points
for place 4-6: 2 points
for place 7-10: 1 point

That gives this table :

domain score
en.wikipedia.org 91.04
www.wikihow.com 67.84
www.amazon.com 65.05
www.youtube.com 56.82
money.cnn.com 50.5
entrepreneurs.about.com 45.53
ezinearticles.com 45.26
www.experienced-people.co.uk 42.29
www.freemoneyfinance.com 31.9
www.streetdirectory.com 31.62
wordpress.com 30.93
www.43things.com 30.75
abcnews.go.com 30.15
moneycentral.msn.com 29.74
moneymakerinfo.blogspot.com 29.55
www.thisismoney.co.uk 27.06
www.moneymakingmommy.com 26.65
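
The two rating schemes above can be sketched in a few lines. The function names and the shape of the result rows (arrays with 'domain' and 'position' keys) are assumptions for illustration; the original routine reads its rows from the mysql database:

```php
<?php
// Base score: place 1 gives 0.99, place 100 gives 0.
function rank_score($position)
{
    return (100 - $position) / 100;
}

// Front page bonus: 3 points for place 1-3, 2 for place 4-6, 1 for place 7-10.
function front_page_bonus($position)
{
    if ($position <= 3) {
        return 3;
    }
    if ($position <= 6) {
        return 2;
    }
    if ($position <= 10) {
        return 1;
    }
    return 0;
}

// Sum the scores per domain and sort the domains from high to low.
function score_domains(array $results, $withBonus = false)
{
    $scores = array();
    foreach ($results as $row) {
        $score = rank_score($row['position']);
        if ($withBonus) {
            $score += front_page_bonus($row['position']);
        }
        if (!isset($scores[$row['domain']])) {
            $scores[$row['domain']] = 0.0;
        }
        $scores[$row['domain']] += $score;
    }
    arsort($scores);
    return $scores;
}
```

Calling score_domains($rows) gives the first rating, score_domains($rows, true) the one with the front page bonus.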

That still does not show what domains actually get the traffic I need for conversion.

It is easy to score in a low traffic niche and anyone can be a winner on long tails like “alabama ski resort”, but long tails only get you a few hits (and I wonder if those are serious?). And these results don’t give me a clue what I can spend (or what my competition would be willing to spend) on the actual traffic spots.

So I am going to estimate what traffic every site gets. I need search volumes and percentages for the serp ranks. I grabbed the search volumes per phrase from 7Search. For percentages, Aaron Wall quotes an old AOL source on the average click through rate of search engine result pages per spot on the first page :

Overall percent of clicks and relative click volume, per position on the first page :

position   percent of clicks   clicks      relative to position 1
1          42.13%              2,075,765
2          11.90%              586,100     3.5x less
3          8.50%               418,643     4.9x less
4          6.06%               298,532     6.9x less
5          4.92%               242,169     8.5x less
6          4.05%               199,541     10.4x less
7          3.41%               168,080     12.3x less
8          3.01%               148,489     14.0x less
9          2.85%               140,356     14.8x less
10         2.99%               147,551     14.1x less

1st page totals: 89.82%, 4,425,226 clicks
2nd page totals: 10.18%, 501,397 clicks

That was what I was looking for. Given these percentages I can estimate the traffic any spot on the front page of Google’s search results would generate, and that yields a more realistic table of the money segment :

domain traffic result result+(1,2,3)
www.bidvertiser.com 737478 6.65 17.65
money.cnn.com 294316 24.5 50.5
en.wikipedia.org 160370 34.04 91.04
www.wikihow.com 132332 22.84 67.84
www.wealthsuccess.usana.com 123503 9.4 24.4
entrepreneurs.about.com 117069 16.53 45.53
www.moneymakingmommy.com 112507 11.65 26.65
www.freemoneyfinance.com 102122 12.9 31.9
www.experienced-people.co.uk 86498 22.29 42.29
www.forbes.com 84393 7.22 19.22
www.netjobs4all.com 75868 4.61 12.61
makemoneyforbeginners.blogspot.com 75399 6.53 15.53
www.missingmoney.com 65326 3.63 12.63
www.youtube.com 60586 34.82 56.82
technology.timesonline.co.uk 55009 3.53 8.53
moneymakerinfo.blogspot.com 54431 13.55 29.55
www.moneyclaim.gov.uk 50544 1.9 5.9
moneycentral.msn.com 39399 14.74 29.74

That shows which sites actively target and get the traffic in the segment, a site like amazon doesn’t show in the top of the last table.
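
The estimate behind that traffic column is search volume times click share, summed per domain. A rough sketch, using the AOL percentages quoted earlier; the function name, the array layout and the choice to ignore everything below place 10 are my assumptions here, not necessarily what the original query did:

```php
<?php
// Click share per front page position, from the AOL figures quoted in the post.
$ctr = array(
    1 => 0.4213, 2 => 0.1190, 3 => 0.0850, 4 => 0.0606, 5 => 0.0492,
    6 => 0.0405, 7 => 0.0341, 8 => 0.0301, 9 => 0.0285, 10 => 0.0299,
);

// $results: rows with 'domain', 'phrase' and 'position' (hypothetical layout).
// $volumes: monthly search volume per phrase.
function estimate_traffic(array $results, array $volumes, array $ctr)
{
    $traffic = array();
    foreach ($results as $row) {
        $position = $row['position'];
        // Assumption: spots below the front page are ignored, as are unknown phrases.
        if (!isset($ctr[$position]) || !isset($volumes[$row['phrase']])) {
            continue;
        }
        if (!isset($traffic[$row['domain']])) {
            $traffic[$row['domain']] = 0.0;
        }
        $traffic[$row['domain']] += $volumes[$row['phrase']] * $ctr[$position];
    }
    arsort($traffic);
    return $traffic;
}
```

A domain with one number one spot on a high volume phrase beats a domain with dozens of spots on page five, which is why the ranking changes so much compared to the first two tables.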

One way of testing the validity of the estimates is comparing with alexa rank, doing a quick SeoQuake toolbar check :

domain est. search traffic alexa rank
www.wealthsuccess.usana.com 200000 20000
moneymakingmommy 110000 70000
freemoneyfinance 100000 60000
makemoneyforbeginners.blogspot.com 75000 140000
problogger.com 6500 43000

That’s roughly correct based on old AOL percentages (well done, Aaron) and October’s search volumes. It does not work for youtube, msn and others that have an alexa rank based on the entire domain. Being a search engine traffic estimate, it doesn’t cover your ‘audience’ (returning visitors on bookmarks and referrals off of other sites, which are included in the alexa ranking) or other segments.

Problogger, for example, would get about 6K hits through Google according to that calculation, but the Alexa rank indicates it has ~160K hits a month. That indicates high return visits and direct traffic, a high usability, and I think they get traffic from other segments (‘blog’ ?).

The use of it

Why did I start doing this ? To estimate a budget for niche penetration.

If I were interested in a niche and had a 1% conversion at $100 per year per conversion, what would I be able to spend to get Bidvertiser’s traffic in the search engines ? With their traffic and my conversion, 730,000/100 = 7,300 conversions per month = 88,000 a year @ $100/conversion = 8.8 million dollars gross revenue on search engine traffic conversions.

If my marketing budget is 20% of gross sales, for a one-year project that amounts to 1.7 million dollars. If I already had 300K hits my budget would be a million.
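
The budget arithmetic above, written out as a small sketch. The function name and parameter names are made up for illustration; the inputs are the numbers from the post, with the unrounded Bidvertiser traffic figure from the table:

```php
<?php
// Hypothetical helper: yearly budget for winning a given amount of monthly search traffic.
function sem_budget($monthlyHits, $conversionRate, $revenuePerConversion, $budgetShare, $existingHits = 0)
{
    // Only traffic you do not have yet counts as new revenue.
    $extraHits          = max(0, $monthlyHits - $existingHits);
    $conversionsPerYear = $extraHits * $conversionRate * 12;
    $grossRevenue       = $conversionsPerYear * $revenuePerConversion;

    return array(
        'conversions_per_year' => $conversionsPerYear,
        'gross_revenue'        => $grossRevenue,
        'budget'               => $grossRevenue * $budgetShare,
    );
}

// Bidvertiser's estimated traffic, 1% conversion, $100 per conversion, 20% marketing budget.
$fromScratch = sem_budget(737478, 0.01, 100, 0.20);
// Same target, but with 300K hits a month already in the pocket.
$withTraffic = sem_budget(737478, 0.01, 100, 0.20, 300000);

printf("budget from scratch: %.0f\n", $fromScratch['budget']);
printf("budget with 300K hits: %.0f\n", $withTraffic['budget']);
```

With the unrounded traffic figure the budgets come out at roughly 1.77 million and 1.05 million dollars, in line with the rounded numbers in the text.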

That (roughly) answers my question.

I have my doubts about the use of a tool like this. It could function as a quick way to scan a segment, make a ‘heatmap’ and pinpoint the soft spots that are easy to penetrate to get a foothold, but that would require some stiff programming and a lot of switches, as well as processing a lot more segments. Nice project for the winter.

Categories
google, sem, tool
Tags
sem, tool

serp tool

juust | 04/11/2008

I was putting together a serp tool with a database and an emailer on a cronjob, mainly because I am lazy. I always get cranky when I have to type in these keywords again, I ain’t a teletubby.

I wanted an automated one with a history that sends me an email every day, so I started putting one together, but I got distracted because the gethost server was shut down and my mom’s Mary Chapel event blog was on it. My old site was also on it, with my crappy blog, but losing that ain’t half as bad as losing yer mom’s Mary Chapel blog, that is bad karma.

So I got her a domain and put it on one of my accounts, installed a new blog and made sure it ranks number one in google again. Apart from that I was putting together a scraper and still needed some cron jobs for the scraper scripts.

I found a nice free cron job site, standard 5 jobs a day. It’s free and it works flawlessly, no hassles. Very nice.

Once I had that I remembered I still had to finish that serp thing as well, so I finished the basic routines today. Still needs some testing, and it could use the yahoo and msn serps. I’ll grab them from a wordpress widget and add some language options, after that it should be a fine tool.

I’ll put the scripts up for download in two or three weeks or something.

Categories
serp, tool
Tags
serp, tool
