juust ~ php oddities


Spidering

juust | 28/08/2008

Someone asked about the ‘pagerank spider’, so I put the code online as is. It isn’t finished, and if I wanted to finish it I would make a few changes.

The main remaining issues are:

  • 1 memory usage
  • 2 how to handle the www.-prefix
  • 3 indexed pages at google
  • 4 http codes

1 Memory usage: a big class uses a lot of memory. A MySQL-backed version avoids that, but it has an extra dependency, takes longer to develop and is slower. I needed a fast spider for quick feedback on a small site.

Check out phpDig: it is a mature open-source(?) spider with a MySQL backend, plus a user group and a forum.

2 The www. prefix: Google has a setting where you can choose to have all indexed pages of a domain represented as either juust.org or www.juust.org. Google hints that this influences page ranking, but gives no straightforward ‘rule’. I have no idea what the actual impact is.
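
For the spider itself, one way to sidestep the question is to treat both host variants as the same site when comparing links. This is only a minimal sketch under that assumption, not the code from the pagerank spider; canonical_host() and same_site() are hypothetical helper names.

```php
<?php
// Sketch: treat juust.org and www.juust.org as one site while spidering.
// canonical_host() and same_site() are hypothetical helper names.

function canonical_host(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // relative or malformed URL: no host to compare
    }
    $host = strtolower($host);
    // Strip one leading "www." so both variants compare equal.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

function same_site(string $a, string $b): bool
{
    $hostA = canonical_host($a);
    $hostB = canonical_host($b);
    return $hostA !== null && $hostA === $hostB;
}
```

With this, same_site('http://juust.org/a', 'http://www.juust.org/b') counts as an internal link, whichever variant Google ends up showing.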

3 Indexed pages: Google indexes and caches your pages when it spiders other sites that link to yours. If the page a link points to was valid at the time, that page is indexed and cached. This matters especially for files you have since dumped, query-result pages and search pages: you cannot remove the cached page, but it still counts toward your site.

Putting search pages on ‘noindex’ is smart, especially if you use one of these funky search box gadgets in your template that can list any result. If someone queries your site for (nasty+term) and publishes that query as a link to your search page, then once the link is followed a page from your site loaded with (nasty+term) is indexed, and you cannot erase it from the cache. Then you have a problem. Give the search page a robots meta tag with content=”noindex”, and try to confine the search to your own domain, or use a profanity filter.
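
The noindex tag and a crude filter can be combined in the head of the search page. The following is a rough sketch, not a real profanity filter: the function names and the $blocked word list are made up for illustration, and the list would have to come from somewhere sensible.

```php
<?php
// Sketch only: $blocked is a placeholder word list, not a real filter.

function is_query_allowed(string $query, array $blocked): bool
{
    // Split the query into lowercase words on anything non-alphanumeric.
    $words = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($words, $blocked)) === 0;
}

function render_search_head(string $query, array $blocked): string
{
    // Only echo the query back into the page if it passes the filter,
    // and always escape it before output.
    $title = is_query_allowed($query, $blocked)
        ? 'Search: ' . htmlspecialchars($query, ENT_QUOTES, 'UTF-8')
        : 'Search';

    // Search result pages always get noindex, whatever the query was.
    return '<meta name="robots" content="noindex">' . "\n"
        . '<title>' . $title . '</title>';
}
```

The noindex tag is emitted unconditionally, so the filter is a second line of defence rather than the only one.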

4 HTTP codes: I looked into them for a link-validator routine two weeks ago. I might add that MySQL backend after all and make a sturdier version, but not in the next few weeks.
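
As an illustration of what a link validator has to decide per link, here is a small sketch that turns an HTTP status line into a verdict. It is not the routine mentioned above; the function names and the verdict labels ('ok', 'dead' and so on) are assumptions chosen for this example.

```php
<?php
// Sketch: classify an HTTP status line for a link validator.
// parse_status_code() and classify_status() are hypothetical names.

function parse_status_code(string $statusLine): ?int
{
    // Matches lines such as "HTTP/1.1 404 Not Found" or "HTTP/2 200".
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

function classify_status(?int $code): string
{
    if ($code === null) {
        return 'unknown';
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';
    }
    if ($code === 404 || $code === 410) {
        return 'dead'; // not found, or gone for good
    }
    if ($code >= 400 && $code < 500) {
        return 'client-error';
    }
    if ($code >= 500 && $code < 600) {
        return 'server-error';
    }
    return 'unknown';
}
```

The status line could come from the first element of the array returned by PHP's get_headers($url). The network call is left out here so the sketch stays self-contained.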

———
Some background info
searchtools.com /robots /robot-checklist

phpDig

Categories
pagerank, php
Tags
pagerank, php
