Someone asked about the 'pagerank spider', so I put the code online as is. It isn't finished, and if I wanted to finish it I would make a few changes.
The main remaining issues are:
- 1 memory usage
- 2 how to handle the www. prefix
- 3 indexed pages at Google
- 4 HTTP codes
1 A big class uses a lot of memory; a MySQL-backed version has an extra dependency, takes longer to develop, and is slower. I needed a fast spider for quick feedback on a small site.
Check out phpDig: they have a mature open-source(?) spider with a MySQL backend, plus a user group and forum.
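A middle ground between an all-in-memory class and a full MySQL backend is an embedded database for the visited-URL set. A purely illustrative sketch in Python with the stdlib sqlite3 module (the class and its methods are my invention, not part of the spider):

```python
import sqlite3

class VisitedSet:
    """Disk-backed set of visited URLs, so the spider's memory use
    stays roughly flat no matter how many pages it crawls."""

    def __init__(self, path=":memory:"):
        # pass a filename instead of ":memory:" to persist between runs
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

    def add(self, url):
        """Return True if the URL was new, False if already seen."""
        try:
            self.db.execute("INSERT INTO visited VALUES (?)", (url,))
            self.db.commit()
            return True
        except sqlite3.IntegrityError:
            # PRIMARY KEY violation: URL was already in the set
            return False
```

The spider would call `add()` on each discovered link and only enqueue it when the call returns True; the trade-off is one disk round-trip per URL instead of a growing in-memory array.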
2 Google has a setting where you can choose to have all indexed domain pages represented as either juust.org or www.juust.org. The documentation hints that this has an influence on page ranking, but gives no straightforward 'rule'. I have no idea what the actual impact is.
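For the spider itself, the pragmatic fix is to normalize hostnames before storing or comparing URLs, so www.juust.org and juust.org count as the same site. A minimal sketch (the helper name is mine, not from the spider):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_host(url):
    """Lowercase the host and strip a leading 'www.' so both
    spellings of a domain deduplicate to one crawled URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

Run every discovered link through this before checking the visited set, otherwise the spider crawls (and counts PageRank for) every page twice.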
3 Google indexes and caches pages when spidering other sites that link to yours. If the link was valid at the time, the page it points to is indexed and cached. Especially with files you dumped, query-result pages, or search pages: you cannot remove the cached page, but it is counted towards your site.
Putting search pages on 'noindex' is smart, especially if you use one of those funky search-box gadgets in your template that can list any result. If someone queries your site for (nasty+term) and publishes the query as a link to your search page, then once that link is followed, a page from your site loaded with (nasty+term) is indexed, and you cannot erase it from the cache; then you have a problem. Put the file on robots="noindex", try to confine the search to your own domain, or use a profanity filter.
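The tag in question is the standard robots meta tag, placed in the `<head>` of the search-result template; nothing spider-specific about it:

```html
<!-- keep search-result pages out of search engine indexes;
     "follow" still lets crawlers follow links on the page -->
<meta name="robots" content="noindex,follow">
```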
4 HTTP codes: I checked them out for a link-validator routine two weeks ago. I might add that MySQL backend after all and make a sturdier version, but not in the next few weeks.
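The core of such a link-validator routine is mapping status codes to verdicts and getting the code without downloading the whole body. A sketch of how that could look (function names and the three-way verdict are my assumptions, not the spider's actual design):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def classify(code):
    """Map an HTTP status code to a link-validator verdict."""
    if code is None:
        return "broken"
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        # urlopen follows redirects itself, so 3xx is rarely seen here
        return "redirect"
    return "broken"  # 4xx client errors, 5xx server errors

def check_link(url, timeout=10):
    """Return (status_code, verdict) for a URL. A HEAD request is
    enough to read the status line without fetching the body."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            code = resp.getcode()
    except HTTPError as e:
        code = e.code  # server answered, but with an error status
    except URLError:
        return None, "broken"  # DNS failure, refused connection, timeout
    return code, classify(code)
```

Some servers answer HEAD with 405; a sturdier version would retry those with GET before declaring the link broken.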
Some background info:
searchtools.com/robots/robot-checklist