PHP PageRank class

I covered the background of this on another blog; this is a simple PHP class script to calculate the PageRank of pages in a site.

2.1.1 Description of PageRank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page’s importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.

From: The Anatomy of a Large-Scale Hypertextual Web Search Engine
(Sergey Brin and Lawrence Page, page hosted at Stanford)
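
A side note on the formula as quoted, worked out as a rough sketch. N (the number of pages) is my own symbol, and I am assuming every page has at least one outgoing link. Add up the formula over all pages: each PR(T)/C(T) term then shows up once for every outgoing link of T, so C(T) times in total.

```latex
\begin{aligned}
\sum_{A} PR(A) &= N(1-d) + d \sum_{T} C(T)\,\frac{PR(T)}{C(T)}
                = N(1-d) + d \sum_{T} PR(T) \\
(1-d)\sum_{A} PR(A) &= N(1-d)
\quad\Longrightarrow\quad \sum_{A} PR(A) = N
\end{aligned}
```

So with (1-d) as the constant term the values add up to the number of pages, an average of 1 per page. To get a sum of one, as the quote says, the constant term would have to be (1-d)/N.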

I checked Ian Rogers' site (he seems to be a dancer, cool); he made some really nice examples of PageRank calculation that are excellent to start with.

I checked things like guaranix, but that was table-based, and I wanted one I can link into a spider and run as one simple calculation regardless of the number of pages.

    for ($ii = 0; $ii < 40; $ii++) {
        foreach ($Web->Pages as $Page) {
            // start each page's sum of shares at zero
            $NewValue = 0;
            foreach ($Page->IncomingLinks as $Link) {
                $ValueInlinkingPage = $Web->Pages($Link->url)->Value;
                $LinksInlinkingPage = $Web->Pages($Link->url)->OutgoinglinksCount;
                $NewValue = $NewValue + $ValueInlinkingPage / $LinksInlinkingPage;
            }
            $Page->Value = .15 + .85 * $NewValue;
        }
    }

From each page that links to you, you get your fair share: its page value divided by its number of outgoing links. Every page starts out at 1, and then you run the calculation 20 to 40 times; on each pass every page collects its shares and passes them on. The page values settle towards a new balance, in math the ‘eigenvector of the normalised matrix’ (duh?). I am too stupid to do it with matrix math, so I use PHP classes.
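
The "20 to 40 times" is a rule of thumb. Here is a minimal sketch of the alternative: stop as soon as no page value moves by more than a small tolerance. It assumes a hypothetical three-page model held in plain arrays instead of classes. The names `$inLinks`, `$outCount` and `$tolerance`, the threshold value and the cap of 100 passes are all my own choices for illustration.

```php
<?php
// Hypothetical model: A links to B and C, B links to C, C links to A.
// $inLinks maps each page to the pages linking to it; $outCount is each page's C().
$inLinks  = ['A' => ['C'], 'B' => ['A'], 'C' => ['A', 'B']];
$outCount = ['A' => 2, 'B' => 1, 'C' => 1];

$d         = 0.85;   // damping factor
$tolerance = 1e-6;   // assumed stopping threshold
$value     = array_fill_keys(array_keys($inLinks), 1.0); // every page starts at 1

$iterations = 0;
do {
    $maxChange = 0.0;
    foreach ($inLinks as $page => $sources) {
        // collect the fair share from every inlinking page
        $shares = 0.0;
        foreach ($sources as $source) {
            $shares += $value[$source] / $outCount[$source];
        }
        $new = (1 - $d) + $d * $shares;
        $maxChange = max($maxChange, abs($new - $value[$page]));
        $value[$page] = $new; // updated in place, like the loop above
    }
    $iterations++;
} while ($maxChange > $tolerance && $iterations < 100); // cap as a safety net

printf("%d iterations, sum %.4f\n", $iterations, array_sum($value));
```

The printed sum should come out close to 3, the number of pages, which is a handy sanity check on the link data.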

This one calculates the rank of site files in a small model; the extended version is fed by a spider. I put it on its own page.

This one is for concepts.

pagerank example

    <?php

    // the spider holds the collection of pages, keyed by url
    class Spider {

        public $index;
        public $MyFiles = array();

        // return the page for $code, creating it on first use
        public function MyFiles($code) {
            if (!isset($this->MyFiles[$code])) {
                $this->MyFiles[$code] = new MyFile($code);
            }
            return $this->MyFiles[$code];
        }
    }

    class MyFile {

        public $url;
        public $Pagerank = 0;
        public $Value = 1;   // every page starts out at 1
        public $MyLinksIn = array();
        public $MyLinksOut = array();
        public $LinkOutCount = 0;

        public function __construct($index) {
            $this->url = $index;
        }

        // return the incoming link from page $code, creating it on first use
        public function MyLinksIn($code) {
            if (!isset($this->MyLinksIn[$code])) {
                $this->MyLinksIn[$code] = new MyLinkIn($code);
            }
            return $this->MyLinksIn[$code];
        }

        // return the outgoing link to page $code, creating it on first use
        public function MyLinksOut($code) {
            if (!isset($this->MyLinksOut[$code])) {
                $this->MyLinksOut[$code] = new MyLinkOut($code);
            }
            return $this->MyLinksOut[$code];
        }
    }

    class MyLinkOut {
        public $url;   // url of the page this link points to
        public $count;
        public $nofollow;
        public $title;
        public $rel;

        public function __construct($index) {
            $this->url = $index;
        }
    }

    class MyLinkIn {
        public $url;   // url of the page this link comes from
        public $count;
        public $nofollow;
        public $title;
        public $rel;

        public function __construct($index) {
            $this->url = $index;
        }
    }

    //here we go : make a spider, call it 'core'
    $core = new Spider;

    //first the outgoing links, a to b, a to c, 2 links, etcetera

    $myfl = $core->MyFiles("A");
    $myL = $myfl->MyLinksOut("B");
    $myL = $myfl->MyLinksOut("C");
    $myfl->LinkOutCount = 2;

    $myfl = $core->MyFiles("B");
    $myL = $myfl->MyLinksOut("C");
    $myfl->LinkOutCount = 1;

    $myfl = $core->MyFiles("C");
    $myL = $myfl->MyLinksOut("A");
    $myfl->LinkOutCount = 1;

    //d links to c, to match the incoming links of c below
    $myfl = $core->MyFiles("D");
    $myL = $myfl->MyLinksOut("C");
    $myfl->LinkOutCount = 1;

    //then the incoming links, a collection that holds the page-ID's connected to the page.
    //later on, i query per page the inlinking pages for value and linkcount
    //and take my fair share

    $myfl = $core->MyFiles("A");
    $myL = $myfl->MyLinksIn("C");

    $myfl = $core->MyFiles("B");
    $myL = $myfl->MyLinksIn("A");

    $myfl = $core->MyFiles("C");
    $myL = $myfl->MyLinksIn("A");
    $myL = $myfl->MyLinksIn("B");
    $myL = $myfl->MyLinksIn("D");

    //calculate pageranks, here I take 40 iterations, but 20 will do as well
    for ($ii = 0; $ii < 40; $ii++) {

        //take the page collection, and for each page...
        foreach ($core->MyFiles as $Fl) {

            //start the sum of shares at zero
            $val = 0;

            //take the incoming links collection
            foreach ($Fl->MyLinksIn as $Li) {
                //retrieve the inlinking page by url (a b c d) and get value and linkcount
                $In = $core->MyFiles($Li->url)->Value;
                $InT = $core->MyFiles($Li->url)->LinkOutCount;
                //keep adding the fair shares
                $val = $val + $In / $InT;
            }
            //set the new page-value to the sum-of-shares
            $Fl->Value = .15 + .85 * $val;
        }

        //print the result
        foreach ($core->MyFiles as $Fl) {
            echo $Fl->Value." ".$Fl->url." ";
        }
        echo "<br />";
    }
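
Entering every link twice (once outgoing, once incoming) and counting the outgoing links by hand leaves room for the lists to disagree. As a rough sketch of another way to feed the same calculation, the incoming links and the link counts can both be derived from one list of links. The `$edges` array and the plain-array layout of `$core` below are my own assumptions, not the spider-fed version; the code needs PHP 7.4 or later because of `??=`.

```php
<?php
// Hypothetical link list [from, to]; a spider would supply this in practice.
$edges = [['A', 'B'], ['A', 'C'], ['B', 'C'], ['C', 'A'], ['D', 'C']];

$core = []; // url => ['value' => float, 'in' => string[], 'out' => int]
foreach ($edges as [$from, $to]) {
    foreach ([$from, $to] as $url) {
        // make sure both ends of the link exist as pages, starting at 1
        $core[$url] ??= ['value' => 1.0, 'in' => [], 'out' => 0];
    }
    $core[$from]['out']++;       // one more outgoing link on the source
    $core[$to]['in'][] = $from;  // one more incoming link on the target
}

// same calculation as before: 40 passes, values updated in place
for ($ii = 0; $ii < 40; $ii++) {
    foreach ($core as $url => $page) {
        $val = 0.0;
        foreach ($page['in'] as $source) {
            $val += $core[$source]['value'] / $core[$source]['out'];
        }
        $core[$url]['value'] = .15 + .85 * $val;
    }
}

foreach ($core as $url => $page) {
    printf("%s %.2f\n", $url, $page['value']);
}
```

For this four-page model the values settle at roughly A 1.49, B 0.78, C 1.58 and D 0.15, which add up to 4, the number of pages.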

I got one of these toolbar-query snippets (which only return the 0-10 result; PageRank itself is calculated on a different scale), so now I can start comparing the spider/calculation with the standing ranks as assigned by Google and tune the model.

3 Comments

  1. I’m sorry, I lost a domain last summer and haven’t had time to completely straighten out the website, by the end of the week I’ll have it fixed.
