Is a PDF file an importance sink? In BackRub (the basic PageRank calculation), an importance sink is a page that is linked to but does not link back into the site, and so drains the ‘juice’ from the site, much like broken links do.
Here’s an excerpt from Eric Enge’s interview with Matt Cutts. Throughout the interview Eric Enge focuses on linkjuice. When he brings up PDFs, Matt Cutts makes clear that he does not want to discuss whether links in PDFs pass PageRank. He does encourage us to prefer HTML versions over PDFs.
Eric Enge: What about PDF files?
Matt Cutts: We absolutely do process PDF files. I am not going to talk about whether links in PDF files pass PageRank. But, a good way to think about PDFs is that they are kind of like Flash in that they aren’t a file format that’s inherent and native to the web, but they can be very useful. In the same way that we try to find useful content within a Flash file, we try to find the useful content within a PDF file. At the same time, users don’t always like being sent to a PDF. If you can make your content in a Web-Native format, such as pure HTML, that’s often a little more useful to users than just a pure PDF file.
Let’s look at Google Search: if you type inurl:.pdf into Google Search (something Pat Marcello also mentions in his SEO news blog article on PDF links) you get a list of PDF documents, for instance The Not So Short Introduction to LaTeX2ε, which has been assigned PR6.
This shows that PDF files are indexed (as Matt Cutts indicates) and, more importantly, that they have PageRank. If so, the PDF document receives linkjuice. The experience of other SEOs suggests that Googlebot does not follow or index links in PDF documents (V7N). If the links in the document are not indexed or followed, the PDF does not pass on its linkjuice: it forms an importance sink and you lose linkjuice.
That makes me wonder how Google handles these PDFs and how that affects the assignment of importance and the indexing of their content.
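To make the sink idea concrete, here is a minimal sketch of a simplified, textbook-style PageRank iteration. It is not Google's actual implementation, and whether Google really treats PDFs this way is exactly what Matt Cutts declined to say. The page names (home, about, doc), the damping factor of 0.85 and the iteration count are assumptions chosen for illustration. The sketch deliberately does not redistribute the rank of a page without outgoing links, so that the drain is visible.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict {page: [pages it links to]}.

    Assumption for illustration: a page with no outgoing links
    (a dangling page) simply keeps its share, so that rank is lost.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base amount.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: nothing is passed on
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page site where the document is HTML and links back home.
site_with_html = {"home": ["about", "doc"], "about": ["home"], "doc": ["home"]}

# Same site, but the document is a PDF whose links are assumed not to be followed.
site_with_pdf = {"home": ["about", "doc"], "about": ["home"], "doc": []}

html_ranks = pagerank(site_with_html)
pdf_ranks = pagerank(site_with_pdf)

print("total rank with HTML doc:", sum(html_ranks.values()))
print("total rank with PDF doc: ", sum(pdf_ranks.values()))
print("home with HTML doc:", html_ranks["home"])
print("home with PDF doc: ", pdf_ranks["home"])
```

In the first site the total rank stays in circulation, while in the second the share that flows into the document never comes back and the home page ends up with less. Many published PageRank variants spread the rank of dangling pages over all pages instead, which would soften this effect.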