How Google (and other SEs) crawl and rank PDF files

Posted October 28th, 2009

One thing links all web 2.0 users: the need to take part in this fantastic new world by contributing to it with web pages, pictures, documents, comments and so on. So it’s not rare to see new web sites containing tons of fresh material, rather than a forum, at the top of the SERPs. Nor is it rare to see document types other than a standard web page there: Word files, PowerPoint presentations, PDFs and so on.

Search engines love text

So if you write a document with special formatting that doesn’t fit well into a web page, or that contains graphics that must be preserved, you can publish it on the web by converting it to a PDF file and make it accessible to the entire world.

Search engines are smart enough to crawl your web pages and index (normally) all the links and documents they contain. Google started indexing PDF documents in 2001, so it is not new to this kind of material, but it recently improved the quality and the user experience by introducing the “Quick View” feature for PDFs.
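To make the crawling step a little more concrete, here is a minimal sketch of how a crawler could spot links to PDF documents inside a page. It is only an illustration, not how Google actually does it: the class name `PdfLinkCollector`, the helper `find_pdf_links` and the `example.com` URLs are all made up for this example, and it assumes a PDF can be recognised from a `.pdf` file extension (a real crawler would more likely rely on the Content-Type the server returns).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .pdf.

    Hypothetical helper for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page being crawled.
        absolute = urljoin(self.base_url, href)
        # Assumption: the file extension is enough to recognise a PDF.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return every PDF link found in the given HTML string."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


sample = '<a href="/docs/report.PDF">Report</a> <a href="about.html">About</a>'
print(find_pdf_links(sample, "http://example.com/"))
```

Only the first link in the sample is collected, because the second one points to an ordinary HTML page.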

Google developed “Quick View” because of the poor quality of the “View as HTML” feature, which was originally developed to “translate” a file into a document readable directly in the browser, for searchers who weren’t interested in downloading it and opening it in a separate application.

Unfortunately, the “View as HTML” feature isn’t perfect, and the layout it produces often doesn’t respect the original. These kinds of problems no longer exist thanks to Quick View, which takes a different approach: it opens the document directly in the browser while keeping the formatting intact.

But, well formatted or not, a PDF document, like any other web page, should be optimized before it shows up in the SERPs.

How can I optimize my PDF file for ranking?

I looked into research papers for possible indicators on how to properly optimize a PDF document, but I was unable to find anything really useful. Since I’m aware (are you?) that Google is particularly interested in details and quality, I decided to spend some time creating a test case that evaluates many different combinations of the same PDF document, to understand which factors really influence PDF ranking.
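As a rough idea of what “many different combinations of the same PDF document” can look like in practice, here is a small sketch that enumerates one variant per combination of on/off factors. Keep in mind that the three factors below are placeholders I picked purely for illustration, and that `FACTORS` and `build_variants` are hypothetical names; they are not the list of factors actually used in the test.

```python
from itertools import product

# Hypothetical factors, chosen only to illustrate the idea.
# They are NOT the factors evaluated in the real test.
FACTORS = {
    "keyword_in_filename": [False, True],
    "keyword_in_title_metadata": [False, True],
    "keyword_in_body_text": [False, True],
}


def build_variants(factors):
    """Return one dict per combination of factor values."""
    names = list(factors)
    variants = []
    combinations = product(*(factors[name] for name in names))
    for index, values in enumerate(combinations, start=1):
        settings = dict(zip(names, values))
        # Give each variant a stable identifier so results can be tracked.
        variants.append({"id": f"variant-{index:02d}", **settings})
    return variants


variants = build_variants(FACTORS)
for variant in variants:
    print(variant)
```

With three on/off factors this yields eight variants of the same document, one per combination, so that a ranking difference can be traced back to the factor that changed.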

The test has been published on a recently registered domain, so I can cleanly evaluate the results that I collect from it.

To stay updated about this test, or to simply have a look at it, point your web browser to my “PDF ranking and indexing test”.


2 Responses to “How Google (and other SEs) crawl and rank PDF files”

  1. Chris L says:

    Very interesting. Are there any stats/opinions on the likelihood of someone clicking on a PDF result in a search engine results page? Personally, I think I am much less likely to click on a PDF result. I guess the likelihood of clicking on a PDF result depends on what you are searching for and the quality of the other results. Academic paper research is one area where I presume people are much more likely to click on a PDF result.

  2. Carl Thomas says:

    It would be interesting to know if normal web pages outrank PDFs, as PDFs used to take a long time to load before broadband and were a pain if you clicked them by mistake.
