
A PDF Harvester in 25 Lines of Python


The goal of this article is to develop a utility that handles the following:

  1. Retrieve HTML from a webpage.
  2. Parse the HTML and extract all references to embedded PDF links.
  3. For each PDF link, download the document and save it locally.

Plenty of third-party libraries can query and retrieve a webpage's links. However, the purpose of this post is to highlight that by combining elements of the Python Standard Library with the Requests package, we can roll our own and learn something while we're at it.
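For reference, here is roughly how the link extraction might look with a third-party parser such as BeautifulSoup (a sketch only; bs4 is not used anywhere in this post):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Collect href values from anchor tags that point at PDF documents.
pdf_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].lower().endswith(".pdf")]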

Step I: Acquire HTML

This is straightforward using requests. Let’s query the Singular Value Decomposition page on Wikipedia:

import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"

# instruct requests object to return HTML as plain text.
html = requests.get(url).text

html[:50]
'<!DOCTYPE html>\n<html class="client-nojs vector-fe'

The HTML has been obtained. Next we’ll identify and extract references to all embedded PDF links.

Step II: Extract PDF URLs from HTML

A cursory review of the HTML from webpages with embedded PDF links revealed the following:

  • Valid PDF URLs will almost always be embedded within an href attribute.
  • Valid PDF URLs will in all cases be preceded by http or https.
  • Valid PDF URLs will in all cases be enclosed by a trailing >.
  • Valid PDF URLs cannot contain whitespace.

After some trial and error, the following regular expression was found to have acceptable performance for our test cases:

"(?=href=).*(https?://\S+\.pdf).*?>"

An excellent site to practice building and testing regular expressions is Pythex. The app allows you to construct regular expressions and determine how they match against the target text. I find myself using it on a regular basis.

Here is the logic associated with steps I and II combined:

import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"

# instruct requests object to return HTML as plain text.
html = requests.get(url).text

# Search html and compile PDF URLs in a list.
pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

for link in pdf_links:
    print(link)
http://www.wou.edu/~beavers/Talks/Willamette1106.pdf
http://www.alterlab.org/research/highlights/pone.0078913_Highlight.pdf
http://math.mit.edu/~edelman/publications/distribution_of_a_scaled.pdf
http://files.grouplens.org/papers/webKDD00.pdf
https://stanford.edu/~rezab/papers/dimsum.pdf
http://faculty.missouri.edu/uhlmannj/UC-SIMAX-Final.pdf

Note that the regular expression is prefixed with an r when passed to re.findall. This instructs Python to interpret what follows as a raw string, so backslashes are not processed as escape sequences.
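A quick illustration of the difference:

print(len("\n"))    # 1 -- one escaped newline character
print(len(r"\n"))   # 2 -- a backslash followed by the letter n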

re.findall returns a list of matches extracted from the source text. In our case, it returns a list of URLs referencing the PDF documents found on the page.

For the last step, we need to retrieve the documents associated with our collection of links and write them to disk. We introduce another module from the Python Standard Library, os.path, which makes it easy to partition a filepath into components so we can retain the original filename when saving each document.

For example, consider the following URL:

https://stanford.edu/~rezab/papers/dimsum.pdf

To capture dimsum.pdf, we pass the URL to os.path.split, which returns a tuple: everything preceding the filename as the first element, and the filename with extension as the second:

import os

url = "https://stanford.edu/~rezab/papers/dimsum.pdf"
os.path.split(url)
('https://stanford.edu/~rezab/papers', 'dimsum.pdf')

This will be used to preserve the filename of the documents we save locally.

Step III: Write PDFs to File

This step differs from the initial HTML retrieval in that we need to request the content as bytes, not text. By calling requests.get(url).content, we access the raw bytes that comprise the PDF, which we then write to file.
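As a quick check, note the difference in return types (a minimal sketch; any of the PDF links above would do):

import requests

resp = requests.get("https://stanford.edu/~rezab/papers/dimsum.pdf")
print(type(resp.text))     # <class 'str'> -- the body decoded as text
print(type(resp.content))  # <class 'bytes'> -- raw bytes, suitable for binary files

Here's the logic for the third and final step: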

import os
import re
import requests

url = "https://en.wikipedia.org/wiki/Singular_value_decomposition"
html = requests.get(url).text
pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)


# Request PDF content and write to file for all entries.
for pdf in pdf_links:

    # Get filename from url for naming file locally.
    pdf_name = os.path.split(pdf)[1].strip()
    
    try:
        r = requests.get(pdf).content
        with open(pdf_name, "wb") as f: 
            f.write(r)
    except Exception:
        print(f"Unable to download {pdf_name}.")
    else:
        print(f"Saved {pdf_name}.")
Saved Willamette1106.pdf.
Saved pone.0078913_Highlight.pdf.
Saved distribution_of_a_scaled.pdf.
Saved webKDD00.pdf.
Saved dimsum.pdf.
Unable to download UC-SIMAX-Final.pdf.

Notice that we wrap the download and with open(pdf_name, "wb")... logic in a try-except block: this handles situations that would prevent our code from downloading a document, such as broken redirects or invalid links.

All in, we end up with 16 lines of code excluding comments. Next we present the full implementation of the PDF Harvester after a little reorganization:

import os.path
import re
import requests


def pdf_harvester(url):
    """
    Retrieve the URL's HTML and extract references to PDFs. Download
    each PDF, writing to the current working directory.

    Parameters
    ----------
    url: str
        Web address to search for PDF links.
    """
    html = requests.get(url).text
    pdf_links = re.findall(r"(?=href=).*(https?://\S+\.pdf).*?>", html)

    for pdf in pdf_links:
        
        # Get filename from url for naming file locally.
        pdf_name = os.path.split(pdf)[1].strip()

        try:
            r = requests.get(pdf).content
            with open(pdf_name, "wb") as f: 
                f.write(r)
        except Exception:
            print(f"Unable to download {pdf_name}.")
        else:
            print(f"Saved {pdf_name}.")
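Calling the function is then a one-liner; the downloaded PDFs land in the current working directory:

pdf_harvester("https://en.wikipedia.org/wiki/Singular_value_decomposition")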