Extract Text from PDF using Python

This article was first published on PyShark and kindly contributed to python-bloggers.

In this tutorial we will explore how to extract text from PDF files using Python.

Introduction

Extracting text from PDF files is a common task when working with reports and research papers.

It’s a tedious task if you do it manually for every file using the available software and online tools.

In this tutorial we will explore how to extract text from PDF files using Python with a few lines of code.

To continue following this tutorial we will need the following Python library: PyPDF2.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following command:

pip install PyPDF2

Sample PDF file

Here is the PDF file we will use in this tutorial: sample_file.pdf, a short document with three pages.

This PDF file will reside in the same folder as main.py, which holds our code, so the folder contains just two files: main.py and sample_file.pdf.
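
If we run the script from a different working directory, a bare file name such as sample_file.pdf will not be found. As a minimal sketch (the helper name path_next_to is our own and not part of PyPDF2), we can build the path relative to the script's folder instead:

```python
from pathlib import Path


def path_next_to(script_file, file_name):
    """Return the absolute path of file_name in the folder that holds script_file."""
    # resolve() turns the script location into an absolute path,
    # parent is its folder, and "/" appends the file name to it
    return Path(script_file).resolve().parent / file_name


# Hypothetical usage inside main.py:
# pdf_file_name = path_next_to(__file__, "sample_file.pdf")
```

The resulting path can be passed to open() in the same way as the plain file name.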


Extract text from PDF using Python

Now we have everything we need and can easily extract text from the PDF file using Python:

# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
from PyPDF2 import PdfReader

# Define path to PDF file
pdf_file_name = 'sample_file.pdf'

# Open the file in binary mode for reading
with open(pdf_file_name, 'rb') as pdf_file:
    # Read the PDF file
    pdf_reader = PdfReader(pdf_file)
    # Iterate over each page in the PDF file
    for page in pdf_reader.pages:
        # Extract text from the given PDF file page
        text = page.extract_text()
        # Print text
        print(text)

And you should get:

Sample Page 1
Sample Page 2
Sample Page 3
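
Printing is fine for a quick check, but we will often want to keep the extracted text. The sketch below is one possible way to do it, assuming a recent PyPDF2 release that provides PdfReader and extract_text(). The function names join_page_texts and pdf_to_txt, and the choice to name the output after the PDF with a .txt extension, are our own assumptions for illustration.

```python
from pathlib import Path


def join_page_texts(pages):
    """Join the text of each page, separating pages with a blank line.

    pages is any iterable of objects with an extract_text() method,
    such as the pages attribute of a PdfReader.
    """
    texts = []
    for page in pages:
        # Pages without extractable text contribute an empty string
        texts.append((page.extract_text() or "").strip())
    return "\n\n".join(texts)


def pdf_to_txt(pdf_path, txt_path=None):
    """Extract all text from pdf_path, write it to a text file, return its path."""
    # Imported here so join_page_texts stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    pdf_path = Path(pdf_path)
    # Assumed default: same file name as the PDF, with a .txt extension
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    reader = PdfReader(str(pdf_path))
    txt_path.write_text(join_page_texts(reader.pages), encoding="utf-8")
    return txt_path
```

Calling pdf_to_txt('sample_file.pdf') should then create sample_file.txt next to the PDF. To process many files, the same function can be called in a loop over Path('.').glob('*.pdf').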

Conclusion

In this article we explored how to extract text from PDF files using Python and PyPDF2.

Feel free to leave comments below if you have any questions or suggestions for edits, and check out more of my Python Programming tutorials.
