In this tutorial we will explore how to extract text from PDF files using Python.
Extracting text from PDF files is a common task when working with reports and research papers.
Doing it manually for every file with desktop software or online tools is tedious.
With Python, the same job takes only a few lines of code.
To follow this tutorial we will need the following Python library: PyPDF2.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following command:
pip install PyPDF2
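To confirm the install worked before moving on, a quick check along these lines may help. This is a minimal sketch: the helper name `is_installed` is made up for illustration, and it only checks that Python can locate the package, not that a particular version is present.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name of the package installed above
    if is_installed("PyPDF2"):
        print("PyPDF2 is available")
    else:
        print("PyPDF2 is missing, run: pip install PyPDF2")
```

If `pip` is not found on the command line, running `python -m pip install PyPDF2` usually works instead.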
Sample PDF file
Here is the PDF file we will use in this tutorial: sample_file.pdf.
This PDF file resides in the same folder as main.py, which holds our code.
My file structure looks like this: main.py and sample_file.pdf sit side by side in one folder.
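Keeping the PDF next to main.py lets the code refer to it by file name alone, but that only works when the script is started from that folder. One possible way to make the path independent of the working directory is to build it from the script's own location. The helper name `sibling_path` below is hypothetical, and `sample_file.pdf` is simply the file name used in this tutorial.

```python
from pathlib import Path


def sibling_path(script_file, file_name="sample_file.pdf"):
    """Return the absolute path of file_name in the same folder as script_file."""
    # resolve() turns the script path into an absolute one, parent is its folder
    return Path(script_file).resolve().parent / file_name
```

Inside main.py this would be called as `sibling_path(__file__)`, since `__file__` holds the path of the running script.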
Extract text from PDF using Python
Now we have everything we need and can easily extract text from the PDF file using Python:
# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
from PyPDF2 import PdfReader

# Define path to PDF file
pdf_file_name = 'sample_file.pdf'

# Open the file in binary mode for reading
with open(pdf_file_name, 'rb') as pdf_file:
    # Read the PDF file
    pdf_reader = PdfReader(pdf_file)
    # Get number of pages in the PDF file
    page_nums = len(pdf_reader.pages)
    # Iterate over each page number
    for page_num in range(page_nums):
        # Read the given PDF file page
        page = pdf_reader.pages[page_num]
        # Extract text from the given PDF file page
        text = page.extract_text()
        # Print text
        print(text)
And you should get:
Sample Page 1
Sample Page 2
Sample Page 3
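Printing each page is fine for a quick look, but the text is often needed as a single string or saved to disk. The following is a rough sketch of one way to do that, not the only one. The function names `join_page_text` and `pdf_to_text_file` and the output file name are illustrative assumptions; `PdfReader`, `pages`, and `extract_text()` are the names used by recent PyPDF2 releases.

```python
def join_page_text(pages, separator="\n"):
    """Join the text of every page into one string.

    `pages` can be any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    # Pages without a text layer (for example scanned images) may yield
    # an empty result, so fall back to an empty string
    return separator.join((page.extract_text() or "").strip() for page in pages)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract all text from pdf_path and write it to txt_path as UTF-8."""
    # Imported here so join_page_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader

    # The pages must be read while the PDF file is still open
    with open(pdf_path, "rb") as pdf_file:
        text = join_page_text(PdfReader(pdf_file).pages)
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
    return text
```

With the tutorial's file this could be called as `pdf_to_text_file("sample_file.pdf", "sample_file.txt")`. Keeping the joining step separate from the file handling makes it easy to try out on any objects that provide `extract_text()`.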
In this article we explored how to extract text from PDF files using Python and PyPDF2.
Feel free to leave a comment below if you have questions or suggestions for edits, and check out more of my Python Programming tutorials.