Convert PDF to docx and docx to PDF using Python

[This article was first published on PyShark, and kindly contributed to python-bloggers]. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In this article we will explore how to convert PDF files into Microsoft Word docx format using Python.

Table of contents


Introduction

In one of our tutorials explaining how to work with PDF files in Python, and specifically how to extract tables from PDF files, we focused on PDF files with tables. However, in a lot of instances, we don’t want to only export certain parts of the PDF, rather than convert the whole PDF file to docx to allow for editing. In this article we will see how to easily and efficiently perform this conversion.

To continue following this tutorial we will need the following Python libraries: pdf2docx and docx2pdf.

If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:

pip install pdf2docx
pip install docx2pdf

Sample files

In order to follow the examples shown below, you will need to have a PDF file to make the conversion to docx format.

The file I will be using for this article is here. And you can also download it from below:

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:


For the docx to PDF conversion, you will need a sample docx file which you can download below:

Once you have downloaded the file, place it in the same folder as you Python file. For example, here is my setup:


How to convert PDF files to docx format using Python

Using pdf2docx library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.


Convert all pages from PDF file to docx format using Python

Method 1:

First step is to import the required dependancies:

from pdf2docx import Converter

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to convert PDF file to .docx:

cv = Converter(pdf_file)
cv.convert(docx_file)
cv.close()

And you should see the sample.docx created in the same directory.


Method 2:

First step is to import the required dependancies:

from pdf2docx import parse

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to convert PDF file to .docx:

parse(pdf_file, docx_file)

And you should see the sample.docx created in the same directory.


Convert a single page from PDF file to docx format using Python

First step is to import the required dependancies:

from pdf2docx import Converter

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to define a list with the numbers of pages you would like to be converted. In our case, the PDF file has 2 pages and let’s say we would like to print the first one. We would define a pages_list and assign the index number of the page (first page is index 0).

pages_list = [0]

Fourth step is to convert PDF file specific pages to .docx:

cv = Converter(pdf_file)
cv.convert(docx_file, pages=pages_list)
cv.close()

How to convert docx files to PDF format using Python

Using docx2pdf library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

First step is to import the required dependancies:

from docx2pdf import convert

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):

docx_file = 'input.docx'
pdf_file = 'output.pdf'

Third step is to convert PDF file to .docx:

convert(docx_file, pdf_file)

And you should see the output.pdf created in the same directory.


Conclusion

In this article we exploredhow to convert PDF files into Microsoft Word docx format and vice versa using Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming tutorials.

The post Convert PDF to docx and docx to PDF using Python appeared first on PyShark.

To leave a comment for the author, please follow the link and comment on their blog: PyShark.

Want to share your content on python-bloggers? click here.