Extract Metadata from PDF using Python

This article was first published on PyShark and kindly contributed to python-bloggers.

In this tutorial we will explore how to extract metadata from PDF using Python.

Introduction

PDF metadata is information about the PDF document itself, such as its title, author, and creation date. These are searchable fields stored in each PDF document, and they can be retrieved programmatically.

To continue following this tutorial we will need the following Python library: pikepdf.

If you don’t have it installed, open “Command Prompt” (on Windows) and install it with the following command:

pip install pikepdf
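
To confirm that the installation worked before moving on, a quick check like the one below can help. It is only a sketch: the helper name is_installed is made up for this example, and it relies on the standard library's importlib.util.find_spec to see whether Python can locate the module.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report whether pikepdf can be imported in the current environment
if is_installed("pikepdf"):
    print("pikepdf is available")
else:
    print("pikepdf is missing; run: pip install pikepdf")
```

If the second message appears, the pip command was most likely run for a different Python installation than the one executing the script.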

Sample PDF

To follow along, we will need a PDF file to work with.

Let’s reuse one of the PDFs we created in one of our previous tutorials. The code below expects it to be saved as webpage.pdf in the current working directory.


Extract metadata from PDF using Python

To extract metadata from PDF using Python, we will follow three simple steps:

  1. Open PDF using pikepdf
  2. Extract metadata from PDF
  3. Print out metadata

Now we can extract the metadata from the PDF using the following code:

import pikepdf

# Open PDF with pikepdf
pdf = pikepdf.Pdf.open('webpage.pdf')

# Extract metadata from PDF
pdf_info = pdf.docinfo

# Print out the metadata
for key, value in pdf_info.items():
    print(key, ':', value)

For the sample PDF, you should get:

/CreationDate : D:20220624153735-04'00'
/Creator : wkhtmltopdf 0.12.6
/Producer : Qt 4.8.7
/Title : wkhtmltopdf
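
The printed keys start with a slash and the creation date is a raw PDF date string, so a little post-processing makes the values easier to use. The following is a minimal sketch, not part of pikepdf: the helper names clean_metadata and parse_pdf_date are made up for this example, and the sample dictionary simply copies the output above so that the snippet runs without a PDF file. It assumes the usual PDF date layout D:YYYYMMDDHHmmSS followed by an optional UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for PDF date strings such as D:20220624153735-04'00'
# Only the year is required; missing fields fall back to defaults
PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[Zz+-])(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?)?$"
)


def clean_metadata(info):
    """Return a plain dict with the leading '/' removed from each key."""
    return {str(key).lstrip("/"): str(value) for key, value in info.items()}


def parse_pdf_date(text):
    """Convert a PDF date string into a datetime object."""
    match = PDF_DATE.match(text.strip())
    if match is None:
        raise ValueError(f"Not a PDF date string: {text!r}")
    part = match.groupdict()

    # Build the time zone only if the string carries one
    tzinfo = None
    if part["sign"] is not None:
        if part["sign"] in "Zz":
            tzinfo = timezone.utc
        else:
            offset = timedelta(
                hours=int(part["tz_hour"] or 0),
                minutes=int(part["tz_minute"] or 0),
            )
            tzinfo = timezone(-offset if part["sign"] == "-" else offset)

    return datetime(
        int(part["year"]),
        int(part["month"] or 1),
        int(part["day"] or 1),
        int(part["hour"] or 0),
        int(part["minute"] or 0),
        int(part["second"] or 0),
        tzinfo=tzinfo,
    )


# Sample values copied from the output above (stand-in for pdf.docinfo)
sample = {
    "/CreationDate": "D:20220624153735-04'00'",
    "/Creator": "wkhtmltopdf 0.12.6",
    "/Producer": "Qt 4.8.7",
    "/Title": "wkhtmltopdf",
}

metadata = clean_metadata(sample)
created = parse_pdf_date(metadata["CreationDate"])

print(metadata["Title"])    # wkhtmltopdf
print(created.isoformat())  # 2022-06-24T15:37:35-04:00
```

With a real file, passing pdf.docinfo to clean_metadata in place of the sample dictionary should work the same way, since the helper only calls items() and converts each key and value with str().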

Conclusion

In this article we explored how to extract metadata from PDF using Python and pikepdf.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming tutorials.
