How to Extract Data from PDF Files

Package installation

First, we need to install PDFQuery and also install Pandas for some analysis and data presentation.


pip install pdfquery
pip install pandas


Import the libraries

import pandas as pd
import pdfquery


Read and convert the PDF files

We will read the pdf file into our project as an element object and load it. Convert the pdf object into an Extensible Markup Language (XML) file. This file contains the data and the metadata of a given PDF page.


The XML defines a set of rules for encoding PDF in a format that is readable by humans and machines. Looking at the XML file using a text editor, we can see where the data we want to extract is.


#read the PDF
pdf = pdfquery.PDFQuery('customers.pdf')
pdf.load()


#convert the pdf to XML
pdf.tree.write('customers.xml', pretty_print = True)
pdf


Access and extract the Data

We can get the information we are trying to extract inside the LTTextBoxHorizontal tag, and we can see the metadata associated with it.


The values inside the text box, [68.0, 231.57, 101.990, 234.893] in the XML fragment refers to Left, Bottom, Right, Top coordinates of the text box. You can think of this as the boundaries around the data we want to extract.


Let’s access and extract the customer name using the coordinates of the text box.


# access the data using coordinates
customer_name = pdf.pq('LTTextLineHorizontal:in_bbox("68.0, 231.57, 101.990, 234.893")').text()

print(customer_name)

#output: Brandon James