How to Extract Tables from PDF in Python

PDF is a popular document format used by organizations and individuals to store a variety of information. Sometimes you may need to extract a specific graphic or table from a PDF so that you can use it elsewhere. In such cases, Python is a handy language. It provides numerous libraries and packages to extract data from PDF documents. In this article, we will learn how to extract tables from PDF using Python.

We will look at how to extract tables from PDF using the tabula-py and Camelot libraries in Python. Let us say your document /home/ubuntu/data.pdf contains the following table.

User_ID	Name	Occupation
1	David	Product Manager
2	Leo	IT Administrator
3	John	Lawyer

We will look at how to extract this table using each of the above mentioned libraries.

1. Using tabula-py

tabula-py is a simple Python wrapper around the Java library tabula-java that lets you easily read tables from a PDF. It requires Java (version 8 or later) to be installed on your system. pip installs tabula-py and its Python dependencies, but it does not install Java itself. You can install tabula-py, along with the tabulate package that we use for printing, with the following commands.

$ pip install tabula-py
$ pip install tabulate

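We will use the following two functions to extract a table from a PDF. Note that they come from two different packages.

read_pdf(): provided by tabula-py, reads the tables in a PDF file and returns them as a list of pandas DataFrames

tabulate(): provided by the tabulate package, formats tabular data as a plain-text table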

Basically, we will first read the tabular data using read_pdf() function and then use tabulate() function to write it in a table format. Here is the code snippet to read table from PDF document and print it in console.

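from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; returns a list of pandas DataFrames
tables = read_pdf("/home/ubuntu/data.pdf", pages="all")

# print each extracted table, using the column names as headers
for df in tables:
    print(tabulate(df, headers="keys", showindex=False))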

When you run the above code in a terminal, it prints the table shown above. For more information, see the tabula-py documentation.
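Printing is useful for a quick check, but you will often want to keep the extracted data. Here is a minimal sketch that writes each table returned by read_pdf() to its own CSV file. The function names csv_path_for() and save_tables_as_csv(), and the data_table_1.csv naming scheme, are illustrative assumptions and not part of tabula-py.

```python
from pathlib import Path


def csv_path_for(pdf_path, index):
    """Build an output path such as data_table_1.csv next to the PDF.

    The naming scheme is an assumption made for this example.
    """
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table_{index}.csv")


def save_tables_as_csv(pdf_path):
    """Extract every table with tabula-py and write each one to a CSV file."""
    # Imported here so that csv_path_for() can be used without Java installed.
    from tabula import read_pdf

    tables = read_pdf(pdf_path, pages="all")  # list of pandas DataFrames
    written = []
    for index, df in enumerate(tables, start=1):
        target = csv_path_for(pdf_path, index)
        df.to_csv(target, index=False)  # skip the DataFrame's row index
        written.append(target)
    return written
```

Calling save_tables_as_csv("/home/ubuntu/data.pdf") should create /home/ubuntu/data_table_1.csv for the first table found, and so on for any further tables.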

2. Using Camelot-py

Camelot is another Python library that lets you easily read tables from PDF files. You can install it using the following command.

$ pip install camelot-py

With Camelot, we will use the following to read tables and print them.

read_pdf(): reads the tables from the given pages of a PDF file

tables[index].df: the table at the given index, as a pandas DataFrame

Here, read_pdf() reads the specified pages of the PDF document (only the first page by default; pass pages="all" to read every page) and returns all the tables it finds in a list-like object. You can refer to the first table that appears in the PDF as tables[0], the second table as tables[1], and so on.
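Each table returned by Camelot also carries a parsing_report dictionary with keys such as accuracy, whitespace, order and page, which can help you spot tables that were extracted poorly. The sketch below flags tables whose accuracy falls below a cutoff. The function names and the 90.0 threshold are assumptions chosen for illustration, not values recommended by Camelot.

```python
def flag_low_accuracy(reports, threshold=90.0):
    """Return (page, order) pairs for tables with accuracy below threshold.

    `reports` is a list of dictionaries shaped like Camelot's parsing_report.
    The default threshold is an arbitrary value chosen for this example.
    """
    return [(r["page"], r["order"]) for r in reports if r["accuracy"] < threshold]


def check_pdf(pdf_path, threshold=90.0):
    """Read every page with Camelot and list the tables that look unreliable."""
    # Imported here so that flag_low_accuracy() can be used on its own.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="all")
    reports = [table.parsing_report for table in tables]
    return flag_low_accuracy(reports, threshold)
```

If check_pdf() returns an empty list, every table met the cutoff. Otherwise, each pair tells you the page and the position on that page of a table worth inspecting by hand.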

Here is the simple code snippet to read and print table from given PDF document.

import camelot

# extract all the tables in the PDF file (only page 1 is read unless pages is given)
tables = camelot.read_pdf("/home/ubuntu/data.pdf", pages="all")

# print the first table as a pandas DataFrame
print(tables[0].df)

When you run the above code in terminal, it will print the above mentioned table in the terminal.

Note that once you have read the tables from a PDF document using the read_pdf() function, each table's df attribute is a pandas DataFrame, so you can use indexes to access specific rows, columns and cell values as per your requirement.
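As an illustration of working with the extracted values, here is a small sketch that turns a table into one dictionary per row. It assumes that the first row of the extracted table holds the column names, which is typical of Camelot output for a table like the one above but is not guaranteed for every PDF. The function names rows_to_records() and first_table_records() are hypothetical.

```python
def rows_to_records(rows):
    """Treat the first row as the header and return one dict per remaining row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def first_table_records(pdf_path):
    """Read the first table on page 1 with Camelot and convert it to records."""
    # Imported here so that rows_to_records() can be used on its own.
    import camelot

    tables = camelot.read_pdf(pdf_path)
    rows = tables[0].df.values.tolist()  # DataFrame -> list of row lists
    return rows_to_records(rows)
```

With records built this way, you can read a single cell by row position and column name, for example records[0]["Name"] for the name in the first data row.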

Here is the official documentation of camelot python.

In most cases, these scripts are part of larger applications or websites. So you can customize the above commands as per your requirement.

In this article, we have learnt how to read and display tables from PDF documents using two libraries – tabula and camelot.

Sreeram Sreenivasan

Sreeram has more than 10 years of experience in web development, Python, Linux, SQL and database programming.
