How to Extract Tables from PDF in Python

PDF is a popular document format used by organizations and individuals to store a variety of information. Sometimes you may need to extract a specific graphic or table from a PDF so that you can use it elsewhere. In such cases, Python is a handy language: it offers numerous libraries and packages for extracting data from PDF documents. In this article, we will learn how to extract tables from PDF using Python.


We will look at how to extract tables from PDF using the tabula-py and Camelot libraries in Python. Let us say your document /home/ubuntu/data.pdf contains the following table.

User_ID  Name   Occupation
1        David  Product Manager
2        Leo    IT Administrator
3        John   Lawyer

We will look at how to extract this table using each of these libraries.


1. Using tabula-py

tabula-py is a simple Python wrapper around the Java library tabula-java that allows you to easily read tables in a PDF. It requires Java to be installed on your system: pip will automatically download and install the required Python dependencies, but it does not install Java itself. You can install tabula-py, along with the tabulate package used for printing, with the following commands.

$ pip install tabula-py
$ pip install tabulate
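
Since tabula-py depends on Java, it can help to confirm that a Java runtime is reachable before running any extraction code. The following is a minimal sketch using only the Python standard library. The helper name java_available() is made up for this illustration, and the check assumes that Java is expected to be found on your PATH.

```python
import shutil


def java_available():
    """Return True if a `java` executable is found on the PATH."""
    # shutil.which() returns the full path to the executable, or None if absent
    return shutil.which("java") is not None


if not java_available():
    print("Java was not found on PATH; install a Java runtime before using tabula-py.")
```

If the message is printed, install Java (or add it to your PATH) and run the check again.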

We will use the following two functions, one from each of the packages installed above, to extract the table from the PDF.

read_pdf(): provided by tabula-py; reads the tables in a PDF file and returns them as a list of pandas DataFrames

tabulate(): provided by tabulate; formats data as a plain-text table

Basically, we will first read the tabular data using the read_pdf() function and then use the tabulate() function to print it in a table format. Here is the code snippet to read the tables from a PDF document and print them in the console.

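from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; returns a list of pandas DataFrames
tables = read_pdf("/home/ubuntu/data.pdf", pages="all")

# print each extracted table along with its column headers
for df in tables:
    print(tabulate(df, headers="keys"))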

When you run the above code, it will print the table shown above in the terminal. See the tabula-py documentation for more information about the library.


2. Using Camelot-py

Camelot is another Python library that allows you to easily read tables from a PDF file. You can install it using the following command.

$ pip install camelot-py

From this library, we will use the following to read tables and print them.

read_pdf(): reads the tables from the given pages of a PDF file

tables[index].df: returns the table at the given index as a pandas DataFrame

Here, read_pdf() reads the specified pages of the PDF document (only the first page, unless you pass the pages argument) and returns all the tables it finds as a list-like object. You can refer to the first table that appears in the PDF as tables[0], the second table as tables[1], and so on.

Here is the simple code snippet to read and print table from given PDF document.

import camelot

# extract the tables from every page of the PDF file
# (by default, read_pdf() only processes the first page)
tables = camelot.read_pdf("/home/ubuntu/data.pdf", pages="all")

# print the first table as a pandas DataFrame
print(tables[0].df)

When you run the above code, it will print the table shown above in the terminal.
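
If the PDF contains more than one table, you may want all of them rather than only the first. Here is a minimal sketch of one way to do that. The function name extract_all_tables() is hypothetical, and Camelot is imported inside the function only so that the definition loads even on a machine where Camelot is not installed.

```python
def extract_all_tables(pdf_path, pages="all"):
    """Return every table Camelot finds on the given pages as a list of DataFrames."""
    # imported lazily; requires camelot-py to be installed when the function is called
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)

    # each item in the result exposes its contents through the .df attribute
    return [table.df for table in tables]
```

For example, extract_all_tables("/home/ubuntu/data.pdf") would return a list of DataFrames, and passing pages="1" would restrict the search to the first page.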

It is important to note that once you have read tables from a PDF document using the read_pdf() function, you can easily work with them using indexes, and access their specific rows, columns and cell values as per your requirement.
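
The following sketch illustrates this kind of row, column and cell access with pandas. To keep it runnable without a PDF file, it uses a hand-built DataFrame modeled on the sample table in this article; this is an assumption for illustration, not the actual output of either library.

```python
import pandas as pd

# Hand-built stand-in for an extracted table (not real library output)
df = pd.DataFrame(
    {
        "User_ID": [1, 2, 3],
        "Name": ["David", "Leo", "John"],
        "Occupation": ["Product Manager", "IT Administrator", "Lawyer"],
    }
)

first_row = df.iloc[0]   # a single row, selected by position
names = df["Name"]       # a single column, selected by label
cell = df.iloc[1, 2]     # one cell: second row, third column

print(first_row.to_dict())
print(names.tolist())
print(cell)
```

Keep in mind that the column labels of a real extracted table may differ. Camelot, for instance, typically labels columns with integers and keeps the header text as the first row, so you may need to select columns by position there.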

See the official Camelot documentation for more details.

In most cases, these scripts are part of larger applications or websites. So you can customize the above commands as per your requirement.

In this article, we have learnt how to read and display tables from PDF documents using two libraries – tabula and camelot.
