PDF is a popular document format used by organizations and individuals to store a variety of information. Sometimes you may need to extract a specific graphic or table from a PDF so that you can use it elsewhere. In such cases, Python is a handy language. It provides numerous libraries and packages to extract data from PDF documents. In this article, we will learn how to extract tables from PDF using Python.
How to Extract Tables from PDF in Python
We will look at how to extract tables from PDF using the tabula-py and camelot libraries in Python. Let us say your document /home/ubuntu/data.pdf contains the following table.
User_ID | Name | Occupation |
1 | David | Product Manager |
2 | Leo | IT Administrator |
3 | John | Lawyer |
We will look at how to extract this table using each of the above mentioned libraries.
1. Using tabula-py
tabula-py is a simple Python wrapper around the Java library tabula-java that lets you read tables in a PDF into pandas DataFrames. It requires Java to be installed on your system; pip installs the Python dependencies but not Java itself. You can install tabula-py, along with the tabulate package used for printing, with the following commands.
$ pip install tabula-py
$ pip install tabulate
We will use the following two functions, one from each package, to extract a table from the PDF and print it.
read_pdf(): reads tables from a PDF file (from tabula-py)
tabulate(): formats tabular data as a plain-text table (from tabulate)
Basically, we will first read the tabular data using the read_pdf() function, which returns a list of DataFrames (one per detected table), and then use the tabulate() function to format each one as a text table. Here is the code snippet to read tables from the PDF document and print them to the console.
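from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; read_pdf returns a list of DataFrames
tables = read_pdf("/home/ubuntu/data.pdf", pages="all")

# print each extracted table as formatted text
for table in tables:
    print(tabulate(table, headers="keys", showindex=False))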
When you run the above code, it will print the table shown earlier in the terminal. Here is more information about the tabula library.
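Because read_pdf() returns a list of DataFrames, a table that spans several pages may come back in pieces. The following is a minimal sketch of stitching such pieces together and exporting them as CSV. The helper name combine_tables and the two hand-built DataFrames are assumptions for illustration only; they stand in for whatever read_pdf() returns for your file, and the sketch assumes all pieces share the same columns.

```python
import pandas as pd


def combine_tables(tables):
    """Stack DataFrames that share the same columns into a single DataFrame."""
    return pd.concat(tables, ignore_index=True)


# Hypothetical stand-ins for the list returned by read_pdf(..., pages="all")
# when the sample table happens to be split across two pages.
page_1 = pd.DataFrame({
    "User_ID": [1, 2],
    "Name": ["David", "Leo"],
    "Occupation": ["Product Manager", "IT Administrator"],
})
page_2 = pd.DataFrame({
    "User_ID": [3],
    "Name": ["John"],
    "Occupation": ["Lawyer"],
})

combined = combine_tables([page_1, page_2])

# to_csv() without a path returns the CSV as a string;
# pass a file path instead to write it to disk.
csv_text = combined.to_csv(index=False)
print(csv_text)
```

If the pieces have different columns, concatenating them this way will introduce empty cells, so it is worth checking each DataFrame's columns before combining.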
2. Using Camelot-py
Camelot is another Python library that lets you read tables from PDF files. You can install it using the following command.
$ pip install camelot-py
With this library, we will use the following function and attribute to read tables and print them.
read_pdf(): reads the tables on the specified pages of a PDF file
tables[index].df: returns the table at the given index as a pandas DataFrame
read_pdf() reads the specified pages of the PDF document (only the first page by default) and returns the tables it finds as a list-like object. You can refer to the first table found as tables[0], the second table as tables[1], and so on.
Here is a simple code snippet to read and print a table from the given PDF document.
import camelot

# extract the tables on the first page (the default); pass pages="all" to read every page
tables = camelot.read_pdf("/home/ubuntu/data.pdf")

# print the first table as a pandas DataFrame
print(tables[0].df)
When you run the above code, it will print the table shown earlier in the terminal.
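If your tables are not on the first page, you can pass a pages argument to read_pdf(). The sketch below wraps that call in a helper that returns every table found together with the page it came from. The function name extract_tables is hypothetical, and the page number is read from each table's parsing_report, which camelot documents as containing a "page" entry; treat this as an illustration rather than the only way to do it.

```python
def extract_tables(pdf_path, pages="1"):
    """Return a list of (page_number, DataFrame) pairs for the given pages.

    pages is passed straight to camelot, e.g. "1", "1,3" or "all".
    """
    # Imported here so that merely defining the helper does not need camelot.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [(table.parsing_report["page"], table.df) for table in tables]


# Example usage (requires camelot and the sample file to exist):
# for page, df in extract_tables("/home/ubuntu/data.pdf", pages="all"):
#     print(f"Table on page {page}:")
#     print(df)
```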
Note that once you have read tables from a PDF document using the read_pdf() function, their data is available as pandas DataFrames, so you can use indexes to access specific rows, columns and cell values as required.
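As a rough illustration of that kind of access, the sketch below works on a hand-built DataFrame that mimics what tables[0].df typically looks like for the sample table: integer column labels, string values, and the header as the first data row. The variable names and the stand-in data are assumptions; with a real PDF you would start from tables[0].df instead.

```python
import pandas as pd

# Stand-in for tables[0].df: camelot usually leaves the header in row 0
# and stores every cell as a string.
raw = pd.DataFrame([
    ["User_ID", "Name", "Occupation"],
    ["1", "David", "Product Manager"],
    ["2", "Leo", "IT Administrator"],
    ["3", "John", "Lawyer"],
])

# Promote the first row to column headers and drop it from the data.
table = raw.iloc[1:].reset_index(drop=True)
table.columns = raw.iloc[0].tolist()

# Convert the ID column from strings to integers.
table["User_ID"] = table["User_ID"].astype(int)

first_row = table.iloc[0]                          # a whole row, by position
names = table["Name"].tolist()                     # a whole column, by label
cell = table.at[1, "Occupation"]                   # one cell: row 1, column "Occupation"
lawyers = table[table["Occupation"] == "Lawyer"]   # rows matching a condition

print(first_row["Name"], names, cell, len(lawyers))
```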
Here is the official documentation of camelot python.
In most cases, these scripts are part of larger applications or websites. So you can customize the above commands as per your requirement.
In this article, we have learnt how to read and display tables from PDF documents using two libraries – tabula and camelot.
Sreeram has more than 10 years of experience in web development, Python, Linux, SQL and database programming.