Sometimes you receive data as a PDF file but need to import it into software such as Excel, which works with CSV files. In such cases, you need to convert the PDF to CSV. Several Python packages can do this conversion. In this article, we will learn how to convert PDF to CSV in Python using the tabula-py module, which extracts tables from PDF files into pandas DataFrames.
How to Convert PDF to CSV in Python
Here are the steps to convert PDF to CSV in Python.
1. Install Java
tabula-py requires Java to be installed on your system, so download and install Java for your operating system before you continue.
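Before moving on, it can help to confirm that Java is reachable from Python. The following is a minimal sketch using only the standard library; the helper name java_available() is made up for this example and is not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs (hypothetical helper)."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # Ask the Java launcher for its version; a zero exit code means it started
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found on PATH.")
    else:
        print("Java not found on PATH; install it before using tabula-py.")
```

If this prints that Java was not found even though you installed it, the Java bin directory is probably missing from your PATH environment variable.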
2. Install tabula-py
Run the following command to install tabula-py.
$ pip install tabula-py
3. Read PDF File
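Next, read the file using the read_pdf() function, for example tabula.read_pdf(pdf_file_location, pages='all'). In current versions of tabula-py, it returns a list of pandas DataFrames, one for each table it finds. Replace pdf_file_location with the location of your PDF file.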
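The read step can be wrapped in a small function that checks the file path first, so that a wrong path fails with a clear message. This is only a sketch: read_tables() is a name invented for this example, and it assumes a current tabula-py version where read_pdf() returns a list of DataFrames.

```python
from pathlib import Path


def read_tables(pdf_file_location, pages="all"):
    """Return the tables found in a PDF as a list of pandas DataFrames.

    read_tables() is a hypothetical wrapper, not a tabula-py function.
    """
    pdf_path = Path(pdf_file_location)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No PDF found at {pdf_path}")

    # Imported here so the path check above works even before
    # tabula-py and Java are set up.
    import tabula

    # pages="all" scans every page of the document
    return tabula.read_pdf(str(pdf_path), pages=pages)
```

You would call it as tables = read_tables("/home/ubuntu/test.pdf") and then pick the table you need, for example tables[0].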
4. Generate CSV File
Once you have a DataFrame, you can export it to a CSV file using its to_csv() function. Replace the placeholder below with the path of the CSV file you want to create.
df.to_csv('CSV File Path')
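In recent tabula-py versions, read_pdf() returns a list of DataFrames when a PDF contains several tables, and these tables often have different columns. In that case you may prefer to write each table to its own CSV file. Here is a minimal sketch of that idea; the function name save_tables() and the table_1.csv naming scheme are assumptions made for this example. The demo uses hand-made DataFrames in place of real read_pdf() output.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_tables(tables, output_dir, prefix="table"):
    """Write each DataFrame in `tables` to its own CSV file and return the paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, table in enumerate(tables, start=1):
        csv_path = out_dir / f"{prefix}_{number}.csv"
        # index=False leaves the DataFrame's row numbers out of the file
        table.to_csv(csv_path, index=False, encoding="utf-8")
        written.append(csv_path)
    return written


if __name__ == "__main__":
    # Stand-ins for the list that tabula.read_pdf() would return
    sample_tables = [
        pd.DataFrame({"name": ["a", "b"], "qty": [1, 2]}),
        pd.DataFrame({"city": ["x"]}),
    ]
    with tempfile.TemporaryDirectory() as tmp_dir:
        for path in save_tables(sample_tables, tmp_dir):
            print(path.name)
```

Each CSV file can then be opened in Excel separately, which avoids mixing unrelated columns in one sheet.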
Here is a code snippet that puts together the above functions. Replace the file paths to PDF and CSV files as per your requirement.
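# Import the required modules
import pandas as pd
import tabula

# Read all tables from the PDF file; read_pdf() returns a list of DataFrames
tables = tabula.read_pdf("/home/ubuntu/test.pdf", pages='all')

# Combine the tables into a single DataFrame
df = pd.concat(tables, ignore_index=True)

# Convert the PDF data into CSV
df.to_csv('/home/ubuntu/test.csv', encoding='utf-8')
print(df)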
In this article, we have learnt how to convert PDF to CSV using Python. You can use this code in your application or script as per your requirement.
The key is to properly import your PDF data into a pandas DataFrame using the tabula-py package. Once you have the DataFrame ready, you can easily export it to CSV using the to_csv() function.