How to Convert PDF to Text in Python

PDF and text files are two common file formats used in organizations, and we often need to convert one into the other. You may need to do this within your application, or bulk convert a large number of PDF files into text files. Sometimes you may receive a data dump as a PDF and need to convert it into a text file so that you can import it into Excel or other software. For all these use cases, it is advisable to write a Python script to automate the conversion. In this article, we will learn how to convert PDF to text in Python.


How to Convert PDF to Text in Python

Here are the steps to convert a PDF file to a text file in Python.

1. Create or Find PDF file

If you already have a PDF file, you can skip to the next step. Otherwise, open a Word document and type some text in it. Open the File menu, click Print, and save the output as a PDF file named, say, 1.pdf.
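If you do not have a word processor at hand, a small test PDF can also be produced from Python itself. The following is only a minimal sketch, using just the standard library: the function name build_minimal_pdf is our own, the page size and font are arbitrary choices, and it assumes the message is plain ASCII text on a single line.

```python
def build_minimal_pdf(message):
    """Return the bytes of a one-page PDF showing `message` in Helvetica."""
    # Escape the characters that are special inside a PDF literal string
    escaped = message.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # The five objects of a minimal document: catalog, page tree, page,
    # page content stream and font
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


# Example use (writes 1.pdf into the current folder):
#     from pathlib import Path
#     Path("1.pdf").write_bytes(build_minimal_pdf("Hello from Python"))
```

A file made this way is handy for quick tests, but real-world PDFs are far more varied, so it is still worth trying your script on the documents you actually need to convert.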

2. Install PyPDF2

Next, you need to install PyPDF2, a pure Python PDF library that allows you to merge, split, crop and transform PDF files. You can also use it to add custom data, viewing options and passwords to PDF files. Here is the command to install this package.

$ pip install PyPDF2

The same command also works on Windows.
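To confirm that the installation worked, you can ask Python which version it finds. This is a small sketch using the standard importlib.metadata module (available from Python 3.8); the helper name installed_version is our own.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed in this Python environment")
    else:
        print(f"PyPDF2 {version} is installed")
```

If the script reports that the package is missing even though pip succeeded, pip and python probably belong to different environments; running python -m pip install PyPDF2 installs into the interpreter you are actually using.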

3. Create Python Script

Create an empty Python script named pdf_to_txt.py.

$ vi pdf_to_txt.py

Add the following code to your Python file.

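from PyPDF2 import PdfReader

# Open the PDF in binary mode; the with block closes it automatically
with open('1.pdf', 'rb') as pdffileobj:
    pdfreader = PdfReader(pdffileobj)

    # Extract the text of every page, in order
    text = ""
    for pageobj in pdfreader.pages:
        text += pageobj.extract_text() + "\n"

# Write the extracted text to the output file ("w" overwrites old content)
with open("/home/ubuntu/1.txt", "w", encoding="utf-8") as file1:
    file1.write(text)

Save and close the file.

Let us look at the above code in detail. First, we import the PdfReader class from the PyPDF2 package. Then we use the open() function to open the PDF file in binary mode as a file object, pdffileobj, and pass it to PdfReader() to create a reader. The reader's pages attribute holds the pages of the document. Page indexes start at 0, so the last page has an index one less than the page count, and asking for a page beyond that raises an error. To avoid this, we loop over pages directly instead of requesting pages by number. For each page, we call the extract_text() function to get its text and append it to the text variable. Lastly, we open the text file using the open() function and call the write() function to write the data to it. We open it in "w" mode so that running the script again replaces the old output instead of appending a second copy.

Please note, this code uses the names found in recent versions of PyPDF2. The older PdfFileReader, numPages, getPage() and extractText() names were removed in PyPDF2 3.0. Also, extract_text() only returns text that is stored in the PDF, so scanned pages that contain only images will produce no text.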
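When a PDF has many pages, it is often useful to see where each page begins in the output. The sketch below shows one way to do this. The function name pages_to_text and the "--- Page N ---" marker are our own choices, and the function only assumes that the reader object has a pages sequence whose items provide extract_text(), as PyPDF2's PdfReader does in recent versions.

```python
def pages_to_text(reader):
    """Join the text of every page in `reader`, with a marker line per page."""
    sections = []
    for number, page in enumerate(reader.pages, start=1):
        # Treat a page with no extractable text as empty instead of failing
        page_text = (page.extract_text() or "").strip()
        sections.append(f"--- Page {number} ---\n{page_text}")
    return "\n\n".join(sections) + "\n"


# Example use with PyPDF2 (assumes 1.pdf exists in the current folder):
#     from PyPDF2 import PdfReader
#     print(pages_to_text(PdfReader("1.pdf")))
```

Because the function only relies on pages and extract_text(), it can be tried out with simple stand-in objects before you point it at a real PDF.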

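Optionally, make the file executable with the following command. This is only needed if you want to run the script directly as ./pdf_to_txt.py, which also requires a shebang line such as #!/usr/bin/env python3 at the top of the file.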

$ chmod +x pdf_to_txt.py

Run the file with the following command.

$ python pdf_to_txt.py

Please note, you can also use the above code on Windows. You just need to create the file using Notepad or some other text editor, and change the output path /home/ubuntu/1.txt to a folder that exists on your system.
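One way to keep the script portable between Linux and Windows is to pass the file names on the command line instead of hard-coding them. Here is a minimal sketch using argparse and pathlib from the standard library. The function name parse_paths and the -o/--output option are our own choices, not something PyPDF2 requires.

```python
import argparse
from pathlib import Path


def parse_paths(argv=None):
    """Return (pdf_path, txt_path) parsed from command-line style arguments."""
    parser = argparse.ArgumentParser(description="Convert a PDF file to text")
    parser.add_argument("pdf", type=Path, help="PDF file to read")
    parser.add_argument("-o", "--output", type=Path,
                        help="text file to write (default: same name with .txt)")
    args = parser.parse_args(argv)
    # Default to a .txt file next to the PDF, whatever the operating system
    txt_path = args.output or args.pdf.with_suffix(".txt")
    return args.pdf, txt_path


# Example use inside the conversion script:
#     pdf_path, txt_path = parse_paths()   # reads sys.argv
```

With this in place, the script could be run as python pdf_to_txt.py 1.pdf on either system, and the text file would be created next to the PDF unless -o names another location.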

You can also customize the above script to loop through all the PDF files in a folder and convert each one to a text file. This will help you bulk convert PDF to TXT files. Here is sample code for the same. We use the glob() function of pathlib to list the PDF files in the folder, and the with_suffix() function to change the suffix of each file path from .pdf to .txt. Please note, with_suffix() is available on pathlib path objects, not on the plain strings returned by os.path.join().

from pathlib import Path
from PyPDF2 import PdfReader

for fpath in Path("/mydir").glob("*.pdf"):
    # Read every page of the current PDF
    with open(fpath, 'rb') as pdffileobj:
        pdfreader = PdfReader(pdffileobj)
        text = "\n".join(pageobj.extract_text() for pageobj in pdfreader.pages)
    # Save the text next to the PDF, e.g. /mydir/a.pdf -> /mydir/a.txt
    with open(fpath.with_suffix(".txt"), "w", encoding="utf-8") as file1:
        file1.write(text)
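In a large batch, a single damaged or password-protected PDF can stop the whole run. The sketch below shows one way to keep going and report the failures at the end. The function name convert_folder is our own, and convert stands for any function you supply that takes a PDF path and a text path, for example one that wraps the PyPDF2 code used in this article.

```python
from pathlib import Path


def convert_folder(folder, convert):
    """Call convert(pdf_path, txt_path) for each PDF file in `folder`.

    Returns (converted, failed): a list of the text paths written and a list
    of (pdf_path, error) pairs for the files that could not be converted.
    """
    converted, failed = [], []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        try:
            convert(pdf_path, txt_path)
        except Exception as error:  # keep going if one file cannot be read
            failed.append((pdf_path, error))
        else:
            converted.append(txt_path)
    return converted, failed
```

After the run, you can print the failed list to see which files need attention, instead of finding out only when the script stops halfway through the folder.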

