PDF and text files are two common file formats used in organizations. Often we need to convert one into the other. You may need to do this within your application, or bulk convert a large number of PDF files into text files. Sometimes you may receive a data dump as a PDF and need to convert it into a text file before you can import it into Excel or other software. For all these use cases, it is advisable to write a Python script to automate the conversion. In this article, we will learn how to convert PDF to text in Python.
How to Convert PDF to Text in Python
Here are the steps to convert a PDF file to a text file in Python.
1. Create or Find PDF file
If you already have a PDF file, you can skip to the next step. Otherwise, open a Word document and type some text in it. Open the File menu, click Print and then click Save. Type your file's name and save it as a PDF file, say, 1.pdf.
2. Install PyPDF2
Next, you need to install PyPDF2, a pure-Python PDF library that allows you to merge, split, crop and transform PDF files. You can also use it to add data to PDFs and to set and view their passwords. Here is the command to install this package.
$ pip install PyPDF2
You can use the above command in Windows also.
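To confirm that the installation worked before writing the script, a quick check like the one below can help. It is only a sketch: the helper name installed_version is made up for this example, and it relies on importlib.metadata from the standard library (Python 3.8 or later) rather than on anything in PyPDF2 itself.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed. Run: pip install PyPDF2")
    else:
        print("PyPDF2 version:", version)
```

If the script reports that the package is missing, the pip command was probably run for a different Python installation than the one running the script.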
3. Create Python Script
Create an empty python script pdf_to_txt.py.
$ vi pdf_to_txt.py
Add the following code to your python file.
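import PyPDF2

# Open the PDF in binary mode and create a reader for it
with open('1.pdf', 'rb') as pdffileobj:
    pdfreader = PyPDF2.PdfReader(pdffileobj)
    # Extract the text of every page, from the first to the last
    text = [page.extract_text() or "" for page in pdfreader.pages]

# Append the extracted text to the output file
with open(r"/home/ubuntu/1.txt", "a", encoding="utf-8") as file1:
    file1.write("\n".join(text))

Save and close the file.
Let us look at the above code in detail. First, we import the PyPDF2 package. Then we use the open() function to open the PDF file in binary mode as the file object pdffileobj. Next, we pass it to PyPDF2.PdfReader() to create a reader for the file object. PdfReader, pages and extract_text() replace the older PdfFileReader, numPages, getPage() and extractText() names, which were removed in PyPDF2 3.0. The reader's pages attribute holds every page in the file, so we loop over it and call extract_text() on each page to get its text. Page indexes start at 0, so the last page of a file with x pages has index x-1; asking for page x+1 would raise an error. Lastly, we open the text file using the open() function and call write() to write the extracted text to it. The file is opened in append mode ("a"), so running the script again adds to the existing file; use "w" instead if you want to overwrite it.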
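Some pages, such as scanned images, may contain no extractable text, and it is often handy to see where each page starts in the output. The sketch below is one possible way to handle both, not the only one. The function names pages_to_text and convert_pdf and the page separator format are assumptions made for this example; the PyPDF2 calls used are PdfReader, its pages attribute and extract_text(), which are the names used by current releases.

```python
def pages_to_text(pages):
    """Join the text of page-like objects, labelling each page.

    `pages` can be any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        # Treat a missing result as an empty page (e.g. a scanned image)
        text = page.extract_text() or ""
        chunks.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(chunks)


def convert_pdf(pdf_path, txt_path):
    """Extract all text from pdf_path and write it to txt_path."""
    import PyPDF2  # imported here so pages_to_text works without the library

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = pages_to_text(reader.pages)
    # "w" overwrites the output file on every run
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
```

With this in place, a call such as convert_pdf("1.pdf", "1.txt") would write the whole document to 1.txt with a marker line before each page.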
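Optionally, make the file executable with the following command. This is not required when you run the script through the python command as shown in the next step.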
$ chmod +x pdf_to_txt.py
Run the file with the following command.
$ python pdf_to_txt.py
Please note, you can also use the above code in Windows. You just need to create the file using Notepad or some other text editor, and change the output path /home/ubuntu/1.txt to a Windows path.
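Instead of hard-coding 1.pdf and the output path, the script could take them from the command line, which also avoids editing paths when moving between Linux and Windows. The following is a rough sketch using argparse from the standard library; the argument names and the default of writing the text file next to the PDF are choices made for this example.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the input PDF path and an optional output path."""
    parser = argparse.ArgumentParser(description="Convert a PDF file to text.")
    parser.add_argument("pdf", type=Path, help="path to the input PDF file")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="path to the output text file")
    args = parser.parse_args(argv)
    if args.output is None:
        # Default: same folder and name as the PDF, with a .txt extension
        args.output = args.pdf.with_suffix(".txt")
    return args
```

The extraction code would then open args.pdf for reading and write to args.output, so the script could be run as python pdf_to_txt.py 1.pdf or with an explicit -o option.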
You can also customize the above script to take a list of PDF files, loop through this list and convert each PDF to a text file one by one. This will help you bulk convert PDF to TXT files. Here is sample code for it. We build each file path as a pathlib.Path object so that we can use its with_suffix() method to change the file extension from .pdf to .txt.

import os
from pathlib import Path
import PyPDF2

# Convert every PDF in /mydir to a text file with the same name
for file in os.listdir("/mydir"):
    if file.endswith(".pdf"):
        fpath = Path("/mydir") / file
        with open(fpath, 'rb') as pdffileobj:
            pdfreader = PyPDF2.PdfReader(pdffileobj)
            # Extract the text of every page in this PDF
            text = [page.extract_text() or "" for page in pdfreader.pages]
        # with_suffix() swaps the .pdf extension for .txt
        with open(fpath.with_suffix(".txt"), "a", encoding="utf-8") as file1:
            file1.write("\n".join(text))
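When converting many files, it can help to first list what will be converted, including files with an upper-case .PDF extension that a check like endswith(".pdf") would miss. The sketch below shows one way to do that with pathlib. The function name plan_conversions is hypothetical, and it only builds the list of input and output paths without reading any PDF.

```python
from pathlib import Path


def plan_conversions(directory):
    """Return (pdf_path, txt_path) pairs for every PDF file in directory."""
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        # Compare the extension in lower case so .PDF and .pdf both match
        if path.is_file() and path.suffix.lower() == ".pdf":
            pairs.append((path, path.with_suffix(".txt")))
    return pairs
```

Printing the pairs before running the actual conversion gives a simple dry run, which is useful for spotting output files that would be appended to or overwritten.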