How to Remove HTML Tags from CSV File in Python

Python is a very useful programming language for writing scripts and automating tedious tasks. For example, you may want to remove HTML tags from a CSV file. If there are multiple files, doing this by hand becomes even more laborious. Such work is best done using a script, and you can use Python or shell scripts for it. In this article, we will learn how to remove HTML tags from a CSV file in Python.


How to Remove HTML Tags from CSV File in Python

There are several ways to remove HTML tags from files in Python.


1. Using Regex

You can define a regular expression that matches HTML tags, and use the re.sub() function to replace every string matching the regular expression with an empty string. Here is a code snippet for this purpose.

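import re

# Match '<', then as few characters as possible, then '>'
CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    # Replace every match of the tag pattern with an empty string
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext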

In the above code, we import the re library to work with regular expressions. We define a regular expression that treats any string between the < and > characters as an HTML tag, and compile it as CLEANR. We define a function cleanhtml() that takes an HTML string as raw_html. We use the re.sub() function to find all strings matching our regular expression and replace them with an empty string.
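As a quick check of the function described above, here is a minimal sketch that runs it on a made-up string. The sample text is only an assumption for illustration and does not come from any real file.

```python
import re

# Same pattern as above: '<', then as few characters as possible, then '>'
CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    # Replace every match of the tag pattern with an empty string
    return re.sub(CLEANR, '', raw_html)

sample = '<p>Hello, <b>world</b>!</p>'  # hypothetical input
print(cleanhtml(sample))  # prints: Hello, world!
```

Keep in mind that this pattern treats anything between < and > on the same line as a tag. A cell such as '1 < 2 and 3 > 2' would therefore lose the text between the two symbols.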

Sometimes, there may be HTML entities such as &nbsp; that are not enclosed within < and >. To remove such entities as well, you can change your regular expression to the following.

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
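The following sketch shows what the extended pattern does to a made-up string. The sample text is an assumption for illustration. The second half is an optional alternative that is not part of the method above: instead of deleting entities, it decodes them with html.unescape() from Python's standard library.

```python
import html
import re

# Tags, plus named (&nbsp;), decimal (&#169;) and hex (&#xa9;) entities
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    # Delete every tag and every entity matched by the pattern
    return CLEANR.sub('', raw_html)

sample = '<td>Fish&nbsp;&amp;&nbsp;Chips &#169; 2020</td>'  # hypothetical input
print(cleanhtml(sample))  # prints: FishChips  2020

# Alternative: strip only the tags, then decode the entities instead
decoded = html.unescape(re.sub('<.*?>', '', sample))
print(decoded)
```

Note that the pattern is case-sensitive, so an entity with uppercase letters such as &Eacute; is left in place. If your data contains such entities, one option is to pass re.IGNORECASE as the second argument to re.compile().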

You can call the above function while working with files or dataframes. Here is an example to replace HTML tags in a CSV file.

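# Read all lines of the CSV file into a list
a_file = open("sample.csv", "r")
lines = a_file.readlines()
a_file.close()

# Re-open the same file for writing and write back the cleaned lines
new_file = open("sample.csv", "w")
for line in lines:
    line = cleanhtml(line)
    new_file.write(line)
new_file.close()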

In the above code, we open the file sample.csv in 'read' mode using the open() function. We then call readlines() to read the file content into a Python list, and close the file. Next, we re-open the file in 'write' mode using open(), which empties it. We loop through the list items one by one and call the cleanhtml() function on each line to remove any HTML tags in it. We write each cleaned line back to the CSV file, and finally close the file.

If you don't want to write the cleaned content back to the same file, you can open a different file as the new_file variable.
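Here is a minimal sketch of writing the cleaned content to a separate file. It swaps the line-by-line approach for Python's built-in csv module so that each cell is cleaned individually. The clean_csv() helper, the cleaned.csv filename and the sample row are assumptions made for illustration. The demo runs in a temporary directory so that no existing file is overwritten.

```python
import csv
import os
import re
import tempfile

CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    return CLEANR.sub('', raw_html)

def clean_csv(src_path, dst_path):
    """Copy src_path to dst_path, stripping HTML tags from every cell."""
    # newline='' lets the csv module handle line endings itself
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([cleanhtml(cell) for cell in row])

# Demo on a throwaway file with made-up content
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'sample.csv')
dst_path = os.path.join(tmp_dir, 'cleaned.csv')  # hypothetical output name
with open(src_path, 'w', newline='') as f:
    f.write('id,comment\n1,"<p>Good, <b>fast</b> service</p>"\n')

clean_csv(src_path, dst_path)
with open(dst_path, newline='') as f:
    print(f.read())
```

Because the original file is left untouched, you can compare the two files before deciding to replace the original.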

Here is the full code for your reference. Replace the filenames as per your requirement.

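import re

# Match '<', then as few characters as possible, then '>'
CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    # Replace every match of the tag pattern with an empty string
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

# Read all lines of the CSV file into a list
a_file = open("sample.csv", "r")
lines = a_file.readlines()
a_file.close()

# Re-open the same file for writing and write back the cleaned lines
new_file = open("sample.csv", "w")
for line in lines:
    line = cleanhtml(line)
    new_file.write(line)
new_file.close()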


2. Using BeautifulSoup

You can also use BeautifulSoup to remove HTML tags from CSV files. You can install it using the command 'pip install beautifulsoup4'. BeautifulSoup works with Python's built-in html.parser, but you are free to use other parsers as well. For our example, we will use the lxml parser, which is generally faster. You can install it using the command 'pip install lxml'. Here is a simple code snippet to remove HTML tags from a string.

from bs4 import BeautifulSoup
raw_html = "<p>Hello <b>world</b></p>"  # sample input string
cleantext = BeautifulSoup(raw_html, "lxml").text

In the above code, we simply pass the HTML string to BeautifulSoup() along with the name of a parser, and read its .text property to get the string without HTML tags.
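If lxml is not available, the same idea can be sketched with Python's built-in parser. The example below assumes that only the beautifulsoup4 package is installed, and the sample string is made up for illustration. It also shows that BeautifulSoup decodes HTML entities such as &amp; while removing tags.

```python
from bs4 import BeautifulSoup

raw_html = '<p>Fish &amp; <b>Chips</b></p>'  # hypothetical input

# "html.parser" ships with Python, so no extra parser package is needed
soup = BeautifulSoup(raw_html, 'html.parser')

# get_text() returns the text of the document without any tags
cleantext = soup.get_text()
print(cleantext)  # prints: Fish & Chips
```

Here get_text() returns the same text as the .text property.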

Here is the full code to remove HTML tags from CSV file using BeautifulSoup in Python.

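from bs4 import BeautifulSoup

# Read all lines of the CSV file into a list
a_file = open("sample.csv", "r")
lines = a_file.readlines()
a_file.close()

# Re-open the same file for writing and write back the cleaned lines
new_file = open("sample.csv", "w")
for line in lines:
    # Parse the line and keep only its text content
    line = BeautifulSoup(line, "lxml").text
    new_file.write(line)
new_file.close()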

We have learnt a couple of different ways to remove HTML tags from CSV file using Python.
