
Remove Accents/Diacritics from String in Python

Accent marks and diacritics indicate how a letter should be pronounced. They are common in languages other than English. But sometimes you may need to remove accents/diacritics from a string in Python, for example if your website or app does not support them. In this article, we will learn how to do this.

Remove Accents/Diacritics from String in Python

Let us say you have the following accented string.

string = u'Málaga'

As you can see, it contains á, which is an accented character. Some backend databases and legacy systems do not support storing such characters, so you may need to remove them or replace them with equivalent characters from your application’s character set.

There are multiple Python libraries available for this purpose.

We will use the unicodedata module. It is part of Python's standard library, so you do not need to install it separately.

import unicodedata

We define a function remove_accents() that uses the normalize() function available in this module to remove accents from a string.

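def remove_accents(s):
    # Decompose accented characters into base character + combining marks,
    # then drop the combining marks (Unicode category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

remove_accents(string) # 'Malaga'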

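In the above code, calling normalize() with the NFD argument decomposes each accented character into its base character followed by one or more combining marks. For example, á becomes a followed by the combining acute accent (U+0301). We then drop every character whose Unicode category is Mn, which stands for Nonspacing Mark, so only the base characters remain.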
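To see what normalize() does to the example string, here is a minimal sketch that inspects the NFD result character by character. The variable names s and decomposed are only for illustration, and the string is written with an escape sequence so that the input is guaranteed to contain the single precomposed character á.

```python
import unicodedata

s = "M\u00e1laga"  # 'Málaga' with a precomposed á (U+00E1)
decomposed = unicodedata.normalize("NFD", s)

# The decomposed string is one character longer: á became 'a' + U+0301
print(len(s), len(decomposed))

# Show the category and name of the first three characters
for c in decomposed[:3]:
    print(repr(c), unicodedata.category(c), unicodedata.name(c))
```

The third character is reported with category Mn, which is why filtering on that category removes the accent and keeps the letter.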

If you want to further restrict the result to ASCII characters, you can use the encode() function on the normalized string as shown below. Since encode() returns bytes, we decode the result to get a string again.

import unicodedata

def remove_accents(s):
    nfd_form = unicodedata.normalize('NFD', s)
    # 'ignore' drops every character that cannot be encoded as ASCII
    only_ascii = nfd_form.encode('ASCII', 'ignore').decode('ASCII')
    return only_ascii

remove_accents(string) # 'Malaga'
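One thing to keep in mind with the encode() approach is that the 'ignore' error handler silently drops any character that has no ASCII form after decomposition. The sketch below illustrates this with a hypothetical helper named to_ascii() (not part of any library), which decodes the bytes back into a string.

```python
import unicodedata

def to_ascii(s):
    # Hypothetical helper: decompose, then drop everything that is not ASCII
    decomposed = unicodedata.normalize("NFD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("M\u00e1laga"))  # Malaga
print(to_ascii("Stra\u00dfe"))  # Strae  (ß has no decomposition, so it is lost)
print(to_ascii("S\u00f8ren"))   # Sren   (ø has no decomposition, so it is lost)
```

If losing letters like this is not acceptable for your data, a transliteration library may be a better fit than plain normalization.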

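In the above code, you can also use NFKD instead of NFD in the normalize() function, depending on your requirement. NFKD additionally replaces compatibility characters, such as ligatures, with their plain equivalents. Here are the different types of normalization available.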

NFC: Normalization Form Canonical Composition
NFD: Normalization Form Canonical Decomposition
NFKC: Normalization Form Compatibility Composition
NFKD: Normalization Form Compatibility Decomposition
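The difference between the four forms is easier to see on a concrete string. The following sketch is only an illustration: it uses a sample word containing the ﬁ ligature (U+FB01) and a precomposed é (U+00E9), and the names text and lengths are chosen for this example.

```python
import unicodedata

# "ﬁancé" written with the U+FB01 ligature and a precomposed é (U+00E9)
text = "\ufb01anc\u00e9"

lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, text)
    lengths[form] = len(normalized)
    # Print the form, the resulting length and the code points
    print(form, len(normalized), [f"U+{ord(c):04X}" for c in normalized])
```

The canonical forms (NFC, NFD) keep the ligature as a single character, while the compatibility forms (NFKC, NFKD) split it into f and i. The decomposition forms (NFD, NFKD) also separate é into e and a combining accent, which is what makes the accent removable.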

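Alternatively, you can use the third-party unidecode library for this purpose. It is not part of the standard library, so you need to install it first, for example with pip install Unidecode.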

import unidecode

accented_string = 'Málaga'
unaccented_string = unidecode.unidecode(accented_string)
print(unaccented_string) # Malaga

In this article, we have learnt several ways to remove accents/diacritics from strings in Python. You can use them to easily convert your accented strings to unaccented Unicode or ASCII ones.

