NLP: A Comprehensive Guide to Text Cleaning and Pre-Processing

Gaurav Singh Tanwar
7 min read · Apr 24, 2022

Treating the Raw Text in the right way

Photo by Rey Seven on Unsplash

Natural Language Processing, NLP for short, is a field in Artificial Intelligence that deals with linguistics and human language. NLP deals with interactions between computers and human languages. It enables computers to understand human languages and analyze them to perform tasks like classification, summarization, Named Entity Recognition, etc.

But before a computer program can analyze human language, a lot of cleaning and pre-processing steps take place. Preprocessing and cleaning of text data have a huge impact on the performance of models. Hence, text cleaning and processing hold huge importance in the life cycle of an NLP project.

In this article, we will go through some text cleaning and processing techniques and will implement them using Python programming language.

Removing HTML Tags

Raw text may contain HTML tags, especially if the text was extracted using techniques like web or screen scraping. HTML tags are noise and don't add much value to understanding and analyzing text. Hence, they should be removed. We will use the BeautifulSoup library for removing HTML tags.

from bs4 import BeautifulSoup

def remove_html_tags(text):
    return BeautifulSoup(text, 'html.parser').get_text()

print(remove_html_tags('<p>A part of the text <span>and here another part</span></p>'))
>> A part of the text and here another part

Case-Standardization

It is one of the most common preprocessing steps in NLP where the text is converted into the same case, more often than not into lower case.

def to_lowercase(text):
    return text.lower()

print(to_lowercase("Learning NLP is Fun...."))
>> learning nlp is fun....

But this step can lead to information loss in some NLP tasks. For example, in a sentiment analysis task, words written in upper case can signify strong emotions like anger, excitement, etc. In such cases, we might want to perform this step differently or may even avoid it, as in the sketch below.
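A minimal sketch of one such variation, assuming we simply want to keep fully capitalized words untouched for a sentiment-style task (the function name and the all-caps rule are illustrative choices, not a standard approach):

def to_lowercase_preserving_emphasis(text):
    # illustrative helper: lowercase everything except fully capitalized words,
    # so emphasis markers like "GREAT" survive for sentiment analysis
    return ' '.join(word if word.isupper() else word.lower() for word in text.split())

print(to_lowercase_preserving_emphasis("The movie was GREAT and Totally worth it"))
>> the movie was GREAT and totally worth it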

Standardizing Accent Characters

Sometimes, people use accented characters like é, ö, etc. to signify emphasis on a particular letter during pronunciation. In some instances, accent marks also clarify the semantics of words, which might be different without accent marks. Though you might encounter accented characters very rarely, it’s a good practice to convert these characters into standard ASCII characters.

import unicodedata

def standardize_accented_chars(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(standardize_accented_chars('Sómě words such as résumé, café, prótest, divorcé, coördinate, exposé, latté.'))
>> Some words such as resume, cafe, protest, divorce, coordinate, expose, latte.

Dealing with URLs

People often use URLs, especially on social media, to provide extra information about the context. URLs don't generalize across samples and hence are noise. We can remove URLs using regular expressions.

import re

def remove_url(text):
    return re.sub(r'https?:\S*', '', text)

print(remove_url('using https://www.google.com/ as an example'))
>> using as an example

Note: In some tasks, URLs can add extra information to the data. For example, in a text classification task that aims to detect if a social media comment is an advertisement or not, the presence of URLs in a comment can provide useful information. In such a case, we can replace URLs with a custom token like <URL> or add an extra binary feature in the feature vector that corresponds to the presence of a URL.
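As a rough sketch of the token-replacement idea, reusing the same regex as above (the <URL> placeholder is only an illustrative choice, not a standard token):

import re

def replace_urls_with_token(text, token='<URL>'):
    # replace every URL with a placeholder token instead of dropping it,
    # so the model still knows a URL was present
    return re.sub(r'https?:\S*', token, text)

print(replace_urls_with_token('Buy now at https://example.com/shop'))
>> Buy now at <URL>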

Expanding Contractions

Contractions are shortened versions of words or syllables. They are created by removing one or more letters from words; sometimes, multiple words are combined to create a contraction. For example, I will is contracted into I'll, and do not into don't. Treating I will and I'll differently might result in poor performance of the model. Hence, it's a good practice to convert each contraction into its expanded form. We can use the contractions library to convert contractions into their expanded form.

import contractions

def expand_contractions(text):
    expanded_words = []
    for word in text.split():
        expanded_words.append(contractions.fix(word))
    return ' '.join(expanded_words)

print(expand_contractions("Don't is same as do not"))
>> Do not is same as do not

Removing Mentions and Hashtags

This step comes into effect when dealing with social media text data, for example, Tweets. Mentions and hashtags don’t generalize across samples and are noise in most NLP tasks. Hence, it’s better to remove these.

import re

def remove_mentions_and_tags(text):
    text = re.sub(r'@\S*', '', text)
    return re.sub(r'#\S*', '', text)

# testing the function on a single sample for explanation
print(remove_mentions_and_tags('Some random @abc and #def'))
>> Some random and

The above output might not make much sense to humans, but it helps in improving the performance of models.

Note: In this step, we are removing the text that comes after '@' and '#'. Also, this step should be performed before removing special characters (a step that we are going to look at next) from the text.

Removing Special Characters

Special characters are non-alphanumeric characters. Characters like %, $, &, etc. are special characters. In most NLP tasks, these characters add no value to text understanding and induce noise into algorithms. We can use regular expressions to remove special characters.

import re

def remove_special_characters(text):
    # define the pattern of characters to keep
    pat = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    return re.sub(pat, '', text)

print(remove_special_characters("007 Not sure@ if this % was #fun! 558923 What do# you think** of it.? $500USD!"))
>> 007 Not sure if this was fun! 558923 What do you think of it.? 500USD!

Removing Digits

In most NLP tasks, digits in the text don't add extra information and only induce noise into algorithms. Hence, it's a good practice to remove digits from the text. Again, we can use regex to achieve this.

# imports
import re

def remove_numbers(text):
    # remove every digit sequence
    pattern = r'\d+'
    return re.sub(pattern, '', text)

print(remove_numbers("You owe me 1000 Dollars"))
>> You owe me Dollars

Removing Punctuation

Again, punctuation marks don't add extra information to the data in most NLP tasks, and hence we remove them.

import string

def remove_punctuation(text):
    return ''.join([c for c in text if c not in string.punctuation])

print(remove_punctuation('On 24th April, 2005, "Me at the zoo" became the first video ever uploaded to YouTube.'))
>> On 24th April 2005 Me at the zoo became the first video ever uploaded to YouTube

Lemmatization

Lemmatization generates the root form of words from their inflected forms. For example, for the root word 'play', 'playing' is an inflected form. Notice that both play and playing mean almost the same thing, and it would be better if our model treated playing the same as play. To achieve such conversions, we use lemmatization. Lemmatization makes use of the vocabulary and morphological analysis of words to generate the root form of a word. We will use the spaCy library for performing lemmatization.

import spacy

def lemmatize(text, nlp):
    doc = nlp(text)
    lemmatized_text = []
    for token in doc:
        lemmatized_text.append(token.lemma_)
    return " ".join(lemmatized_text)

nlp = spacy.load('en_core_web_sm')
print(lemmatize('Reading NLP blog is fun.', nlp))
>> read NLP blog be fun .

Before lemmatization, stemming was used to reduce inflected words to their root forms. But stemming does not consider the vocabulary of the language and removes inflections from words using some rule-based approach. In many cases, stemming produces words that are not part of language vocabulary. Hence, lemmatization is almost always preferred over stemming. Still, I am providing code to perform stemming.

import nltk

def get_stem(text):
    stemmer = nltk.porter.PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in text.split()])

print(get_stem("studying history is a good habit"))
>> studi histori is a good habit

Notice that studying has become studi and history has become histori, neither of which is a valid English word.

Removing Stopwords

Stopwords like I, am, me, etc., don't add any information that can help in modeling. Keeping them adds noise and increases the dimensionality of feature vectors, which badly affects computation cost and model accuracy. Hence, it is advisable to remove them. We will use the spaCy library to remove stop words. spaCy has 326 words in its set of stop words. In some cases, we might want to add custom stop words, i.e. words that act as stop words for our task but are not in spaCy's set of stop words. In other NLP tasks, we might instead want to remove some words from spaCy's stop word set. For example, in sentiment analysis tasks, we would like to keep negation words like 'not', 'neither', 'nor', etc. in the text, and hence we would remove them from spaCy's set of stop words (see the sketch after the code below).

import spacy

def remove_stopwords(text, nlp):
    filtered_sentence = []
    doc = nlp(text)
    for token in doc:
        if not token.is_stop:
            filtered_sentence.append(token.text)
    return " ".join(filtered_sentence)

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print(remove_stopwords('I love going to school', nlp))
>> love going school
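Here is a rough sketch of the customization mentioned above; the custom stop word 'btw' and the sentiment-task negation words are only illustrative choices:

# add a custom stop word
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

# keep negation words for a sentiment task by removing them from spaCy's stop word set
for negation in ("not", "neither", "nor"):
    nlp.Defaults.stop_words.discard(negation)
    nlp.vocab[negation].is_stop = False

print(remove_stopwords('btw the food was not good', nlp))
>> food not good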

Conclusion

We have discussed some common and useful text cleaning and processing techniques. You might have observed that I have implemented these functions separately; I kept the steps separate for a better explanation of each one, but while implementing them in a project, some of them can be combined. Also, a few of these steps might not be required for certain NLP tasks, and I have tried to explain which steps can be ignored in which situations. Ultimately, the decision to include a step depends on the problem statement. There are also libraries that perform the same steps with minimal code, but I have tried to explain the steps with code that lets one think about what happens behind the scenes. With an understanding of these steps, migrating to those libraries will not be tricky. Thanks!!! See you in the next article.
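As a final illustration, here is a minimal sketch of how some of the functions above could be chained into a single cleaning pipeline; the ordering shown is one reasonable choice, not the only one:

def clean_text(text, nlp):
    # strip markup and noisy tokens first, then normalize, then lemmatize and remove stop words
    text = remove_html_tags(text)
    text = remove_url(text)
    text = remove_mentions_and_tags(text)
    text = standardize_accented_chars(text)
    text = expand_contractions(text)
    text = to_lowercase(text)
    text = remove_special_characters(text)
    text = remove_numbers(text)
    text = lemmatize(text, nlp)
    return remove_stopwords(text, nlp)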

References

  1. https://spacy.io/
  2. https://www.w3schools.com/python/python_regex.asp
  3. https://pypi.org/project/contractions/
