Text preprocessing, cleansing and normalization
In Natural Language Processing (NLP) or text analytics, text preprocessing or cleansing is an essential first step. For any NLP task, such as sentiment classification of documents as positive or negative, topic classification, or machine translation, the first step is to prepare the data before feeding it into the classifier model. Since the data in NLP is text, which is unstructured, messy, and noisy, how it is preprocessed has a major impact on the final outcome. Rigorous cleansing of the text is crucial to ensure the model works in your favor, without being weighed down by messy text and useless data as input. As the saying goes: garbage in, garbage out.
This step is also referred to as text cleansing, and it prepares the text in a form that is analyzable by the model. Whether the classifier is unsupervised (e.g. lexicon-based) or supervised (e.g. machine learning classifiers such as SVM or Naive Bayes), the text goes through an initial cleansing phase before it is fed into the model. There is no one-size-fits-all solution or framework for preprocessing; it depends highly on the task at hand. Some examples are given below.
Common Preprocessing Pitfalls
Example 1. In sentiment analysis, negation words such as ‘no’ and ‘not’ are needed to handle polar words that fall under the scope of a negation, and intensifiers such as ‘too’ (as in ‘too good’) can be used by the model to improve overall accuracy. So if you include stop word removal in your preprocessing pipeline, it would strip out those important negation and intensifier words that aid overall model accuracy. A solution is to remove stop words, but retain the ones that matter for negation handling and intensification.
Example 2. Converting all words to lowercase, when the model may benefit from keeping the text as-is, such as boosting the sentiment strength of any term written in all caps, since the writer usually expresses higher emphasis this way (e.g. ‘I absolutely LOVED this movie!’). A solution is to lowercase words but also record whether the all-caps version appeared in the text.
Example 3. Removing punctuation, when the model can use multiple punctuation as a cue to increase sentiment strength, since the writer uses this to express stronger emotion.
Example 4. Removing punctuation, when this can be used to effectively split the text document into sentences for sentence-level analysis, which is more fine-grained.
Example 5. Tokenizing the words to generate a ‘bag of words’ model from the corpus. This way you completely lose the surrounding context of the text, which offers important semantic information that could be exploited.
Example 6. Using raw word frequency as features before classification, when a weighting scheme such as TF-IDF could be applied to assign each word a weight of importance, which can drastically affect the final outcome of classification.
Example 7. Compressing words with emphatic lengthening to their original form, such as converting ‘goood’ to ‘good’ in ‘this movie was so goood!’. This type of lengthening may signal stronger emotion by the writer, and can be used to effectively increase the strength of the sentiment.
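Example 7 above can keep its information instead of losing it: collapse the lengthening but flag that it happened, so the sentiment score of the word can be boosted later. A minimal sketch, assuming a simple regex heuristic (collapsing runs of three or more repeated letters down to two; a dictionary check would be needed to decide cases like ‘soo’ vs ‘so’):

```python
import re

LENGTHENING = re.compile(r"(\w)\1{2,}")  # a character repeated 3+ times

def compress_lengthening(text):
    """Collapse emphatic lengthening ('goood' -> 'good') while flagging it.

    Returns (compressed_text, was_lengthened) so a sentiment model can
    boost the strength of flagged words instead of discarding the cue.
    """
    was_lengthened = bool(LENGTHENING.search(text))
    return LENGTHENING.sub(r"\1\1", text), was_lengthened

print(compress_lengthening("this movie was so goood!"))
# -> ('this movie was so good!', True)
```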
As you can see, there are many issues that may hinder the model from performing with optimal accuracy. The preprocessing steps taken should depend on the particular task at hand, and should be given extra thought.
Text Preprocessing Tasks
Converting to lowercase
Converting all of the text documents in the corpus to lowercase is a good way to catch versions of the same word that differ only in format. For example, this would map both ‘USA’ and ‘Usa’ to ‘usa’. This step is often overlooked but very useful for ensuring the classifier detects all forms of a word, whether lowercase, uppercase, or mixed case.
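Lowercasing and the all-caps cue from Example 2 can be combined in one step. A minimal sketch (the function name and tuple return shape are my own choices, not a library API):

```python
def normalize_case(token):
    """Lowercase a token, but remember whether it was shouted in all caps.

    Returns (lowercased, was_all_caps) so that e.g. 'LOVED' can both be
    matched against a lexicon and have its sentiment strength boosted.
    Single letters ('I') are not counted as shouting.
    """
    was_all_caps = token.isupper() and len(token) > 1
    return token.lower(), was_all_caps

print(normalize_case("USA"))  # -> ('usa', True)
print(normalize_case("Usa"))  # -> ('usa', False)
```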
Stemming
Stemming is a technique used to map all inflected versions of a word to their root form. The root may sometimes not be a real word, but rather a canonical version. Either way, this is a good way to map all forms of inflection to the same root, making it easier for the classifier to detect that they are all related (e.g. ‘helped’, ‘helps’, and ‘helping’ are all reduced to ‘help’). The most common and reliable algorithm for English is the Porter stemmer.
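In practice you would call NLTK's PorterStemmer, since the full Porter algorithm runs to a few hundred lines. As a rough sketch of the suffix-stripping idea only (the suffix list below is illustrative and is not the real Porter rules):

```python
def toy_stem(word):
    """Strip common inflectional suffixes, longest first.

    NOT the Porter algorithm -- just an illustration of the idea that
    'helped', 'helps', and 'helping' should all collapse to 'help'.
    """
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        # Require a stem of at least 3 letters so 'was' doesn't become 'wa'.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("helped", "helps", "helping", "help"):
    print(w, "->", toy_stem(w))  # all four map to 'help'
```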
Lemmatization
Lemmatization is similar to stemming, but rather than just cutting an inflected word down to a root form, it does so in a more principled way, ensuring that the root form is a real word found in a dictionary such as WordNet.
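The key difference from stemming is the dictionary lookup. NLTK's WordNetLemmatizer is the usual choice; a minimal sketch of the lookup idea, using a tiny hypothetical in-memory dictionary in place of WordNet:

```python
# Tiny illustrative lemma dictionary; a real lemmatizer (e.g. NLTK's
# WordNetLemmatizer) looks inflections up in WordNet instead.
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse", "was": "be"}

def lemmatize(word):
    """Return the dictionary form of a word, falling back to the word itself."""
    return LEMMAS.get(word, word)

# Irregular forms are exactly where lemmatization beats suffix-stripping:
print(lemmatize("mice"))  # -> 'mouse' (no stemmer could recover this)
```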
Stop words and frequent words removal
The aim here is to remove stop words, which are highly frequent in English corpora but carry little information for most classification tasks. Removing them drastically reduces the amount of text to be processed, for more efficient classification. A common stop word list is found in NLTK, and includes words such as ‘a’, ‘the’, ‘has’, ‘on’, etc. Removing these low-information words reduces the feature space and helps the classifier analyze only the more important words, for faster results. Another strategy to consider is to simply remove the X% most frequent words in the corpus, which are typically stop words or words with very little impact on classification. Don’t rely solely on predefined lists, since there may be very frequent irrelevant terms in your corpus that are not in the stop word lists you use.
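Both strategies, plus the negation caveat from Example 1, fit in a few lines. A sketch using a tiny illustrative stop word list (NLTK ships a much fuller one) and a keep-list for negations and intensifiers:

```python
from collections import Counter

# Tiny illustrative stop word list (use NLTK's for real work) ...
STOP_WORDS = {"a", "an", "the", "has", "on", "is", "was", "no", "not", "too"}
# ... minus the negations/intensifiers we must keep for sentiment analysis.
KEEP = {"no", "not", "too"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS or t in KEEP]

def most_frequent(docs, top_n):
    """Top-N most frequent tokens across the corpus: candidates for removal
    alongside (or instead of) a predefined stop word list."""
    counts = Counter(t for doc in docs for t in doc)
    return [w for w, _ in counts.most_common(top_n)]

print(remove_stop_words(["the", "movie", "was", "not", "good"]))
# -> ['movie', 'not', 'good']
```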
Tokenization
Tokenization is the process of splitting a text document into smaller pieces called tokens. The document is usually split on white space and punctuation, so that each word, punctuation mark, or number becomes an individual token. This produces the ‘bag of words’ that represents the text document.
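A minimal regex tokenizer shows the idea; real projects would use nltk.word_tokenize or spaCy, which handle contractions and many edge cases this one-liner does not:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word/number tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("I loved it, 10/10!")
print(tokens)  # -> ['I', 'loved', 'it', ',', '10', '/', '10', '!']

# Counting the tokens gives the 'bag of words' representation:
bag_of_words = Counter(tokens)
print(bag_of_words["10"])  # -> 2
```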
Noise removal
Noise removal involves removing anything from the text that is useless or may have a negative impact on the results. It is often said that text analytics is 80% data preparation and cleansing. Noise removal can be crucial, depending on the dataset. Not all text documents are well-written articles: some contain full HTML markup, which has to be removed before the text can be analyzed. So noise can refer to anything irrelevant to the text to be analyzed, including HTML, XML, and other forms of markup and metadata. In social media style text such as tweets or Facebook posts, hashtags, URL links, text in different languages, etc. are all considered noise that can be removed before other preprocessing steps such as tokenization. Removing punctuation, numbers, diacritics, accent marks, out-of-vocabulary words, and other stray symbols irrelevant to the text can also be a good move.
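For social media text, a few regex substitutions cover the common noise sources. The patterns below are illustrative; for real HTML a proper parser such as BeautifulSoup is more robust than a regex:

```python
import re

def strip_noise(text):
    """Remove common markup / social media noise before tokenization."""
    text = re.sub(r"<[^>]+>", " ", text)       # HTML/XML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[#@]\w+", " ", text)       # hashtags and @-mentions
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(strip_noise("<b>Loved it!</b> #bestmovie https://example.com @friend"))
# -> 'Loved it!'
```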
Spelling correction
Due to the informal writing style often found on the web and social media, misspellings and typos are very common. If you are dealing with a formal style of writing, such as that found in news articles or novels, you can skip this step. But when the corpus is a set of social media posts, it is a good idea to run a spell checker and correct as many typos as possible.
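TextBlob's correct() method is one ready-made option. To show the underlying idea with the standard library only, here is a sketch that snaps a word to its closest match in a hypothetical in-domain vocabulary (the vocabulary and cutoff are illustrative choices, not a recommended configuration):

```python
import difflib

# Hypothetical in-domain vocabulary; a real spell checker uses a full
# dictionary plus word frequencies.
VOCAB = ["movie", "great", "acting", "terrible", "plot"]

def correct(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(correct("moive"))  # -> 'movie'
```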
Informal social media terms conversion
Common social media terms, internet slang, neologisms, emoticons, and emojis are very common in social media posts such as tweets. Using readily available predefined lexicons to convert informal words to their formal counterparts can be effective in later stages of analysis. It is worth noting that some social media terms have become standardized and are frequently used, so it may be a good idea to keep these as-is, since they can be helpful in classification.
Part of speech tagging
Part of speech tagging involves tagging each token with its corresponding lexical category, based on its context in the sentence. Lexical categories include noun, verb, adjective, and adverb. This has proven helpful for the classifier to better understand the meaning of a word. An example is ‘good’ used as a noun (a product, as in ‘goods’) versus ‘good’ used as an adjective. A classifier that can differentiate these would benefit in the case of sentiment analysis, for example. This is often referred to as text enrichment, since you are enriching the text with additional information that is helpful for the classifier.
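In practice you would call nltk.pos_tag or spaCy, which use sentence context to disambiguate. The toy lookup tagger below only illustrates the output shape, a list of (token, tag) pairs; the lexicon and tag names here are my own illustrative choices:

```python
# Toy lexicon-based tagger; real taggers (nltk.pos_tag, spaCy) choose the
# tag from context rather than a fixed one-tag-per-word table.
TAG_LEXICON = {
    "the": "DET", "movie": "NOUN", "was": "VERB", "good": "ADJ",
}

def tag(tokens):
    """Pair each token with a part-of-speech tag ('UNK' if unknown)."""
    return [(t, TAG_LEXICON.get(t, "UNK")) for t in tokens]

print(tag(["the", "movie", "was", "good"]))
# -> [('the', 'DET'), ('movie', 'NOUN'), ('was', 'VERB'), ('good', 'ADJ')]
```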
Word sense disambiguation
A sense-disambiguated corpus has proven more effective in certain tasks such as sentiment analysis, where the correct sense of a word can be used rather than the word's general form, to improve results. Popular sentiment lexicons such as SentiWordNet are built on top of the sense-level dictionary WordNet, and can be used to detect the sentiment of senses rather than words.
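The classic disambiguation method is the Lesk algorithm: pick the sense whose gloss overlaps most with the surrounding context. NLTK ships this as nltk.wsd.lesk against real WordNet; below is a toy version with a miniature hypothetical sense inventory, just to show the mechanics:

```python
# Miniature sense inventory in the spirit of WordNet, for illustration only.
SENSES = {
    "bank": {
        "bank.n.01": {"money", "deposit", "account", "loan"},
        "bank.n.02": {"river", "slope", "water", "shore"},
    }
}

def simple_lesk(word, context_tokens):
    """Pick the sense whose gloss words overlap most with the context.

    A toy version of the Lesk algorithm; nltk.wsd.lesk runs the same
    idea against real WordNet glosses.
    """
    context = set(context_tokens)
    return max(SENSES[word], key=lambda s: len(SENSES[word][s] & context))

print(simple_lesk("bank", ["i", "sat", "on", "the", "river", "shore"]))
# -> 'bank.n.02'
```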
General Rule of Thumb
There is no one-size-fits-all in preprocessing. The steps to take depend highly on your dataset. The aim is to cleanse the text and remove as much noise as possible, without also cutting out important terms. These decisions are generally task-dependent. For example, if you have well-written, formal articles that belong to a certain domain, such as news articles or poems, then minimal preprocessing needs to be done. But if you have a set of informal social media posts or tweets that vary in domain and writing style, then rigorous cleansing may be required to strip away as much noise as possible. The entire data analysis pipeline (data collection, preprocessing, model building, validation, etc.) relies on how the initial data is cleansed and transformed from its raw format to a format that is easy for the model to analyze.
Practical text preprocessing tools and applications
All of the steps mentioned above can be done with a few lines of code in readily available tools. Popular Python libraries include NLTK (Natural Language Toolkit), TextBlob (built on NLTK), spaCy, and Gensim; Apache OpenNLP (Java) and RapidMiner are options outside the Python ecosystem, among others.
That’s about it!
What are your thoughts? Please mention anything useful I might have missed in the comments. The aim here is to learn together 🙂