What is Natural Language Processing (NLP)?
NLP is the use of computer science, AI, and computational linguistics to allow computers to interact with and understand natural language. By natural language, I mean any human language, such as English. In its simplest form, NLP enables computers and artificially intelligent machines to understand and manipulate (usually large amounts of) natural language data. It is a discipline that sits at the intersection of natural language and data science, and it has been applied in virtually every domain.
To a computer, natural language is just a bunch of words (strings), each composed of a bunch of characters. Computers cannot understand this directly, since it is meant for human consumption; they need to process the text in order to make sense of it. There are generally three main levels of processing: lexical processing, syntactic processing, and semantic processing.
Lexical processing involves examining the individual words in a text to better understand that text. This is called a bag-of-words approach: words are treated in isolation, so the surrounding context of each word is lost.
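As a minimal sketch of the bag-of-words idea, the snippet below (plain Python, illustrative only) reduces a sentence to word counts, discarding word order entirely:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count occurrences.
    Word order -- and therefore context -- is thrown away."""
    return Counter(text.lower().split())

bag_of_words("The cat sat on the mat")
# -> Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

Note that "The cat sat on the mat" and "the mat sat on the cat" produce the same bag, which is exactly the loss of context described above.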
Syntactic processing involves understanding the syntactic or grammatical structure of a sentence in order to help understand it. This captures more detail than lexical processing, since syntactic information can add meaning to words. For example, a determiner in front of a word means that the word is most likely a noun. Adding syntactic information reduces ambiguity.
Semantic processing is the final level, and involves using semantic knowledge bases to infer the correct meaning of a word based on the meanings of the words surrounding it. Semantic processing is crucial for allowing computers to extract as much information as possible from natural language, but it is also the most complex level of text processing.
NLP overlaps with other areas such as computational linguistics, textual data mining, and text analytics. It overlaps most heavily with computational linguistics, which uses computers to study all aspects of human language, including semantics, syntax/grammar, and phonetics. Ever since the Web became read-write, NLP has been an active area for both academia and industry. It opens new research challenges, and has proven valuable for decision-making in for-profit organizations.
NLP is complicated and challenging
NLP deals with speech and text data. Speech is typically transformed into its corresponding textual form before any NLP tasks are performed on it, so it is generally assumed that NLP primarily deals with text. Text is challenging to deal with due to its unstructured, free-form, noisy format: it does not fit neatly into the standard row-and-column databases used for data mining. This makes it a challenging area to work in, since natural language is meant for human consumption, not for machines.
Humans are good at grasping the meaning underlying a piece of text by understanding the semantic context that surrounds it. This is not easy for machines. Machines can be given syntactic rules and some semantic information in an attempt to recover the true meaning of a text, but even so, they have not yet matched human accuracy. To make things even more challenging, humans have become accustomed to using their own informal styles, especially when communicating on social media platforms. They do not always follow grammatical rules, and intentionally use informal, misspelled words. This informal social media writing can render NLP applications that rely on formal grammatical rules largely ineffective. All of this makes research-based and practical NLP a challenging discipline. Even after half a century, many problems remain unsolved, and research is ongoing.
Another challenge is that text corpora and datasets can quickly become overwhelmingly massive. Data scientists working in NLP and text mining often use big data technologies such as Apache Hadoop and Spark, in combination with NoSQL databases such as MongoDB, to handle massive unstructured text data streams.
The two core problems in NLP are natural language generation (converting from machine language in databases to human readable language), and natural language understanding (enabling computers to understand the underlying meaning of language). Although the tasks in NLP are heavily linked to one another, they are categorized into distinct tasks for convenience. Some of the more prominent tasks/subdisciplines that have been applied to solve real-world problems are as follows.
Stemming: Reducing a word to its root (stem) form, typically by stripping affixes.
Lemmatization: Converting a word to its lemma form as available in a formal dictionary.
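The difference between the two is easy to show with a toy Python sketch. The suffix list and lemma table below are made up for illustration; real systems use something like the Porter stemmer or a dictionary-backed lemmatizer:

```python
def naive_stem(word):
    """Crude one-pass suffix stripping; real stemmers such as
    Porter's apply ordered rewrite rules instead."""
    for suffix in ("ing", "ness", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# A lemmatizer maps words to dictionary forms, handling irregulars
# that stemming cannot (tiny illustrative lookup table):
LEMMAS = {"ran": "run", "better": "good", "mice": "mouse"}

def naive_lemmatize(word):
    return LEMMAS.get(word, naive_stem(word))

naive_stem("walked")       # -> 'walk'
naive_lemmatize("mice")    # -> 'mouse'
```

Stemming alone would leave "mice" untouched, since no suffix rule applies; only a dictionary lookup recovers the lemma "mouse".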
Grammar Induction: Automatically deriving a formal grammar that describes a language's syntax.
Word Segmentation: Also referred to as tokenization in NLP, and used to form a ‘bag of words’, this involves splitting a unit of text into distinct words. This is easy to do in English, but challenging in other languages such as Chinese, where words are not separated by spaces.
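For a whitespace-delimited language like English, a minimal tokenizer is a few lines of Python (the regex is illustrative; production tokenizers also handle contractions, URLs, and so on):

```python
import re

def tokenize(text):
    """\\w+ matches runs of letters/digits; [^\\w\\s] catches
    individual punctuation marks as their own tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokenize("NLP is fun!")  # -> ['NLP', 'is', 'fun', '!']
```

For a language like Chinese, this whitespace-and-punctuation approach simply fails: it returns whole runs of characters as one "word", which is exactly the segmentation problem described above.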
Morphological Segmentation: Involves dissecting words into morphemes, and then recognizing the class of each.
Part of Speech Tagging: Involves tagging each word based on its part of speech or lexical category. The most prominent lexical categories in English are noun, verb, adverb, adjective, preposition, and determiner.
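The determiner heuristic mentioned earlier can be sketched as a toy tagger. The lexicon below is hypothetical and tiny; real taggers are trained statistical or neural sequence models:

```python
# Toy tagger: a tiny hand-made lexicon plus one contextual rule --
# an unknown word following a determiner is probably a noun.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN",
           "barks": "VERB", "loudly": "ADV"}

def tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        t = LEXICON.get(tok.lower())
        if t is None and i > 0 and tags[i - 1] == "DET":
            t = "NOUN"          # determiner rule from above
        tags.append(t or "UNK")
    return list(zip(tokens, tags))

tag("the wug barks loudly".split())
# 'wug' is not in the lexicon, but the rule tags it NOUN
```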
Parsing: Involves grammatical analysis of a given sentence, where the output is a parse tree.
Sentence Breaking: Involves breaking a unit of text into distinct sentences, which is challenging when sentence boundaries are ambiguous, such as when a writer does not use punctuation to end sentences.
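A deliberately naive splitter shows both the idea and the ambiguity problem (illustrative regex; real systems model abbreviations, decimals, and quotations):

```python
import re

def split_sentences(text):
    """Break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

split_sentences("It works. Does it? Yes!")
# -> ['It works.', 'Does it?', 'Yes!']
split_sentences("Dr. Smith arrived.")
# -> ['Dr.', 'Smith arrived.']  (wrong: the period is an abbreviation)
```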
Terminology Extraction: Involves extracting specific terms from a target corpus.
Lexical Semantics: Studying the meaning of words within a specific context.
Machine Translation: Translating one human language (e.g. English) into an entirely different language (e.g. Spanish).
Named Entity Recognition: Extracting words that reflect proper nouns from a given text. These can be people, places, organizations, etc.
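A crude heuristic version can be sketched in Python: keep capitalized tokens that do not begin a sentence. Real NER systems use trained sequence models, and the example sentence below is made up:

```python
import re

def find_entities(text):
    """Toy NER: capitalized tokens not at a sentence start."""
    entities, at_start = [], True
    for tok in re.findall(r"\w+|[.!?]", text):
        if tok in ".!?":
            at_start = True     # next word begins a sentence
            continue
        if tok[0].isupper() and not at_start:
            entities.append(tok)
        at_start = False
    return entities

find_entities("Yesterday Alice visited Paris. Berlin is nice.")
# -> ['Alice', 'Paris']  ('Berlin' is missed at the sentence start)
```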
Question Answering: Automatically providing an answer to a question.
Textual Entailment: Given two pieces of text, determining whether the meaning of the second can be inferred from (is entailed by) the first.
Relationship Extraction: Determining the semantic relationships among the named entities mentioned in a text.
Topic Segmentation: Splitting text into segments according to the topic each one discusses.
Word Sense Disambiguation: Given a polysemous word, determining the specific word sense it corresponds to, based on the context in which it appears. WordNet is a well-known lexical database that records the senses corresponding to each word.
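A classic baseline is the simplified Lesk algorithm: choose the sense whose dictionary gloss shares the most words with the word's context. The two glosses below are invented stand-ins for real WordNet glosses:

```python
# Hypothetical sense inventory: sense label -> gloss words.
SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts deposits money loans",
        "bank/river": "sloping land beside a river water stream",
    }
}

def disambiguate(word, context):
    """Pick the sense with the largest context/gloss word overlap."""
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

disambiguate("bank", "we sat on the bank of the river")
# -> 'bank/river'
```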
Sentiment Analysis: Determine the sentiment, attitude or emotion towards a particular entity in a text. This is often used by organizations to monitor public opinion on the Web and social media towards their products.
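The simplest approach is lexicon-based: count positive and negative words and compare. The tiny word lists below are illustrative; real systems use large sentiment lexicons or trained classifiers:

```python
# Illustrative mini-lexicons, not a real sentiment resource.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment(text):
    """Positive minus negative word count decides the label."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

sentiment("I love this excellent product")  # -> 'positive'
```

This sketch ignores negation ("not good" scores positive), which is one reason trained models outperform raw lexicon counting.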
Emotion Recognition: Similar to sentiment analysis, but rather than a binary positive/negative classification, text is classified according to a set of fundamental emotions (e.g. anger, sadness, joy, surprise, shame).
Coreference Resolution: Check which mentions refer to the same entities. This also subsumes the task of anaphora resolution, which involves checking which entity a pronoun refers to.
Discourse Analysis: Recognizing the discourse structure of a connected text.
Automatic Summarization: Given a text, generate a brief summary of the content.
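A naive extractive summarizer can be sketched by scoring each sentence on the frequency of its words across the whole text and keeping the top ones (assumes English text with sentence-final punctuation; illustrative only):

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Keep the n highest-scoring sentences, in original order.
    A sentence's score is the sum of its words' corpus frequencies."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freqs = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freqs[w] for w in
                                       re.findall(r"\w+", sentences[i].lower())))
    return " ".join(sentences[i] for i in sorted(ranked[:n]))

summarize("Cats sleep. Cats eat cats. Dogs run.")
# -> 'Cats eat cats.'  (the sentence with the most frequent words)
```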
Speech Recognition: Generating the corresponding text representation of an input audio speech. This is typically the first step when dealing with NLP of speech.
Recommendation Systems: Automatically recommend certain products to a user based on a history of their personal preferences.
Applying NLP to solve real-world problems
The above tasks have been directly applied to solve real-world problems in many industries and domains, including finance, health care, government, and customer relationship management. Some use cases of NLP are:
- Reputation and brand monitoring by organizations using sentiment analysis
- Identifying fake news and news that is politically biased or untrustworthy
- Spam detection in emails
- Identifying possible diseases based on previous health records of a patient
- Detection of Covid-19 coronavirus outbreak cluster locations
- Detecting grammatical errors and supplying possible fixes on the fly (e.g. Grammarly)
- Ability to directly understand speech commands and act accordingly (e.g. Alexa)
- Recruiting workers based on identifying their specific skillsets
The future of NLP
NLP is an extremely interesting AI area in terms of both academic research and practical application, and it has been around for over half a century. Ever since Alan Turing proposed the “Turing Test” in 1950 as a way to assess how close a machine has come to true artificial intelligence, there has been significant advancement in the area. One striking example is an AI program writing a novel on its own: 1 the Road, published in July 2018.
But even with all this advancement, many NLP tasks remain unsolved. The future of NLP involves resolving these problems, and then supplying machines with near-human semantic, emotional, and real-world knowledge, so that they can understand and ‘feel’ natural language the same way humans do. Combined with recent advances in big data and machine learning that give machines the ability to ‘learn to learn’, this would equip them to communicate meaningfully and respond based on the conversation (human-computer conversation).