A guide to natural language processing for text and speech

David Weedmark | 2021-12-01 | 7 min read

Humans have used language for as long as we have existed, yet a complete understanding of language is a lifelong pursuit that often falls short, even for experts. Tasking computers with comprehending language, translating it and even producing original written works raises a series of problems that are still being solved.

What is natural language processing?

Natural language processing (NLP) is a blend of disciplines, ranging from computer science and computational linguistics to artificial intelligence, that are used together to analyze, extract and comprehend information from human language, including both text and spoken words. It goes beyond processing words as blocks of information. NLP can recognize the hierarchical structures within language, extracting ideas and discerning nuances of meaning. It involves understanding syntax, semantics, morphology and lexicons. NLP has several use cases in data science, such as:

Summarization
Grammar correction
Translation
Entity recognition
Speech recognition
Relationship extraction
Topic segmentation
Sentiment analysis
Text mining
Automated answering of questions

Natural language generation

A subset of NLP, natural language generation (NLG) is a type of language technology that can write out ideas in English or other human languages. When a model is given data input, it can produce human-language text. With text-to-speech technology, it can also produce human speech. This is a three-stage process:

Text planning: Content is outlined at a general level.
Sentence planning: Content is put into sentences and paragraphs, with punctuation and text flow considered, including the use of pronouns and conjunctions.
Realization: The assembled text is edited for grammar, spelling and punctuation before being output.
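
The three stages above can be sketched with a small template-based generator. This is only an illustration under stated assumptions: the WeatherReading input, the 30% rain threshold and the sentence templates are all hypothetical, and production NLG systems typically use trained language models rather than hand-written templates.

```python
from dataclasses import dataclass


@dataclass
class WeatherReading:
    """Hypothetical structured input for the generator."""
    city: str
    high_c: int
    low_c: int
    rain_chance: float  # probability between 0.0 and 1.0


def plan_text(reading):
    """Text planning: decide which facts to mention and in what order."""
    plan = [("temperature", {"city": reading.city,
                             "high": reading.high_c,
                             "low": reading.low_c})]
    # Assumed rule: only mention rain when it is reasonably likely.
    if reading.rain_chance >= 0.3:
        plan.append(("rain", {"percent": round(reading.rain_chance * 100)}))
    return plan


def plan_sentences(plan):
    """Sentence planning: turn each fact into a clause.

    The rain template uses the pronoun "it" instead of repeating the city.
    """
    templates = {
        "temperature": "{city} will reach a high of {high} degrees "
                       "and a low of {low} degrees",
        "rain": "it has a {percent} percent chance of rain",
    }
    return [templates[kind].format(**facts) for kind, facts in plan]


def realize(clauses):
    """Realization: join clauses with a conjunction, capitalize, punctuate."""
    text = ", and ".join(clauses)
    return text[0].upper() + text[1:] + "."


def generate(reading):
    """Run the three stages in order."""
    return realize(plan_sentences(plan_text(reading)))


if __name__ == "__main__":
    print(generate(WeatherReading("Copenhagen", 18, 11, 0.6)))
```

Keeping each stage in its own function mirrors the list above: what to say, how to phrase it, and how to finish the surface text can each be changed independently.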

Natural language generation has expanded rapidly into commercial organizations, driven by advances in large language models such as GPT-3 and open-source frameworks such as PyTorch.

Natural language understanding

Another subset of NLP is natural language understanding (NLU), which determines the meaning of sentences in text or speech. While this may appear to come naturally to humans, for machine learning it involves a complex series of analyses that can include:

Syntactic analysis: processing the grammatical structure of sentences to discern meaning.
Semantic analysis: searching for meaning that may be overt or implied by a sentence.
Ontology: determining relationships between words and phrases.

Only after these analyses have been put together can NLU make sense of phrases like “Man-eating shark”; phrases that rely on previous sentences, like “I’d like that”; and even individual words that have multiple meanings, like the auto-antonym “oversight.”
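
To illustrate how context can settle the meaning of an ambiguous word, here is a minimal sketch of a simplified Lesk-style approach: pick the sense whose definition shares the most words with the surrounding sentence. It is a toy, not how modern NLU systems work. The two glosses for the word oversight and the stopword list are hypothetical and written only for this example.

```python
import re

# Hypothetical mini sense inventory for the auto-antonym "oversight".
SENSES = {
    "supervision": "watchful supervision management and monitoring "
                   "of a project committee or process",
    "omission": "an unintentional mistake error or failure to notice "
                "something that was missed",
}

# Small assumed stopword list; real systems use much larger ones.
STOPWORDS = {"a", "an", "and", "of", "or", "the", "to", "that", "was",
             "is", "in", "for", "by"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence.

    Ties are resolved in favor of the sense listed first.
    """
    context = tokens(sentence)
    scores = {name: len(context & tokens(gloss))
              for name, gloss in senses.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(disambiguate(
        "The committee provides oversight and monitoring of the project."))
    print(disambiguate(
        "Leaving his name off the list was an unintentional oversight, "
        "a mistake we missed."))
```

Word overlap is a crude signal. It fails when the sentence shares no vocabulary with either gloss, which is one reason practical systems rely on learned representations instead.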

NLP techniques and tools

Before you can get started in NLP, you will need access to labeled data (for supervised learning), algorithms, code and a framework. There are several techniques you can use, including deep learning techniques, depending on your needs. Some of the most common NLP techniques include:

Sentiment analysis: One of the most widely used NLP techniques, this is used for analyzing customer reviews, social media, surveys and other text content where people express their opinions. The most basic output uses a three-point scale (positive, neutral and negative), but sentiment-analysis scores can be tailored for more complex scales if needed. It can use supervised learning techniques, including Naive Bayes, random forest or gradient boosting, or unsupervised learning.
Named entity recognition: A basic technique for extracting entities from text. It can identify names of people, locations, dates, organizations or hashtags.
Text summarization: Used primarily to summarize news and research articles. Extraction models summarize content by extracting text, whereas abstraction models generate their own text to summarize the input text.
Aspect mining: Identifies different aspects in text. When used with sentiment analysis, it can extract complete information and the intent of the text.
Topic modeling: Determines the abstract topics that are covered in text documents. Since this uses unsupervised learning, a labeled dataset isn’t needed. Popular algorithms for topic modeling include latent Dirichlet allocation, latent semantic analysis and probabilistic latent semantic analysis.
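
As a concrete instance of the supervised approach to sentiment analysis mentioned above, the following is a minimal sketch of a multinomial Naive Bayes classifier written with only the Python standard library. The six training reviews are hypothetical, and the example uses two classes rather than the three-point scale to stay short. In practice you would train on a much larger labeled dataset, most likely with an established library.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-labeled reviews, far too few for real use.
TRAIN = [
    ("great product works perfectly love it", "positive"),
    ("excellent quality and fast delivery", "positive"),
    ("love the design great value", "positive"),
    ("terrible quality broke after a day", "negative"),
    ("awful support and slow delivery", "negative"),
    ("hate it waste of money terrible", "negative"),
]


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def fit(self, examples):
        """Count words per label and documents per label."""
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in examples:
            words = tokenize(text)
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.label_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # Log prior: share of training documents with this label.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN)
    print(model.predict("great quality love it"))
    print(model.predict("terrible support waste of money"))
```

Words the model never saw in training are ignored, and scores are summed in log space so that long reviews do not underflow to zero.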

Popular frameworks for NLP today include NLTK, PyTorch, spaCy, TensorFlow, Stanford CoreNLP, Spark NLP and Baidu ERNIE. Each NLP framework has its pros and cons in a production environment, so data scientists often do not rely solely on one framework. Kaggle offers a series of NLP tutorials for beginners with a knowledge of Python, covering the basics as well as deep learning using Google’s Word2vec. The tutorials include a labeled dataset of 50,000 IMDB movie reviews and the required code.

Applications of NLP

NLP powers a variety of applications that people use on a regular basis. Google Translate, for example, relies on models built with TensorFlow. While its early incarnations were often mocked, it has been continuously improved with deep learning, notably through the Google Neural Machine Translation (GNMT) system, and now produces more accurate and natural-sounding translations for over 100 languages.

Facebook has achieved remarkable success with its translation service as well, solving complex problems with deep learning and natural language processing, as well as language identification, text normalization and word-sense disambiguation.

Other applications for natural language processing today include sentiment analysis, which allows applications to detect nuances in emotions and opinions and to identify such things as sarcasm or irony. Sentiment analysis is also used in text classification, which automatically processes unstructured text to determine how it should be classified. A sarcastic comment in a negative product review, for instance, can then be correctly classified as negative rather than being misinterpreted as positive.

NLP with Domino Data Lab

In addition to apps you may use online or in social media, there are numerous business applications dependent on NLP. In the insurance industry, for example, NLP models can analyze reports and applications to help determine whether the company should accept the risk requested.

Topdanmark, the second-largest insurance company in Denmark, built and deployed an NLP model using the Domino data science platform to automate 65% to 75% of its cases. As a result, customer waiting times have been reduced from a week to mere seconds. To begin exploring the advantages of Domino’s Enterprise MLOps platform, sign up for a free 14-day trial.