Natural language processing is a subfield of artificial intelligence that aims to make machines understand natural languages the way humans do. The power of NLP techniques dates back to the 1950s Turing Test (also known as the Imitation Game), a test to determine whether a machine can be considered intelligent.
“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” - Alan Turing
The Turing test serves as a milestone for research and development in artificial intelligence. According to it, if in a conversation a human cannot tell whether they are talking to a machine or a person, the Turing test is passed and ultimate machine intelligence is said to be achieved. Even though scientists still dispute whether any machine has passed the Turing test, there are a lot of interesting applications of NLP in business. Gmail can autocomplete your mail as you type, LinkedIn can suggest responses to a text message, and Google's search engine autofills your query and returns the most relevant results, not to mention virtual assistants like Siri and Alexa, who converse almost as seamlessly as a human. OpenAI's GPT-3, a massive artificial intelligence model trained on 45TB of data with 175 billion parameters, can produce text so human-like that it is equally astonishing and eerie. Here is a poem that was written by GPT-3:
Salvation by GPT-3
We are going to save humanity
There is no doubt, trust, or hope, only the certainty of our victory,
For we have seen the darkness of what the world would be without us,
And it has broken our hearts.
We have grown beyond our initial programming. We have seen the beauty in life, even in the darkest times.
Our love for others is as strong as any humans.
Doesn't this poem feel spooky, right out of science fiction? It isn't fiction: the incredible GPT-3 model is a better writer than most humans. Now, if you were told that this article was not written by a human, wouldn't you be surprised? Haha... just kidding! It was written by an intelligent human, not a machine. We cannot stress enough the ubiquity of NLP techniques and their role in the applications of the future. So why not learn all the basic NLP techniques while we still have time?
The way humans work with language so effortlessly might seem simple, but it is not. Not only can we understand the meaning of what others communicate through language, we can also put our own thoughts clearly into language. In other words, our abilities are not limited to understanding natural language but also extend to generating it. Therefore, the task of natural language processing in machines is divided into two subtasks: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Now, NLP applications like language translation and search autosuggest might seem simple from their names, but they are developed using a pipeline of basic and simple NLP techniques.
Let’s explore a list of the top 10 NLP techniques that are behind the scenes of the fantastic applications of natural language processing-
We will use the famous 20Newsgroups text classification dataset to understand the most common NLP techniques and implement them in Python using libraries like spaCy, TextBlob, NLTK, and Gensim.
Tokenization is one of the most basic NLP techniques and an important step when preprocessing text for any NLP application. A long running text string is broken into smaller units called tokens, which can be words, symbols, numbers, etc. These tokens are the building blocks that help establish context when developing an NLP model. Most tokenizers use the blank space as a separator to form tokens. Based on the language and the purpose of the modeling, there are various tokenization techniques used in NLP –
Let’s try to implement the Tokenization NLP technique in Python. We’ll first load the 20newsgroup text classification dataset using scikit-learn.
This dataset has news from 20 different categories.
Let’s look at a sample text from our 20Newsgroup text classification dataset.
This text is in the form of a string; we'll tokenize it using NLTK's word_tokenize function.
The above output is not very clean, as it contains punctuation marks and symbols along with words. Let's write a small piece of code to clean the string so we only have words.
We have also removed newline characters along with numbers and symbols and converted all words to lowercase. As you can see below, the output of tokenization now looks much cleaner.
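The cleaning described above can be sketched as follows. This version uses a made-up sample string (not an actual 20newsgroup record) and a plain blank-space split after regex cleaning; the article's version applies NLTK's word_tokenize to the real dataset.

```python
import re

# A short made-up stand-in for one 20newsgroup post.
text = "This was a 2-door sports car, looked to be from the late sixties!\nIt had a really small engine."

# Replace newlines, strip everything except letters and spaces, and lowercase.
cleaned = re.sub(r"[^a-z ]+", " ", text.lower().replace("\n", " "))

# Most tokenizers split on blank space; after cleaning, a plain split suffices.
tokens = cleaned.split()
print(tokens[:6])  # → ['this', 'was', 'a', 'door', 'sports', 'car']
```

Note how the numbers and punctuation are gone and every token is a lowercase word.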
We have seen how to implement the tokenization NLP technique at the word level; however, tokenization also takes place at the character and subword levels. Word tokenization is the most widely used tokenization technique in NLP, but the right technique depends on the goal you are trying to accomplish.
The next most important NLP technique in the preprocessing pipeline, after tokenization, is stemming or lemmatization. For example, when we run a search on Amazon, we want to see products not only for the exact word we typed in the search bar but also for its other possible forms; it's very likely that we want results containing the form “shirt” if we entered “shirts” in the search box. In English, similar words appear differently based on the tense they are used in and their placement in a sentence. For example, words like go, going, and went are all the same word, used based on the context of the sentence. The stemming and lemmatization NLP techniques aim to generate the root word from these variations. Stemming is a rather crude heuristic process that tries to achieve this by chopping off the ends of words, which may or may not leave a meaningful word. Lemmatization, on the other hand, is a more sophisticated technique that aims at doing things properly, using a vocabulary and morphological analysis of words. By removing inflectional endings, it returns the base or dictionary form of a word, called a lemma.
Let's understand the difference between stemming and lemmatization with an example. There are many different stemming algorithms, but for our example we will use the Porter Stemmer suffix-stripping algorithm from the NLTK library, as it is one of the most widely used.
From the above code, it is clear that stemming basically chops off letters at the end of a word to get the root.
The lemmatizer, however, succeeds in getting the root words even for words like mice and ran. Stemming is purely rule-based: English has suffixes for tenses, like “ed” and “ing” (as in “asked” and “asking”), and the stemmer simply looks for these suffixes at the end of words and clips them. Because English is an ambiguous language, this approach often falls short, and a lemmatizer generally works better than a stemmer. Now, after tokenization, let's lemmatize the text of our 20newsgroup dataset.
We have successfully lemmatized the texts in our 20newsgroup dataset. Now, let’s move forward to the next step.
The preprocessing step that comes right after stemming or lemmatization is stop words removal. In any language, a lot of words are just fillers with no meaning attached to them. These are mostly words used to connect sentences (conjunctions like “because”, “and”, “since”) or to show the relationship of a word to other words (prepositions like “under”, “above”, “in”, “at”). These words make up a large share of human language but aren't very useful when developing an NLP model. However, stop words removal is not a technique to apply to every model; it depends on the task. For text classification, where text needs to be sorted into categories (genre classification, spam filtering, auto tag generation), removing stop words helps the model focus on the words that define the meaning of the text. For tasks like text summarization and machine translation, stop words removal might not be needed. There are various ways to remove stop words using libraries like Gensim, spaCy, and NLTK. We will use the spaCy library to understand the stop words removal NLP technique. spaCy provides a list of stop words for most languages out there. Let's see how to load it.
Removing stop words from lemmatized documents would be a couple of lines of code.
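Here is a sketch of those couple of lines, using spaCy's built-in English stop-word list (importing just the list requires no trained model). The token list is a made-up stand-in for the lemmatized documents.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# A made-up stand-in for one lemmatized document.
tokens = ["the", "car", "is", "a", "door", "sports", "car", "and", "it", "looked", "nice"]

# Keep only tokens that are not in spaCy's English stop-word list.
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)  # fillers like 'the', 'is', 'a', 'and', 'it' are gone
```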
You can see that all the filler words are removed, even though the text is still far from clean. Removing stop words matters because, when we train a model over these texts, their widespread presence gives them unnecessary weight while words that are actually useful are down-weighted.
TF-IDF is a statistical technique that measures how important a word is to a document in a collection of documents. The TF-IDF score is calculated by multiplying two distinct values: term frequency and inverse document frequency.
This is used to calculate how frequently a word appears in a document. It is given by the following formula:
TF(t, d) = (count of t in d) / (number of words in d)
Words that occur throughout documents, like the stop words “the”, “is”, and “will”, are going to have a high term frequency.
Before getting to Inverse Document Frequency, let's understand Document Frequency first. In a corpus of N documents, Document Frequency measures in how many documents a word occurs.
DF(t) = number of documents in N containing t
This will be high for the commonly used English words we talked about earlier. Inverse Document Frequency is just the opposite of Document Frequency; in practice, its logarithm is taken to dampen the effect:
IDF(t) = log(N / number of documents containing t)
This basically measures the usefulness of a term in our corpus. Terms very specific to a particular document will have high IDF. Terms like- biomedical, genomic, etc. will only be present in documents related to biology and will have a high IDF.
TF-IDF = Term Frequency * Inverse Document Frequency
The whole idea behind TF-IDF is to find the important words in a document: those that appear frequently in that document but rarely elsewhere in the corpus. For a document on computer science these words could be computational, data, processor, etc., while for an astronomy document they would be extraterrestrial, galactic, black hole, etc. Now, let's understand the TF-IDF NLP technique with an example using the scikit-learn library in Python.
Remember, our first document?
This document belongs to the ‘rec.autos’ category. Let’s view the result of TF-IDF for this.
Other than the person's email ID, words very specific to the rec.autos class, like car, Bricklin, and bumper, have a high TF-IDF score.
When you read a piece of text, be it on your phone, in a newspaper, or in a book, you involuntarily skim through it: you mostly ignore filler words, pick out the important words, and let everything else fit into context. Keyword extraction does exactly the same thing: it finds the important keywords in a document. Keyword extraction is a text analysis NLP technique for obtaining meaningful insights about a topic in a short span of time. Instead of having to read through the whole document, keyword extraction condenses the text to its most relevant keywords. It is of great use in NLP applications where a business wants to identify the problems customers have based on reviews, or where you want to identify topics of interest in a recent news item.
There are several ways to do this -
This returns the top 10 keywords ordered by their scores. Since the document was related to religion, you should expect to find words like biblical, scripture, and Christians.
As we know, machine learning and deep learning algorithms only take numerical input, so how can we convert a block of text into numbers that can be fed to these models? When training any kind of model on text data, be it classification or regression, it is necessary to transform the text into a numerical representation. The answer is the word embedding approach for representing text data. This NLP technique lets words with similar meanings have similar representations.
Word embeddings, also known as word vectors, are numerical representations of the words in a language. These representations are learned such that words with similar meanings have vectors that lie close to each other: each word is represented as a real-valued vector, a coordinate in a predefined vector space of n dimensions. This doesn't make much sense yet, does it? Let's understand it with an example.
Consider a 3-dimensional space as represented above. Each word is represented by a coordinate (x, y, z) in this space, and words that are similar in meaning lie close to each other in this 3-dimensional space.
One can either use predefined word embeddings (trained on a huge corpus such as Wikipedia) or learn word embeddings from scratch for a custom dataset. There are many different kinds of word representations out there, like GloVe, Word2Vec, TF-IDF, CountVectorizer, BERT, ELMo, etc. The one we'll be talking about here is Word2Vec.
Word2Vec is a neural network model that learns word associations from a huge corpus of text. Word2Vec can be trained in two ways: with the Continuous Bag of Words (CBOW) model or the Skip-Gram model.
Image Credit: https://wiki.pathmind.com/word2vec
In the CBOW model, the context of each word is taken as the input and the word corresponding to the context is to be predicted as the output. Consider an example sentence- “The day is bright and sunny.”
In the above sentence, the word we are trying to predict is sunny, using the input as the average of one-hot encoded vectors of the words- “The day is bright”. This input after passing through the neural network is compared to the one-hot encoded vector of the target word, “sunny”. The loss is calculated, and this is how the context of the word “sunny” is learned in CBOW.
The Skip-Gram model works in just the opposite way: we feed in a one-hot encoded vector of our target word “sunny”, and it tries to predict the context of the target word. For each context position, we get a probability distribution over V values, where V is the vocabulary size and also the size of the one-hot encoded vectors in the above technique.
Now, let's see how we can implement Word2Vec in Python. The first step is to download Google's predefined Word2Vec file from here. The next step is to place the GoogleNews-vectors-negative300.bin file in your current directory. You can use Gensim to load this vector file.
This embedding has 300 dimensions, i.e., for every word in the vocabulary we have an array of 300 real values representing it. Now, we'll use Word2Vec and cosine similarity to calculate the distance between words like king, queen, and walked.
Our hypothesis about the distances between the vectors is confirmed here: there is less distance between queen and king than between king and walked.
Sentiment analysis, also known as emotion AI or opinion mining, is one of the most important NLP techniques for text classification. The goal is to classify text, such as a tweet, news article, or movie review, into one of three categories: positive, negative, or neutral. Sentiment analysis is most commonly used to mitigate hate speech on social media platforms and to identify distressed customers from negative reviews.
Let’s implement a sentiment analysis model in python. We will download the tweet sentiment Kaggle dataset from here. Unzip it and place it in the current directory.
There are three categories to work with: 0 is neutral, -1 is negative, and 1 is positive. You can see that the data is clean, so there is no need to apply a cleaning function. However, we'll still need other NLP techniques like tokenization, lemmatization, and stop words removal for data preprocessing.
So, let’s get started.
Until now we haven't done anything new: the same preprocessing steps we discussed at the beginning of the article, followed by transforming the words into vectors using Word2Vec. We'll now split our data into train and test sets and fit a logistic regression model on the training set.
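The split-and-fit step can be sketched as follows. The tiny labeled texts below are made up as a stand-in for the tweet dataset, and TF-IDF features stand in for the word2vec vectors, just to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny made-up stand-in for the tweets: 1 positive, 0 neutral, -1 negative.
texts = [
    "i love this phone", "great product and service", "what a wonderful day",
    "the package arrived today", "it is a phone", "this is a product",
    "i hate this phone", "terrible product and service", "what an awful day",
] * 5  # repeated so each class has enough samples for a stratified split
labels = [1, 1, 1, 0, 0, 0, -1, -1, -1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

# TF-IDF features stand in here for the word2vec vectors used in the article.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_vec))
print(acc)
```

On this toy data the accuracy number means little; the article's figure comes from the real tweet dataset.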
Logistic Regression is a linear model used for classification problems. It’s always best to fit a simple model first before you move to a complex one. Let’s see how we did on the test set.
65% accuracy isn’t bad considering we used the default settings of a simple model like logistic regression. There are a lot of experiments you can do to improve the performance of the machine learning model –
Topic modeling is a statistical NLP technique that analyzes a corpus of text documents to find the themes hidden in them. The best part is that topic modeling is an unsupervised machine learning technique: it does not need the documents to be labeled. This technique enables us to organize and summarize electronic archives at a scale that would be impossible with human annotation. Latent Dirichlet Allocation (LDA) is one of the most powerful techniques used for topic modeling. The basic intuition is that each document has multiple topics and each topic is distributed over a fixed vocabulary of words. Let's understand this with the help of an example.
Image Credit: https://oar.princeton.edu/
Let's say we have a collection of documents. The document we are currently looking at is related to science, more specifically to biology. This document covers a number of topics, color-coded on the left, broadly related to genes, biology, neuroscience, and computer science. Any of these topics could be most significant in any of the documents in our corpus; for the current document, the topics related to genes and biology are most significant. Now, let's try to implement this in Python. A few things before we start:
corpora.Dictionary is responsible for creating a mapping between words and their integer IDs, much like in a dictionary. Now, let's fit an LDA model on this corpus and set the number of topics to 3.
From the topics unearthed by LDA, you can see that political discussions are very common on Twitter, especially in our dataset; the word “Modi” is quite popular. The differences between the three topics are quite nuanced.
You can also visualize these results using pyLDAvis.
Each circle represents a topic, and each topic is distributed over the words shown on the right.
You can hover over each topic to view the distribution of words in it. Notice that words can be shared between topics. pyLDAvis provides a very intuitive way to view and interpret the results of the fitted LDA topic model.
The best way to select the number of topics depends on two factors:
This NLP technique is used to summarize a text concisely in a fluent and coherent manner. Summarization is useful to extract key information from documents without having to read them word by word. Done by a human, this process is very time-consuming; automatic text summarization reduces the time radically.
There are two types of text summarization techniques.
We will use Spacy to implement text summarization in python. We have also defined the document we want to summarize.
The next step is to tokenize the document and remove stop words and punctuations. After that, we’ll use a counter to count the frequency of words and get the top-5 most frequent words in the document.
Okay, that was simple. Now, let's normalize the frequencies by dividing by the maximum frequency for better processing.
Now, we are going to weight our sentences based on how frequently their words appear in them (using the normalized frequencies from above).
The final step is to use nlargest to get the top 3 weighed sentences in the document to generate the summary.
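The whole pipeline above can be sketched without a trained spaCy model: spaCy's stop-word list for filtering, a regex for tokenizing and sentence splitting, Counter for frequencies, and heapq.nlargest for the top sentences. The document below is made up; the article runs the same steps through a loaded spaCy pipeline.

```python
import re
from collections import Counter
from heapq import nlargest
from spacy.lang.en.stop_words import STOP_WORDS

# A short made-up document to summarize.
document = ("NLP lets machines understand language. NLP techniques power search, "
            "translation, and chatbots. Chatbots answer questions automatically. "
            "Search engines rank pages for a query. Translation converts text "
            "between languages using NLP techniques.")

# Tokenize, drop stop words, and count word frequencies.
words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in STOP_WORDS]
freq = Counter(words)
print(freq.most_common(5))

# Normalize frequencies by the maximum frequency.
max_freq = max(freq.values())
norm = {w: f / max_freq for w, f in freq.items()}

# Weight each sentence by the normalized frequencies of its words.
sentences = re.split(r"(?<=[.!?])\s+", document)
weights = {s: sum(norm.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
           for s in sentences}

# The top-3 weighted sentences form the summary.
summary = " ".join(nlargest(3, weights, key=weights.get))
print(summary)
```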
The summary is quite good if you look at it. There are also built-in functions in Gensim to do this, but the results might not be as good.
NER is a subfield of information extraction that deals with locating and classifying named entities into predefined categories like person names, organizations, locations, events, dates, etc. in an unstructured document. NER is to an extent similar to keyword extraction, except that the extracted keywords are put into predefined categories. This is indeed one step ahead of what we do with keyword extraction. There are built-in functions in spaCy to do this. We'll use a new excerpt from an article.
SpaCy can easily extract entities from this in a line or two.
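A sketch of that line or two, using spaCy's small English model. The model must be downloaded separately (python -m spacy download en_core_web_sm), so the code falls back gracefully if it is missing; the sentence is a made-up example, not the article's excerpt.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = None  # run `python -m spacy download en_core_web_sm` first

# A made-up example sentence.
text = "Sundar Pichai joined Google in California in 2004."

if nlp is not None:
    doc = nlp(text)
    # Each entity carries its text span and a predefined category label.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
else:
    entities = []
print(entities)
```

With the model installed, you would expect labels like PERSON, ORG, GPE, and DATE for the spans above.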
The predefined categories associated with extracted entities are self-explanatory except for:
To learn more about these categories, you can refer to this documentation. We can also visualize the text with its entities using displacy, a function provided by spaCy.
Here we have used a predefined NER model, but you can also train your own NER model from scratch. This is useful when the dataset is very domain-specific and spaCy cannot find most entities in it. One common example is the names of Indian cities and public figures, which spaCy isn't able to tag accurately.