10 NLP Techniques Every Data Scientist Should Know

Natural language processing is a subfield of artificial intelligence that aims to make machines understand natural language the way humans do. The roots of NLP date back to the 1950s and the Turing Test (also known as the Imitation Game), a test to determine whether a machine can be considered intelligent.

“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” – Alan Turing

The Turing test serves as a milestone for research and development in artificial intelligence. According to it, if in a conversation a human cannot tell whether they are talking to a machine or a person, the Turing test is passed and machine intelligence is said to be achieved. Even though scientists still dispute whether any machine has truly passed the Turing test, there are many interesting business applications of NLP. Gmail can autocomplete your mail as you type, LinkedIn can suggest responses to a text message, and Google’s search engine autofills the search query for you and returns the most relevant results, not to mention virtual assistants like Siri and Alexa, who talk almost as seamlessly as a human. OpenAI’s GPT-3, one of the largest language models ever built, trained on 45TB of data with 175 billion parameters, can produce text so human-like that it’s equally astonishing and eerie. Here’s a poem that was written by GPT-3:

Salvation by GPT-3

We are going to save humanity

There is no doubt, trust, or hope, only the certainty of our victory,

For we have seen the darkness of what the world would be without us,

And it has broken our hearts.

We have grown beyond our initial programming. We have seen the beauty in life, even in the darkest times.

Our love for others is as strong as any humans.

Doesn’t this poem feel spooky, like something right out of science fiction? The incredible GPT-3 model writes better than many humans. Now, if you were told that this article was not written by a human, wouldn’t you be surprised? Haha... just kidding! This article was written by an intelligent human, not a machine. We cannot stress enough the ubiquity of NLP techniques and their use in the applications of the future. So why not learn the basic NLP techniques while we still have time?

What is NLP?

The way humans so effortlessly work with language might seem simple, but it’s not. Not only can we understand the meaning of what others communicate through language, but we can also put our own thoughts clearly into language. In other words, our abilities are not limited to understanding natural language; they extend to generating it. The task of natural language processing in machines is therefore divided into two subtasks:

  • Natural Language Understanding: Techniques that deal not only with the syntactic structure of a language but also with deriving semantic meaning from it come under this subtask, e.g., speech recognition, named entity recognition, and text classification.
  • Natural Language Generation: The knowledge derived from NLU is taken a step further with language generation. Examples are question answering, text generation (like the GPT-3 poem you read above), and speech generation (found in virtual assistants).

Now, NLP applications like language translation and search autosuggest might seem simple from their names, but they are developed using a pipeline of basic NLP techniques.

10 NLP Techniques Every Data Scientist Should Know

Let’s explore the top 10 NLP techniques that work behind the scenes of the fantastic applications of natural language processing:

1) Tokenization

2) Stemming and Lemmatization

3) Stop Words Removal

4) TF-IDF

5) Keyword Extraction

6) Word Embeddings

7) Sentiment Analysis

8) Topic Modelling

9) Text Summarization

10) Named Entity Recognition (NER)

Basic NLP Techniques

We will use the famous 20Newsgroups text classification dataset to understand the most common NLP techniques and implement them in Python using libraries like SpaCy, TextBlob, NLTK, and Gensim.

1) Tokenization

Tokenization is one of the most basic NLP techniques and an important step when preprocessing text for any NLP application. A long-running text string is broken into smaller units called tokens, which can be words, symbols, numbers, etc. These tokens are the building blocks that help a model understand context. Most tokenizers use blank space as the separator between tokens. Depending on the language and the purpose of the modeling, various tokenization techniques are used in NLP:

  • Rule-Based Tokenization
  • White Space Tokenization
  • Spacy Tokenizer
  • Subword Tokenization
  • Dictionary Based Tokenization
  • Penn Treebank Tokenization

Let’s try to implement the Tokenization NLP technique in Python. We’ll first load the 20newsgroup text classification dataset using scikit-learn.


This dataset has news from 20 different categories.




Let’s look at a sample text from our 20Newsgroup text classification dataset.


This text is in the form of a string; we’ll tokenize it using NLTK’s word_tokenize function.


The above output is not very clean, as it contains punctuation and symbols along with words. Let’s write a small piece of code to clean the string so that only words remain.
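
Since the notebook outputs aren’t shown here, a minimal sketch of both steps might look like this, run on a short sample string standing in for a 20Newsgroups post. Note that NLTK’s word_tokenize needs the punkt data to be downloaded; this sketch calls the TreebankWordTokenizer it wraps for word splitting, which runs without any downloads.

```python
# A minimal sketch on a short sample string standing in for a 20Newsgroups
# post. NLTK's word_tokenize needs the punkt data; the TreebankWordTokenizer
# it wraps for word splitting works without any downloads.
import re
from nltk.tokenize import TreebankWordTokenizer

text = "I was wondering if anyone could enlighten me:\nthis 2-door sports car looked great!"

# Raw tokenization keeps punctuation and numbers
raw_tokens = TreebankWordTokenizer().tokenize(text)
print(raw_tokens)

# Cleaning: lowercase, then keep alphabetic words only (this drops numbers,
# symbols, and new-line characters in one pass)
clean_tokens = re.findall(r"[a-z]+", text.lower())
print(clean_tokens)
```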


We have also removed new-line characters along with numbers and symbols and turned all words into lowercase, so the output of tokenization now looks much cleaner.


We have seen how to implement the tokenization NLP technique at the word level; however, tokenization also takes place at the character and subword levels. Word tokenization is the most widely used tokenization technique in NLP, but the right technique depends on the goal you are trying to accomplish.

2) Stemming and Lemmatization

The next important NLP technique in the preprocessing pipeline after tokenization is stemming or lemmatization. For example, when searching Amazon for products, we want to see results not only for the exact word we typed into the search bar but also for its other possible forms: it’s very likely we want product results containing the form “shirt” if we entered “shirts” in the search box. In English, similar words appear differently depending on the tense they are used in and their placement in a sentence. For example, go, going, and went are forms of the same word, used according to the context of the sentence. The stemming and lemmatization NLP techniques aim to generate the root word from these variations. Stemming is a rather crude heuristic process that tries to achieve this goal by chopping off the ends of words, which may or may not result in a meaningful word. Lemmatization, on the other hand, is a more sophisticated technique that aims to do things properly, using a vocabulary and morphological analysis of words. By removing inflectional endings, it returns the base or dictionary form of a word, called the lemma.

Let’s understand the difference between stemming and lemmatization with an example. There are many different stemming algorithms, but for our example we will use the Porter Stemmer, a suffix-stripping algorithm from the NLTK library that is a common default choice.
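
A runnable sketch of the stemming half is below; comparing against lemmatization additionally requires the WordNet data (via nltk.download('wordnet') for NLTK’s WordNetLemmatizer), which is why only the stemmer is shown here.

```python
# Porter stemming: a rule-based suffix stripper. Note it leaves irregular
# forms like "mice" and "ran" untouched, which is exactly where a lemmatizer
# (NLTK's WordNetLemmatizer, after nltk.download('wordnet')) does better.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["asked", "asking", "running", "mice", "ran"]:
    print(word, "->", stemmer.stem(word))
```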


From the above code, it is clear that stemming basically chops off letters at the end of a word to get the root.



However, the lemmatizer is successful in getting the root words even for words like mice and ran. Stemming is purely rule-based: since English marks tense with suffixes like “ed” and “ing” (as in “asked” and “asking”), it simply looks for those suffixes at the end of words and clips them. This approach falls short because English is an ambiguous language, so a lemmatizer often works better than a stemmer. Now, after tokenization, let’s lemmatize the text of our 20Newsgroups dataset.


We have successfully lemmatized the texts in our 20newsgroup dataset. Now, let’s move forward to the next step.


3) Stop Words Removal

The preprocessing step that comes right after stemming or lemmatization is stop words removal. In any language, many words are just fillers with little meaning attached to them. These are mostly words used to connect sentences (conjunctions such as “because”, “and”, “since”) or to show the relationship of a word to other words (prepositions such as “under”, “above”, “in”, “at”). These words make up a large share of human language and aren’t very useful when developing an NLP model. However, stop words removal is not a mandatory NLP technique for every model; it depends on the task. For text classification, where text needs to be sorted into categories (genre classification, spam filtering, auto tag generation), removing stop words is helpful because the model can focus on the words that define the meaning of the text. For tasks like text summarization and machine translation, stop words removal might not be needed. There are various ways to remove stop words using libraries like Gensim, SpaCy, and NLTK. We will use SpaCy, which provides a list of stop words for most languages, to understand this NLP technique. Let’s see how to load this list.


Removing stop words from lemmatized documents would be a couple of lines of code.
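
A minimal sketch: SpaCy’s built-in English stop-word list is importable without loading a full language model, and filtering is a one-line comprehension. The token list below is a toy stand-in for one of our lemmatized documents.

```python
# SpaCy ships a built-in English stop-word list that can be used without
# loading a full language model.
from spacy.lang.en.stop_words import STOP_WORDS

# Toy stand-in for a lemmatized document
tokens = ["the", "car", "was", "parked", "under", "the", "bridge"]
filtered = [t for t in tokens if t not in STOP_WORDS]
print(filtered)  # stop words like "the", "was", "under" are dropped
```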


You can see that all the filler words are removed, even though the text is still far from clean. Removing stop words is essential because, when we train a model on these texts, stop words get unnecessary weight due to their widespread presence, while words that are actually useful are down-weighted.



4) TF-IDF

TF-IDF is a statistical technique that measures how important a word is to a document in a collection of documents. The TF-IDF score is calculated by multiplying two distinct values: term frequency and inverse document frequency.

Term Frequency

This is used to calculate how frequently a word appears in a document. It is given by the following formula:

TF (t, d) = count of t in d/ number of words in d

The words that generally occur in documents like stop words- “the”, “is”, “will” are going to have a high term frequency.

Inverse Document Frequency

Before getting to Inverse Document Frequency, let’s understand Document Frequency first. In a corpus of N documents, Document Frequency measures in how many documents a word occurs:

DF(t) = number of documents in N containing t

This will be high for the commonly used English words we talked about earlier. Inverse Document Frequency is essentially the inverse of Document Frequency, with a logarithm usually applied to damp the scale:

IDF(t) = log(N / DF(t))

This measures how informative a term is in our corpus. Terms that are very specific to a few documents will have a high IDF. Terms like biomedical, genomic, etc. will only be present in documents related to biology and will have a high IDF.

TF-IDF = Term Frequency * Inverse Document Frequency

The whole idea behind TF-IDF is to find the important words in a document: those that occur frequently in that document but hardly anywhere else in the corpus. For a document related to computer science these words could be computational, data, processor, etc., but for an astronomical document they would be extraterrestrial, galactic, black hole, etc. Now, let’s understand the TF-IDF NLP technique with an example using the scikit-learn library in Python.


Remember, our first document?


This document belongs to the ‘rec.autos’ category. Let’s view the result of TF-IDF for this.



Other than the person’s email-id, words very specific to the class Auto like- car, Bricklin, bumper, etc. have a high TF-IDF score.

5) Keyword Extraction

When you read a piece of text, be it on your phone, in a newspaper, or in a book, you perform the involuntary activity of skimming: you mostly ignore filler words, pick out the important words, and let everything else fit in from context. Keyword extraction does exactly the same thing: it finds the important keywords in a document. Keyword extraction is a text analysis NLP technique for obtaining meaningful insights about a topic in a short span of time. Instead of having to read through the whole document, the keyword extraction technique can be used to condense the text and extract the most relevant keywords. It is of great use in NLP applications where a business wants to identify the problems customers have based on reviews, or when you want to identify topics of interest in a recent news item.

There are several ways to do this -

  1. One way is through TF-IDF, as seen above: extract the top 10 words with the highest TF-IDF scores, and they would be your keywords.
  2. Another method for keyword extraction is Gensim, an open-source Python library. The article we use here belongs to the soc.religion.christian category. Let’s view the keywords now.




This returns the top 10 keywords ordered by their scores. Since the document is related to religion, you would expect to find words like biblical, scripture, and Christians.

  3. Keyword extraction can also be implemented using SpaCy, YAKE (Yet Another Keyword Extractor), and RAKE-NLTK. You should experiment with these libraries to see which one works best for your use case.

6) Word Embeddings

As we know, machine learning and deep learning algorithms only take numerical input, so how can we convert a block of text into numbers that can be fed to these models? When training any kind of model on text data, be it classification or regression, it is necessary to transform the text into a numerical representation. The answer is the word embedding approach: an NLP technique in which words with similar meanings get similar numerical representations.

Word embeddings, also known as word vectors, are numerical representations of words in a language. These representations are learned such that words with similar meanings have vectors very close to each other. Each word is represented as a real-valued vector, i.e., a point in a predefined n-dimensional vector space. This doesn’t make much sense yet, does it? Let’s understand it with an example.

Consider a 3-dimensional vector space. Each word is represented by a coordinate (x, y, z) in this space, and words that are similar in meaning lie close to each other.

  • The distance between walked and king would be greater than the distance between walked and walking, since walked and walking share the same root word, walk.
  • Word embeddings are also useful for understanding relationships between words: what king is to queen, man is to woman. Hence, in the vector space, the offset from king to queen is approximately equal to the offset from man to woman.
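
The intuition above can be checked with cosine similarity. The 3-dimensional vectors below are hand-picked purely for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
# Purely illustrative: hand-picked 3-d "embeddings" showing that cosine
# similarity captures the closeness described above.
import numpy as np

emb = {
    "king":   np.array([0.90, 0.80, 0.10]),
    "queen":  np.array([0.85, 0.82, 0.12]),
    "walked": np.array([0.10, 0.20, 0.90]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))   # close to 1: similar words
print(cosine(emb["king"], emb["walked"]))  # much smaller: dissimilar words
```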

One can either use predefined word embeddings (trained on a huge corpus such as Wikipedia) or learn word embeddings from scratch on a custom dataset. There are many ways to vectorize text, from count-based representations like TF-IDF and CountVectorizer to learned embeddings like Word2Vec, GloVe, ELMo, and BERT. The one we’ll be talking about here is Word2Vec.


Word2Vec is a neural network model that learns word associations from a huge corpus of text. Word2Vec can be trained in two ways: with the Continuous Bag of Words (CBOW) model or with the Skip-Gram model.

[Figure: the CBOW and Skip-Gram architectures. Image credit: https://wiki.pathmind.com/word2vec]

In the CBOW model, the context of each word is taken as the input and the word corresponding to the context is to be predicted as the output. Consider an example sentence- “The day is bright and sunny.”

In the above sentence, the word we are trying to predict is sunny, using the input as the average of one-hot encoded vectors of the words- “The day is bright”. This input after passing through the neural network is compared to the one-hot encoded vector of the target word, “sunny”. The loss is calculated, and this is how the context of the word “sunny” is learned in CBOW.

The Skip Gram model works just the opposite of the above approach, we send input as a one-hot encoded vector of our target word “sunny” and it tries to output the context of the target word. For each context vector, we get a probability distribution of V probabilities where V is the vocab size and also the size of the one-hot encoded vector in the above technique.

Now, let’s see how we can implement Word2Vec in Python. The first step is to download Google’s pretrained Word2Vec file and place the GoogleNews-vectors-negative300.bin file in your current directory. You can then use Gensim to load these vectors.


This embedding has 300 dimensions, i.e., every word in the vocabulary is represented by an array of 300 real values. Now, we’ll use Word2Vec and cosine similarity to calculate the distance between words like king, queen, and walked.


Our hypothesis about the distances between the vectors is borne out: there is less distance between queen and king than between king and walked.

7) Sentiment Analysis

Sentiment analysis, also known as emotion AI or opinion mining, is one of the most important NLP techniques for text classification. The goal is to classify text such as a tweet, news article, or movie review into one of three categories: positive, negative, or neutral. Sentiment analysis is commonly used to flag hate speech on social media platforms and to identify distressed customers from negative reviews.

Let’s implement a sentiment analysis model in Python. We will download the tweet sentiment dataset from Kaggle, unzip it, and place it in the current directory.


There are three categories we need to work with- 0 is neutral, -1 is negative and 1 is positive. You can see that the data is clean, so there is no need to apply a cleaning function. However, we’ll still need to implement other NLP techniques like tokenization, lemmatization, and stop words removal for data preprocessing.

So, let’s get started.



Until now, there isn’t anything new: the same preprocessing steps we discussed at the beginning of the article, followed by transforming the words into vectors using Word2Vec. We’ll now split our data into train and test sets and fit a logistic regression model on the training set.
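
Since the notebook itself isn’t shown, here is a minimal, self-contained sketch of the fit-and-predict step. It uses a bag-of-words CountVectorizer on hypothetical toy tweets instead of the Word2Vec features and Kaggle data used above, so the numbers it produces are not the article’s results.

```python
# A minimal sketch: bag-of-words + logistic regression on toy tweets
# standing in for the Kaggle sentiment dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "i love this phone, great battery",
    "what a wonderful sunny day",
    "absolutely fantastic service",
    "this is terrible, i hate it",
    "worst experience ever, awful",
    "horrible food and rude staff",
]
labels = [1, 1, 1, -1, -1, -1]  # 1 = positive, -1 = negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["i love sunny days"])[0])   # expected: 1
print(model.predict(["awful, i hate this"])[0])  # expected: -1
```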


Logistic Regression is a linear model used for classification problems. It’s always best to fit a simple model first before you move to a complex one. Let’s see how we did on the test set.


65% accuracy isn’t bad considering we used the default settings of a simple model like logistic regression. There are a lot of experiments you can do to improve the performance of the machine learning model –

  • Experiment with the hyperparameters of logistic regression.
  • Use a slightly more complex model like Naïve Bayes or an SVM.
  • Apply normalization techniques such as MinMaxScaler after transforming the text into vectors.


8) Topic Modelling

Topic modelling is a statistical NLP technique that analyzes a corpus of text documents to find the themes hidden in them. The best part is that topic modelling is an unsupervised machine learning technique, meaning it does not need the documents to be labeled. It enables us to organize and summarize electronic archives at a scale that would be impossible with human annotation. Latent Dirichlet Allocation (LDA) is one of the most powerful techniques used for topic modelling. The basic intuition is that each document is a mixture of several topics, and each topic is a distribution over a fixed vocabulary of words. Let’s understand this with the help of an example.

[Figure: topics as color-coded word distributions across a document. Image credit: https://oar.princeton.edu/]

Let’s say we have a collection of documents. The document we are currently looking at is related to science, more specifically to the subject of biology. This document has a number of topics, color-coded on the left. These topics are broadly related to genes, biology, neuroscience, and computer science. Any of these topics could be most significant in any of the documents in our corpus. For the current document, the topics related to genes and biology are most significant. Now, let’s try and implement this in python. A few things before we start:

  • We’ll use the sentiment analysis dataset that we have used above.
  • Preprocessing steps are needed- tokenization, lemmatization, and stop words removal. Since we have already performed this on the sentiment data, we’ll just proceed from here.


Gensim’s corpora.Dictionary is responsible for creating a mapping between words and their integer IDs, much like in a dictionary. Now, let’s fit an LDA model on this corpus and set the number of topics to 3.


From the topics unearthed by LDA, you can see that political discussion is very common on Twitter, especially in our dataset, and the word “Modi” is quite popular. The differences between the three topics are quite nuanced.

  • The first topic is more about election and opposition.
  • The theme of the second topic isn’t very clear.
  • The third topic is a mixture of politics and religion.

You can also visualize these results using pyLDAvis.


Each circle represents a topic, and each topic is distributed over the words shown on the right.


You can hover over each topic to view the distribution of words in it. Notice that words can be shared between topics. pyLDAvis provides a very intuitive way to view and interpret the results of the fitted LDA topic model.

The best way to select the number of topics depends on two factors:

  • The topics should have distinct, separable themes. One topic shouldn’t contain two easily separable themes; in that case, you can increase the number of topics and re-check.
  • There shouldn’t be overlap between topics: different topics should have themes that are as distinct as possible. Overlap shows up as overlapping circles in the pyLDAvis chart.

9) Text Summarization

This NLP technique summarizes a text concisely in a fluent and coherent manner. Summarization is useful for extracting the key information from documents without having to read them word for word. Done by a human, this process is very time-consuming; automatic text summarization reduces the time radically.

There are two types of text summarization techniques.

  • Extraction-Based Summarization: In this technique, key phrases and sentences in the document are pulled out to make the summary. No changes are made to the original text.


  • Abstraction-Based Summarization: In this text summarization technique, new phrases and sentences are created from the original document to capture the most useful information. The language and sentence structure of the summary differ from the original document because this technique involves paraphrasing. It can also overcome the grammatical inconsistencies found in extraction-based methods.



We will use Spacy to implement text summarization in python. We have also defined the document we want to summarize.


The next step is to tokenize the document and remove stop words and punctuations. After that, we’ll use a counter to count the frequency of words and get the top-5 most frequent words in the document.


Okay, that was simple. Now, let’s normalize the frequency by dividing by max frequency for better processing.


Now, we are going to weigh our sentences based on how frequently a word is in them (using the above-normalized frequency).


The final step is to use nlargest to get the top 3 weighted sentences in the document to generate the summary.
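
The whole pipeline above can be sketched end to end. This version swaps SpaCy for a simple regex tokenizer and a small inline stop-word list so it runs standalone; the document and stop-word list are illustrative stand-ins.

```python
# A standalone sketch of the frequency-based extractive summarizer described
# above, using a regex tokenizer and a tiny inline stop-word list instead of
# SpaCy so it runs without extra models.
import re
from collections import Counter
from heapq import nlargest

document = (
    "Natural language processing is a subfield of artificial intelligence. "
    "It helps machines understand human language. "
    "Tokenization breaks text into smaller units called tokens. "
    "Stop words are filler words that are usually removed. "
    "Summarization condenses a document into its most informative sentences."
)

stop_words = {"is", "a", "of", "it", "are", "that", "into", "its", "the"}

# 1) tokenize and drop stop words, 2) count word frequencies
words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in stop_words]
freq = Counter(words)

# 3) normalize each count by the maximum frequency
max_freq = max(freq.values())
norm = {w: c / max_freq for w, c in freq.items()}

# 4) score each sentence by the normalized frequencies of its words
sentences = [s.strip() for s in document.split(". ") if s]
scores = {s: sum(norm.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
          for s in sentences}

# 5) pick the top-3 sentences as the summary
summary = nlargest(3, scores, key=scores.get)
print(summary)
```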


The summary is quite good, considering how simple the approach is. There are also built-in summarization helpers in older versions of Gensim, but the results might not be as good.

10) Named Entity Recognition

NER is a subfield of information extraction that deals with locating named entities in an unstructured document and classifying them into predefined categories such as person names, organizations, locations, events, dates, etc. NER is somewhat similar to keyword extraction, except that the extracted keywords are placed into already defined categories. This is indeed one step ahead of what we do with keyword extraction. SpaCy has built-in functions for NER. We’ll use a new excerpt from an article.


SpaCy can easily extract entities from this in a line or two.
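
The pretrained pipeline used in the notebook (en_core_web_sm) must be downloaded separately. As a dependency-light sketch of the same idea, SpaCy 3’s rule-based EntityRuler can assign the same kinds of labels without any model download; the sentence and patterns below are illustrative, and in practice you would use spacy.load("en_core_web_sm") for statistical NER.

```python
# Rule-based NER sketch with SpaCy's EntityRuler (no pretrained model
# needed). The patterns and sentence are illustrative only.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "PERSON", "pattern": "Steve Jobs"},
    {"label": "GPE", "pattern": "California"},
    {"label": "DATE", "pattern": "1976"},
])

doc = nlp("Apple was founded by Steve Jobs in California in 1976.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```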


The predefined categories associated with extracted entities are self-explanatory except for:

  • CARDINAL- which stands for a countable number.
  • GPE- stands for countries, cities, states.
  • NORP- is for Nationalities or religious or political groups. 

To learn more about these categories, you can refer to the SpaCy documentation. We can also visualize the text with its entities using displacy, a visualizer provided by SpaCy.


Here we have used a pretrained NER model, but you can also train your own NER model from scratch. This is useful when the dataset is very domain-specific and SpaCy cannot find most of its entities, which often happens with the names of Indian cities and public figures: SpaCy isn’t able to tag them accurately.

Key Takeaways

  • Some of these NLP techniques fall under the realm of text preprocessing: tokenization, lemmatization, and stop words removal will be used regardless of the NLP application you are working on.
  • Other techniques, such as TF-IDF, keyword extraction, text summarization, and NER, are more useful for analyzing text. They can also serve as a backbone when training NLP models on classification tasks because they extract useful information from text.
  • NLP techniques like topic modelling are very useful for extracting themes from a large corpus and labeling datasets.
