NLP Project on LDA Topic Modelling Python using RACE Dataset

Use the RACE dataset to extract the dominant topic from each document and perform LDA topic modeling in Python.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the problem statement
What kind of text cleaning is needed and how to do it
What tokenization and lemmatization are
Performing EDA on documents: word and POS counts, most frequent words
Types of vectorizers, such as TF-IDF and CountVectorizer
Understanding the basic math and workings behind various topic modeling algorithms
Implementing topic modeling algorithms such as LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and NMF (Non-Negative Matrix Factorization)
Hyperparameter tuning using GridSearchCV
Analyzing the top words for each topic and the top topics for each document
Distribution of topics over the entire corpus
Visualizing the distribution of topics using t-SNE
Visualizing the top words in a topic using WordCloud
Visualizing the distribution of topics and the occurrence and weight of words using the interactive tool pyLDAvis
Comparing and checking the distribution of topics using metrics such as perplexity and coherence score
Training and predicting on documents using LDA and NMF in modular code run as a Python script

Project Description

Business Context 

With the advent of big data, machine learning, and natural language processing, extracting the topic or set of topics that a document is about has become the need of the hour. Imagine having to go through thousands of documents and sort them into 10 to 15 buckets. How tedious would that be?

Topic modeling removes this manual effort: with the help of natural language processing and text mining, each document can be categorized under a certain topic.

The underlying expectation is that logically related words co-occur in the same document more often than words from different topics. For example, a document about space is more likely to contain words such as planet, satellite, universe, galaxy, and asteroid, whereas a document about wildlife is more likely to contain words such as ecosystem, species, animal, plant, and landscape. A topic is a cluster of words that frequently occur together. A topic model can connect words with similar meanings and distinguish between uses of words with multiple meanings.

A sentence or a document is made up of numerous topics and each topic is made up of numerous words.

Data Overview

The dataset has roughly 25,000 documents containing words of various parts of speech, such as nouns, adjectives, verbs, prepositions, and many more. Document length also varies widely, from a minimum of around 40 words to a maximum of around 500 words. The data is split so that 90% is used for training and the remaining 10% is held out to see how topics are predicted on unseen documents.
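The 90/10 split described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `split_documents`, the seed, and the placeholder corpus are assumptions, and scikit-learn's `train_test_split` would serve the same purpose.

```python
import random


def split_documents(documents, test_fraction=0.1, seed=42):
    """Shuffle the documents and hold out `test_fraction` of them as unseen data."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled documents are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in corpus; the real project would load the RACE documents here.
documents = [f"document number {i}" for i in range(100)]
train_docs, unseen_docs = split_documents(documents)
print(len(train_docs), len(unseen_docs))  # 90 10
```

Fixing the seed keeps the held-out set the same across runs, so results from different algorithms stay comparable.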


The aim is to extract the dominant topic from each document and perform topic modeling.

Tools and Libraries

We will use Python for all operations.

The main libraries used are:

  • Pandas for data manipulation and aggregation
  • Matplotlib and Bokeh for visualizing how documents are structured
  • NumPy for computationally efficient operations
  • scikit-learn and Gensim for topic modeling
  • NLTK for text cleaning and preprocessing
  • t-SNE and pyLDAvis for visualizing topics


Topic EDA

  • Top words within topics using WordCloud
  • Topic distribution using t-SNE
  • Topic distribution and word importance within topics using the interactive tool pyLDAvis
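As a rough illustration of the t-SNE step, the sketch below projects a document-topic matrix to two dimensions with scikit-learn's `TSNE`. The matrix here is randomly generated as a stand-in for the output of a fitted topic model, and the sizes (60 documents, 10 topics) and the perplexity value are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical document-topic matrix: 60 documents, 10 topics, each row sums to 1.
# In the project this would come from a fitted LSA, LDA, or NMF model.
doc_topic = rng.dirichlet(alpha=np.full(10, 0.3), size=60)

# Project the topic distributions to 2-D.
# The perplexity must be smaller than the number of documents.
tsne = TSNE(n_components=2, perplexity=15, init="random", random_state=0)
embedding = tsne.fit_transform(doc_topic)

# The dominant topic of each document is a natural colour for a scatter plot.
dominant_topic = doc_topic.argmax(axis=1)
print(embedding.shape, dominant_topic[:5])
```

Plotting `embedding[:, 0]` against `embedding[:, 1]`, coloured by `dominant_topic`, with Matplotlib or Bokeh then shows how well the topics separate.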

Documents Pre-processing 

  • Lowercasing all words in the documents and removing everything except alphabetic characters.
  • Tokenizing each sentence, lemmatizing each word, and storing it in a list only if it is not a stop word and is longer than 3 characters.
  • Joining the list back into a document, while also keeping the lemmatized tokens for NMF topic modeling.
  • Transforming the pre-processed documents using the TF-IDF or Count Vectorizer, depending on the chosen algorithm.
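The cleaning steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the stop-word set is a tiny hypothetical sample, and `identity_lemmatizer` is a placeholder that returns words unchanged. A real run would plug in a full stop-word list and an actual lemmatizer, for example from NLTK.

```python
import re

# Tiny illustrative stop-word set (an assumption); a full list would be used in practice.
STOP_WORDS = {"the", "and", "that", "this", "with", "from", "have", "were", "about"}


def identity_lemmatizer(word):
    """Placeholder lemmatizer: returns the word unchanged."""
    return word


def preprocess(document, lemmatize=identity_lemmatizer, min_length=4):
    """Lowercase, keep letters only, tokenize, lemmatize, and filter tokens."""
    # Lowercase, then replace every run of non-letters with a single space.
    text = re.sub(r"[^a-z]+", " ", document.lower())
    tokens = []
    for token in text.split():
        lemma = lemmatize(token)
        # Keep words that are not stop words and are longer than 3 characters.
        if lemma not in STOP_WORDS and len(lemma) >= min_length:
            tokens.append(lemma)
    return tokens


tokens = preprocess("The 3 satellites were orbiting the planet, with 2 moons!")
clean_document = " ".join(tokens)  # joined form for the vectorizers
print(tokens)  # ['satellites', 'orbiting', 'planet', 'moons']
```

Keeping both `tokens` and `clean_document` mirrors the list above: the joined string feeds the vectorizers, while the token list remains available for the NMF step.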

Topic Modelling Algorithms

  • Latent Semantic Analysis or Latent Semantic Indexing (LSA)
  • Latent Dirichlet Allocation (LDA)
  • Non-Negative Matrix Factorization (NMF)
  • Coherence score, a popular topic modeling metric
  • Predicting a set of topics and the dominant topic for each document
  • Running a Python script end to end from the Command Prompt

Code Overview

  1. The complete dataset is split into 90% for training and 10% for predicting on unseen documents.
  2. Preprocessing is done to reduce noise:
  • Lowercasing all words, replacing words with their normal form, and keeping only alphabetic characters.
  • Building a new document after tokenizing each sentence and lemmatizing every word.
  3. For LSA and LDA topic modeling:
  • The TF-IDF Vectorizer (for LSA) and the CountVectorizer (for LDA) are fitted and applied to the clean documents, and topics are extracted using the scikit-learn LSA and LDA implementations respectively, with 10 topics for both algorithms.
  4. For NMF topic modeling:
  • The TF-IDF Vectorizer is fitted and applied to the clean tokens, and 13 topics are extracted; this number was chosen using the coherence score.
  5. The topic distribution is analyzed using the t-SNE algorithm and the interactive tool pyLDAvis.
  6. For unseen documents, topics are predicted using the above three algorithms.
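The TF-IDF based steps above might look roughly like the following. This sketch assumes LSA is implemented with scikit-learn's `TruncatedSVD` (the usual way to run LSA in that library) and uses `NMF` from the same package; the toy corpus and the 2 topics are placeholders for the real documents and the 10 and 13 topics mentioned above.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned training documents.
train_docs = [
    "planet satellite galaxy asteroid universe",
    "galaxy universe planet orbit satellite",
    "asteroid orbit planet galaxy telescope",
    "species animal plant ecosystem landscape",
    "ecosystem animal species forest plant",
    "plant forest landscape animal species",
]
unseen_docs = ["telescope planet galaxy", "forest animal species"]

# Fit TF-IDF on the training documents only, then reuse it for unseen ones.
tfidf = TfidfVectorizer()
train_tfidf = tfidf.fit_transform(train_docs)
unseen_tfidf = tfidf.transform(unseen_docs)

# LSA: truncated SVD of the TF-IDF matrix.
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa.fit(train_tfidf)
lsa_scores = lsa.transform(unseen_tfidf)

# NMF: non-negative factorization of the same matrix.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
nmf.fit(train_tfidf)
nmf_scores = nmf.transform(unseen_tfidf)

# LSA scores can be negative, so rank topics by absolute value;
# NMF scores are non-negative and can be compared directly.
lsa_dominant = np.abs(lsa_scores).argmax(axis=1)
nmf_dominant = nmf_scores.argmax(axis=1)
print("LSA:", lsa_dominant, "NMF:", nmf_dominant)
```

Unlike LDA, neither model returns probabilities, so the scores are only meaningful relative to one another within a document.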

Curriculum For This Mini Project

Introduction - Problem Statement
Splitting documents into train test
Cleaning the documents
EDA on documents on top words and length of docs
Understanding Topic Modeling LSA and TFIDF Vectorizer
Distribution of topics over documents and words over topics
Visualizing topics distribution using TSNE
Visualizing top occurring words in topics using WordCloud
Predictions on unseen documents using LSA
Understanding Topic Modeling LDA and Count Vectorizer
Training the model using LDA and checking metrics
Finding optimal parameters using GridSearchCV
Visualizing topics distribution using TSNE and pyLDAvis
Understanding popular topic modeling metric
Understanding Topic Modeling NMF
Finding optimal parameters using Coherence Score
Visualizing topics distribution and words relevance using pyLDAvis
Modular Code Overview and training and predicting topics using NMF and LDA