NLP Project on LDA Topic Modelling in Python using the RACE Dataset

Use the RACE dataset to perform LDA topic modeling in Python and extract the dominant topic from each document.


Project Template Outcomes

Understanding the problem statement
How and what kind of text cleaning needs to be done
What tokenization and lemmatization are
Performing EDA on document word counts, POS counts, and most frequent words
Types of vectorizers such as TF-IDF and CountVectorizer
Understanding the basic math and the working behind various Topic Modeling algorithms
Implementation of Topic Modeling algorithms such as LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and NMF (Non-Negative Matrix Factorization)
Hyperparameter tuning using GridSearchCV
Analyzing top words for topics and top topics for documents
Distribution of topics over the entire corpus
Visualizing distribution of topics using t-SNE
Visualizing top words in a topic using WordCloud
Visualizing the distribution of topics and the occurrence and weight of words using the interactive tool pyLDAvis
Comparing and checking the distribution of topics using metrics such as Perplexity and Coherence Score
Training models and predicting topics for documents using LDA and NMF in modular Python scripts.


Unlimited 1:1 Live Interactive Sessions

60-minute live session
Schedule 60-minute live interactive 1-to-1 video sessions with experts.
No extra charges
Unlimited number of sessions with no extra charges. Yes, unlimited!
We match you to the right expert
Give us 72 hours prior notice with a problem statement so we can match you to the right expert.
Schedule recurring sessions
Schedule recurring sessions weekly, bi-weekly, or monthly.

Pick your favorite expert
If you find a favorite expert, schedule all future sessions with them.
Use the 1-to-1 sessions to
- Troubleshoot your projects
- Customize our templates to your use-case
- Build a project portfolio
- Brainstorm architecture design
- Bring any project, even from outside ProjectPro
- Mock interview practice
- Career guidance
- Resume review



Benefits

250+ end-to-end project solutions

Each project solves a real business problem from start to finish. These projects cover the domains of Data Science, Machine Learning, Data Engineering, Big Data and Cloud.

15 new projects added every month

New projects every month to help you stay up to date with the latest tools and tactics.

500,000 lines of code

Each project comes with verified and tested solutions including code, queries, configuration files, and scripts. Download and reuse them.

600+ hours of videos


Cloud Lab Workspace


Unlimited 1:1 sessions


Technical Support

Chat with our technical experts to solve any issues you face while building your projects.

7 Days risk-free trial

We offer an unconditional 7-day money-back guarantee. Use the product for 7 days and if you don't like it, we will issue a full refund. No terms or conditions.

Payment Options

0% interest monthly payment schemes available for all countries.


Testimonials

As a student looking to break into the field of data engineering and data science, one can get really confused as to which path to take. Very few ways to do it are Google, YouTube, etc. I was one of them too, and that's when I came across ProjectPro while watching one of the SQL videos on the E-Learning Bridge YouTube channel. One of the standout features was that it featured real projects on topics I just read about, across different job descriptions at the time. The main issue was the right path to guide us in using these tools and adding to the resume, and that's exactly what ProjectPro got me through. The fact that I can have a reliable route and videos explaining each tool in detail really motivated me to continue with the platform. Another thing we all struggle with is how to really connect with someone if we're stuck somewhere because there are so many solutions. But this has also been solved by experts we can chat with and believe me when I say this they will do whatever it takes to solve your problem even if it takes longer than expected. In my sophomore year of college and getting hands-on exposure to technologies like PySpark, NLP, Kafka, etc, and being able to really apply the theory and work on a project from start to finish really boosted my confidence in general!

Savvy Sahai

Data Science Intern, Capgemini

I come from a background in Marketing and Analytics and when I developed an interest in Machine Learning algorithms, I did multiple in-class courses from reputed institutions though I got good theoretical knowledge, the practical approach, real word application, and deployment knowledge were missing. ProjectPro helped me bridge that gap. ProjectPro has real-time projects that helped me improve my skills. What I liked most is that I get exposure to so many projects, given the work nature I wouldn't have gotten exposure to such a variety of projects and their approaches. It is helping me apply knowledge to other projects too. I highly recommend ProjectPro to everyone who wants to excel in their DataScience career.

Ameeruddin Mohammed

ETL (Abintio) developer at IBM

I am the Director of Data Analytics with over 10+ years of IT experience. I have a background in SQL, Python, and Big Data working with Accenture, IBM, and Infosys. I am looking to enhance my skills in Data Engineering/Science and hoping to find real-world projects fortunately, I came across Project Pro. Project Pro helped me by providing an in-depth explanation of the end-to-end real-world data engineering projects. From data extraction, transformation, and storage up to data visualization. I learned more about Kafka, AWS, NI-FI, and Spark. Thru the help of the knowledge I gained from Project Pro, I was able to do well in the coding exams, interview and helped me land a job at EY. I will recommend every aspiring data professional as well as existing data science/engineer expert to try Project Pro to enhance their knowledge.

Ed Godalle

Director Data Analytics at EY / EY Tech

I come from Northwestern University, which is ranked 9th in the US. Although the high-quality academics at school taught me all the basics I needed, obtaining practical experience was a challenge. This is when I was introduced to ProjectPro, and the fact that I am on my second subscription year only goes to prove that the ROI is satisfactory. I managed to switch to analytics companies, only because of the relevant practical experience this product served me with. I now work at a leading healthcare startup as a Senior Analytics Consultant. I am a customer who is not only satisfied with ProjectPro but also mighty impressed by how Dezyre bends over backward to ensure customer satisfaction. I have had a couple of interactions with Binny and each time I was left happy and content. I also had a conversation with their investors, and I was really glad to articulate my appreciation of the product. They not only have enterprise-grade projects, but also set up 1:1 sessions with seasoned experts in case we get stuck, or are having trouble understanding a certain concept. As the cherry on the icing, there are experts to guide you with resume writing and interview preparation as well, to culminate the whole process of making you job-ready. Kudos to ProjectPro!

Abhinav Agarwal

Graduate Student at Northwestern University

ProjectPro is a unique platform and helps many people in the industry to solve real-life problems with a step-by-step walkthrough of projects. A platform with some fantastic resources to gain hands-on experience and prepare for job interviews. I would highly recommend this platform to anyone looking to upskill and stay updated with the latest projects and solutions. Overall this platform is awesome and worth the money spent as we get a lot of value out of it and helps soar our career to greater heights.

Anand Kumpatla

Sr Data Scientist @ Doubleslash Software Solutions Pvt Ltd

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell,Oracle, and Arthur Andersen(Accenture) in the US. I have taken Big Data and Hadoop,NoSQL, Spark, Hadoop Admin, Hadoop projects. I have been happy with every project. They have really brought me into the forefront of Data Science and Big data. I would recommend this to everyone. It is more than worth the price. After working with them I feel so much more employable for current projects.

Ray han

Tech Leader | Stanford / Yale University

Having worked in the field of Data Science, I wanted to explore how I can implement projects in other domains, So I thought of connecting with ProjectPro. A project that helped me absorb this topic was "Credit Risk Modelling". To understand other domains, it is important to wear a thinking cap and that's where ProjectPro helped me. I also got a chance to talk to experts who have worked on these domains - they helped me by walking through the project. Kudos to the ProjectPro team!

Gautam Vermani

Data Consultant at Confidential

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data. In each learning path, there are many customized projects with all the details from the beginner to the expert. As a new data science learner, you can just follow these projects to master the important techniques quickly. It is really helpful for both my research and job searching. Hope you can come and join ProjectPro to win a great future for yourself.

Jingwei Li

Graduate Research assistance at Stony Brook University


Comparison with other platforms

We provide ready-made project templates that solve real business problems end-to-end and come with solution code, explanation videos, a cloud lab environment, and tech support.

End-to-end implementation

Real industry-grade projects by industry experts

Ready-made solutions to real business problems

Detailed Explanations

Courses/ Tutorials

Our expert panel

Ana Garcia

Director of Data Science & Analytics, ZipRecruiter

Ted Anderson

Director of Business Intelligence, CouponFollow

Kai Tarafdar

NLP Engineer, Speechkit

Benjamin Larson

Principal Data Scientist - Cyber Security Risk Management, Verizon

Victoria Williams

Senior Data Engineer, Hogan Assessment Systems

Kirk Borne

Chief Science Officer at DataPrime, Inc.

Tory Borsboom-Hanson

Data Science Consultant, Fractal Analytics

Shaurya Uppal

Data Scientist, Inmobi

Divya Sistla

Data Engineering Lead - Uber

Saniya Zahid

Principal Software Engineer, Afiniti

Amedeo Biolatti

Data Scientist, SwissRe

Sara Beck

Head of Data Science, Slated

Muhy Eddin Zater

Senior Data Scientist, Mawdoo3 Ltd

Diego Argueta

Senior Data Platform Engineer, GoodRx

Guang Yang

Senior Applied Scientist, Amazon

Kedar Kanhere

Data Scientist, Credit Suisse

Gareth Morinan

Chief Scientific Officer, Machine Medicine Technologies

Manoj Kumar

Data Scientist, Boeing

Anh Le

Data and Blockchain Professional

Balram Singh

Data Engineering Manager, Microsoft Corporation

Dina Jankovic

Data Science, Yelp

Camille Girabawe

Machine Learning Manager, Adobe

Carlos Contreras

Big Data & Analytics architect, Amazon

Stefan Jenkins

Data Engineer, Microsoft

Bertil Hatt

Head of Data science, OutFund

Varun Jain

Senior Data Engineer, Publicis Sapient

James Briggs

Dev Advocate, Pinecone and Freelance ML

Mir Muntasar Ali Agha

Senior Data Engineer, National Bank of Belgium

Shraddha Surana

Global Data Community Lead | Lead Data Scientist, Thoughtworks

Pawan Kumar Yerravelly

Data Engineer - Capacity Supply Chain and Provisioning, Microsoft India CoE

Brian Zhu

Big Data Engineer, Beyond Limits

Mehmet Akgun

University of Economics and Technology, Instructor

Deepak Sahu

Senior Data Engineer, Slintel-6sense company


Project Description

Business Context

With the advent of big data, machine learning, and natural language processing, it has become the need of the hour to extract the topic or set of topics that a document is about. Imagine having to go through thousands of documents and sort them into 10 to 15 buckets. How tedious would that be?

Topic modeling addresses this: instead of manually going through numerous documents, natural language processing and text mining are used to assign each document to a topic.

The underlying expectation is that logically related words co-occur in the same document more often than words from different topics. For example, a document about space is more likely to contain words such as planet, satellite, universe, galaxy, and asteroid, whereas a document about wildlife is more likely to contain words such as ecosystem, species, animal, plant, and landscape. A topic is a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings.

A document is made up of a mixture of topics, and each topic is made up of a collection of words.
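The "documents are mixtures of topics, topics are mixtures of words" idea can be made concrete with a small calculation. The sketch below uses made-up probabilities (the topic names and all numbers are assumptions for illustration, not values from the RACE data) to show how the probability of a word in a document follows from the two mixtures.

```python
# Hypothetical numbers for illustration only; not taken from the RACE data.
doc_topic = {"space": 0.7, "wildlife": 0.3}  # P(topic | document)
topic_word = {                                # P(word | topic)
    "space": {"planet": 0.5, "galaxy": 0.4, "species": 0.1},
    "wildlife": {"planet": 0.1, "galaxy": 0.0, "species": 0.9},
}


def word_probability(word, doc_topic, topic_word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("planet", "galaxy", "species"):
    print(word, round(word_probability(word, doc_topic, topic_word), 2))
```

A topic model works in the opposite direction: it sees only the words in each document and estimates both tables.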

Data Overview

The dataset has roughly 25,000 documents containing words of various parts of speech, such as nouns, adjectives, verbs, and prepositions. Document length varies widely, from a minimum of around 40 words to a maximum of around 500 words. The data is split 90% for training, with the remaining 10% held out to show how topics are predicted for unseen documents.
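A 90/10 split like the one described can be sketched with the standard library alone. The function name, the seed, and the placeholder corpus below are assumptions for illustration; the project may well use a different splitting utility.

```python
import random


def split_documents(documents, test_fraction=0.1, seed=42):
    """Shuffle a copy of the documents and hold out `test_fraction` of them."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder corpus; the real project would load the RACE documents here.
documents = [f"document {i}" for i in range(25000)]
train_docs, test_docs = split_documents(documents)
print(len(train_docs), len(test_docs))
```

Fixing the seed keeps the held-out documents the same across runs, which makes results from different models comparable.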

Objective

To extract or identify a dominant topic from each document and perform topic modeling.

Tools and Libraries

We will be using Python for all operations.

The main libraries used are:

Pandas for data manipulation and aggregation
Matplotlib and Bokeh for visualizing how documents are structured
NumPy for computationally efficient operations
scikit-learn and Gensim for topic modeling
NLTK for text cleaning and preprocessing
t-SNE and pyLDAvis for visualizing topics

Approach

Topic EDA

Top words within topics using WordCloud
Topic distribution using t-SNE
Topic distribution and word importance within topics using the interactive tool pyLDAvis

Documents Pre-processing

Lowercasing all words in the documents and removing everything except alphabetic characters.
Tokenizing each sentence, lemmatizing each word, and storing a word in a list only if it is not a stop word and is longer than 3 characters.
Joining the list back into a document, and also keeping the lemmatized tokens for NMF topic modeling.
Transforming the pre-processed documents using the TF-IDF or Count Vectorizer, depending on the chosen algorithm.
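The cleaning steps above can be sketched roughly as follows. To keep the example dependency-free, the stop-word set is a tiny stand-in and the lemmatizer is a placeholder that returns the word unchanged; both are assumptions of this sketch, and the function names are hypothetical. The project itself uses nltk for these steps.

```python
import re

# Tiny stand-in stop-word list (an assumption for this sketch); nltk's
# English stop-word list is much larger.
STOP_WORDS = {"the", "and", "are", "with", "this", "that", "from", "have", "were"}


def identity_lemmatizer(word):
    """Placeholder: returns the word unchanged. Swap in a real lemmatizer."""
    return word


def clean_document(text, lemmatize=identity_lemmatizer, min_length=4):
    """Lowercase, keep letters only, tokenize, lemmatize, and filter tokens."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = []
    for token in text.split():
        lemma = lemmatize(token)
        # Keep only non-stop-words longer than 3 characters.
        if lemma not in STOP_WORDS and len(lemma) >= min_length:
            tokens.append(lemma)
    return tokens


tokens = clean_document("The planets and 2 moons were orbiting with the stars!")
clean_text = " ".join(tokens)  # joined form for the vectorizers
print(tokens)
print(clean_text)
```

With nltk installed and its WordNet data downloaded, `WordNetLemmatizer().lemmatize` from `nltk.stem` could be passed as the `lemmatize` argument. Both `tokens` and `clean_text` are kept because the token list is later used for NMF and the joined text for the vectorizers.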

Topic Modelling algorithms

Latent Semantic Analysis, also called Latent Semantic Indexing (LSA)
Latent Dirichlet Allocation (LDA)
Non-Negative Matrix Factorization (NMF)
Coherence score, a popular topic modeling metric
Predicting a set of topics and the dominant topic for each document
Running a Python script end to end from the command prompt

Code Overview

The complete dataset is split into 90% for training and 10% for predicting on unseen documents.
Preprocessing is done to reduce noise:

Lowercasing all words, replacing words with their normal form, and keeping only alphabetic characters.
Building a new document after tokenizing each sentence and lemmatizing every word.

For LSA and LDA Topic Modeling

The TF-IDF Vectorizer (for LSA) and the CountVectorizer (for LDA) are fitted and applied to the cleaned documents, and topics are extracted using the scikit-learn LSA and LDA implementations respectively, with 10 topics for both algorithms.

For NMF Topic Modeling

The TF-IDF Vectorizer is fitted and applied to the clean tokens, and 13 topics are extracted; this number was chosen using the coherence score.

Topic distribution is analyzed using the t-SNE algorithm and the interactive tool pyLDAvis.
For unseen documents, topics are predicted using the above three algorithms.
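Predicting topics for unseen documents might look like the sketch below, shown for LDA only. The toy training and unseen documents are assumptions; the key point is that the vectorizer fitted on the training data is reused, and the dominant topic is the one with the largest share in each document's topic distribution.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the training split and the held-out documents.
train_docs = [
    "planet satellite universe galaxy asteroid",
    "galaxy planet orbit satellite universe",
    "ecosystem species animal plant landscape",
    "animal species plant ecosystem forest",
]
unseen_docs = ["satellite orbit planet", "forest animal plant"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)

# Reuse the fitted vectorizer so unseen documents share the training vocabulary.
X_unseen = vectorizer.transform(unseen_docs)
topic_distribution = lda.transform(X_unseen)        # one row per document
dominant_topic = topic_distribution.argmax(axis=1)  # index of the largest share

for doc, dist, topic in zip(unseen_docs, topic_distribution, dominant_topic):
    print(f"{doc!r}: dominant topic {topic}, distribution {dist.round(2)}")
```

Words in an unseen document that never appeared in training are silently ignored by `vectorizer.transform`, so very short or off-domain documents can end up with a near-uniform topic distribution.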


Topics Covered

Introduction - Problem Statement 06m
Splitting documents into train test 01m
Cleaning the documents 04m
EDA on documents: top words and document lengths 04m
Understanding Topic Modeling LSA and TF-IDF Vectorizer 11m
Distribution of topics over documents and words over topics 05m
Visualizing topic distribution using t-SNE 03m
Visualizing top occurring words in topics using WordCloud 02m
Predictions on unseen documents using LSA 03m
Understanding Topic Modeling LDA and Count Vectorizer 04m
Training the model using LDA and checking metrics 04m
Finding optimal parameters using GridSearchCV 04m
Visualizing topic distribution using t-SNE and pyLDAvis 11m
Understanding popular topic modeling metric 10m
Understanding Topic Modeling NMF 03m
Finding optimal parameters using Coherence Score 04m
Visualizing topics distribution and words relevance using pyLDAvis 08m
Modular Code Overview and training and predicting topics using NMF and LDA 07m
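The GridSearchCV step in the outline above can be sketched as follows. This is one plausible setup, not necessarily the project's: the toy corpus, the parameter grid, and the 2-fold cross-validation are all assumptions. With no explicit scorer, `GridSearchCV` falls back on the estimator's own `score` method, which for scikit-learn's LDA is an approximate log-likelihood.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the cleaned training documents.
docs = [
    "planet satellite universe galaxy asteroid",
    "galaxy planet orbit satellite universe",
    "asteroid orbit planet galaxy",
    "universe satellite asteroid orbit",
    "ecosystem species animal plant landscape",
    "animal species plant ecosystem forest",
    "forest landscape animal species",
    "plant forest ecosystem landscape",
]
X = CountVectorizer().fit_transform(docs)

# Candidate topic counts; a real search would cover a wider range.
param_grid = {"n_components": [2, 3, 4]}
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid,
    cv=2,
)
search.fit(X)

best_lda = search.best_estimator_
print("best parameters:", search.best_params_)
print("perplexity on the training data:", best_lda.perplexity(X))
```

Lower perplexity indicates a better statistical fit, but it does not always track how interpretable the topics are, which is why the outline also covers the coherence score.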
