How to create an LDA topic model using Gensim

In this recipe, we will learn to create an LDA topic model using the Gensim library. We will make use of the 20 Newsgroups dataset to create the LDA model.
Last Updated: 06 Sep 2022


Recipe Objective: How to create an LDA topic model using Gensim?

LDA (Latent Dirichlet Allocation) approaches topic modelling by assigning the text in a document to particular topics. To do so, it builds two models based on Dirichlet distributions: a model of topics per document and a model of words per topic.

Given a corpus, the LDA algorithm repeatedly rearranges these two distributions, the distribution of topics within each document and the distribution of keywords within each topic, until it arrives at a good composition of the topic-keyword distribution.


Some of the assumptions LDA makes during processing are:
Every document is modelled as a multinomial distribution over topics.
Every topic is modelled as a multinomial distribution over words.
Because LDA assumes that each piece of text contains related terms, we should choose the corpus carefully. LDA also assumes that each document is made up of a mixture of topics.
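The assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative view using NumPy only; the toy vocabulary, the number of topics, the document length, and the Dirichlet parameter 0.5 are all made up for illustration and are not taken from the recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
vocab = ["space", "orbit", "nasa", "engine", "car", "dealer"]
num_topics = 2
doc_length = 8

# One multinomial word distribution per topic, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=num_topics)

# One multinomial topic distribution for a single document.
doc_topic = rng.dirichlet(alpha=np.full(num_topics, 0.5))

# Generate the document: pick a topic for each position, then a word from that topic.
words = []
for _ in range(doc_length):
    topic_id = rng.choice(num_topics, p=doc_topic)
    word_id = rng.choice(len(vocab), p=topic_word[topic_id])
    words.append(vocab[word_id])

print("topic mixture of the document:", doc_topic.round(3))
print("generated document:", " ".join(words))
```

Training an LDA model works in the opposite direction: it starts from the observed words and estimates the topic-word and document-topic distributions that plausibly produced them.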

To extract the topics that are naturally discussed in the dataset, we'll use LDA (Latent Dirichlet Allocation). We'll work with the '20 Newsgroups' dataset, which contains roughly 18,000 newsgroup posts spread across 20 topics. It is available through the sklearn.datasets module.

# importing required libraries
import re
from pprint import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import spacy
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
import pyLDAvis
import pyLDAvis.gensim_models

nltk.download('stopwords')
# the spaCy model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# importing the stopwords and extending them with corpus-specific words
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'for'])

# downloading the data
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]        # collapse whitespace
data = [re.sub(r"'", "", sent) for sent in data]           # remove single quotes

# cleaning the text: lowercase, tokenize and strip accents/punctuation
def tokenize(sentences):
    for sentence in sentences:
        yield simple_preprocess(str(sentence), deacc=True)

processed_data = list(tokenize(data))

# building bigram and trigram models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# function to filter out stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

# function to create bigrams
def create_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# function to create trigrams (the original version was missing the return)
def create_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# function for lemmatization, keeping only the given part-of-speech tags
def lemmatize(texts, allowed_postags=('NOUN', 'ADJ', 'VERB')):
    texts_op = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_op

# removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

# printing the lemmatized data
print(data_lemmatized[:3])

# creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized

# building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

# printing the corpus we created above
print(gensim_corpus[:3])

# we can print the words with their frequencies
pprint([[(gensim_dictionary[word_id], freq) for word_id, freq in cp]
        for cp in gensim_corpus[:4]])

# creating the LDA model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=gensim_corpus,
    id2word=gensim_dictionary,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

With the trained model, we can now inspect the topics and compute the model perplexity, coherence score, and so on.



