How to view topics in an LDA topic model in Gensim

In this recipe, we will first create an LDA model using the gensim library in Python and then go through the steps to view the topics in the model.

Recipe Objective: How to view topics in the LDA topic model in Gensim?

First, create or load an LDA model as we did in the previous recipe by following the steps given below:

#importing required libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
#loading the spaCy English model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#importing the Stopwords to use them
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use','for'])

#downloading the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
#removing email addresses, collapsing whitespace and dropping single quotes
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"'", "", sent) for sent in data]

#cleaning the text: tokenizing each document (lowercased, accents removed)
def tokeniz(sentences):
  for sentence in sentences:
    yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
processed_data = list(tokeniz(data))

#Building Bigram & Trigram Models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#function to filter out stopwords
def remove_stopwords(texts):
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

#function to create bigrams
def create_bigrams(texts):
  return [bigram_mod[doc] for doc in texts]

#function to create trigrams
def create_trigrams(texts):
  return [trigram_mod[bigram_mod[doc]] for doc in texts]

#function for lemmatization (keeps only tokens whose part of speech is in allowed_postags)
def lemmatize(texts, allowed_postags=('NOUN', 'ADJ', 'VERB')):
  texts_op = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_op

#removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=[ 'NOUN', 'ADJ', 'VERB'])

#printing the lemmatized data
print(data_lemmatized[:3])

#creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

#building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

#printing the corpus we created above.
print(gensim_corpus[:3])

#printing the words with their frequencies for the first four documents
pprint([[(gensim_dictionary[word_id], freq) for word_id, freq in doc] for doc in gensim_corpus[:4]])

#creating the LDA model with 20 topics
lda_model = gensim.models.ldamodel.LdaModel(
  corpus=gensim_corpus, id2word=gensim_dictionary, num_topics=20, random_state=100,
  update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

#viewing topics (by default print_topics shows up to 20 topics with 10 words each)
pprint(lda_model.print_topics())

Output:
[(0,
  '0.017*"year" + 0.017*"new" + 0.015*"make" + 0.011*"work" + 0.011*"number" + '
  '0.010*"will" + 0.010*"use" + 0.010*"may" + 0.009*"high" + 0.009*"large"'),
 (1,
  '0.047*"line" + 0.046*"would" + 0.042*"write" + 0.027*"article" + '
  '0.025*"know" + 0.024*"be" + 0.024*"go" + 0.022*"get" + 0.020*"think" + '
  '0.018*"good"'),
 (2,
  '0.038*"man" + 0.016*"straight" + 0.015*"male" + 0.015*"homosexual" + '
  '0.014*"sex" + 0.014*"marriage" + 0.013*"helmet" + 0.013*"gay" + '
  '0.013*"mirror" + 0.012*"creation"'),
 (3,
  '0.030*"mail" + 0.029*"include" + 0.027*"send" + 0.023*"post" + 0.020*"list" '
  '+ 0.020*"source" + 0.018*"information" + 0.017*"address" + 0.016*"email" + '
  '0.015*"book"'),
 (4,
  '0.072*"car" + 0.024*"drug" + 0.023*"distribution_usa" + 0.020*"drive" + '
  '0.019*"model" + 0.018*"engine" + 0.013*"insist" + 0.012*"road" + '
  '0.012*"dealer" + 0.012*"buy"'),
 (5,
  '0.032*"power" + 0.023*"light" + 0.018*"cut" + 0.017*"lebanese" + '
  '0.015*"notice" + 0.014*"bus" + 0.012*"route" + 0.011*"cool" + '
  '0.011*"external" + 0.010*"master"'),
 (6,
  '0.025*"kill" + 0.016*"people" + 0.015*"child" + 0.014*"attack" + '
  '0.013*"say" + 0.013*"death" + 0.013*"war" + 0.012*"soldier" + '
  '0.010*"murder" + 0.010*"village"'),
 (7,
  '0.129*"ax" + 0.109*"max" + 0.051*"bike" + 0.025*"di_di" + 0.019*"ride" + '
  '0.018*"rider" + 0.016*"dog" + 0.008*"biker" + 0.008*"cub" + 0.007*"dare"'),
 (8,
  '0.049*"report" + 0.024*"slave" + 0.020*"brain" + 0.016*"mount" + '
  '0.015*"medium" + 0.014*"laugh" + 0.014*"reference" + 0.014*"beat" + '
  '0.012*"tumor" + 0.012*"mine"'),
 (9,
  '0.038*"key" + 0.016*"system" + 0.015*"use" + 0.014*"test" + 0.014*"entry" + '
  '0.013*"technology" + 0.012*"public" + 0.011*"provide" + 0.011*"encryption" '
  '+ 0.011*"phone"'),
 (10,
  '0.022*"say" + 0.022*"believe" + 0.021*"faith" + 0.016*"religion" + '
  '0.014*"people" + 0.013*"truth" + 0.013*"atheist" + 0.012*"belief" + '
  '0.010*"church" + 0.010*"man"'),
 (11,
  '0.056*"game" + 0.041*"year" + 0.038*"team" + 0.031*"play" + 0.031*"player" '
  '+ 0.017*"run" + 0.017*"field" + 0.017*"score" + 0.016*"division" + '
  '0.014*"last"'),
 (12,
  '0.038*"people" + 0.024*"state" + 0.019*"right" + 0.017*"law" + 0.014*"gun" '
  '+ 0.012*"government" + 0.011*"would" + 0.011*"case" + 0.010*"person" + '
  '0.010*"god"'),
 (13,
  '0.044*"space" + 0.027*"speed" + 0.020*"device" + 0.017*"scsi" + '
  '0.016*"design" + 0.016*"performance" + 0.015*"launch" + 0.014*"compare" + '
  '0.014*"datum" + 0.012*"orbit"'),
 (14,
  '0.022*"reason" + 0.021*"evidence" + 0.018*"may" + 0.014*"point" + '
  '0.014*"claim" + 0.013*"sense" + 0.012*"exist" + 0.010*"question" + '
  '0.010*"make" + 0.009*"must"'),
 (15,
  '0.037*"file" + 0.034*"program" + 0.034*"window" + 0.021*"use" + '
  '0.020*"image" + 0.019*"set" + 0.018*"problem" + 0.015*"version" + '
  '0.015*"solution" + 0.015*"screen"'),
 (16,
  '0.027*"team" + 0.023*"season" + 0.021*"fan" + 0.017*"wing" + 0.016*"trade" '
  '+ 0.015*"box" + 0.014*"playoff" + 0.013*"play" + 0.012*"pen" + 0.012*"cop"'),
 (17,
  '0.027*"pin" + 0.026*"israeli" + 0.025*"suggest" + 0.017*"period" + '
  '0.015*"lead" + 0.015*"greek" + 0.013*"peace" + 0.013*"pro" + '
  '0.012*"examine" + 0.012*"position"'),
 (18,
  '0.035*"sale" + 0.018*"item" + 0.016*"food" + 0.014*"research" + '
  '0.013*"doctor" + 0.013*"cd" + 0.013*"diagnosis" + 0.012*"pain" + '
  '0.011*"treatment" + 0.011*"body"'),
 (19,
  '0.042*"drive" + 0.033*"system" + 0.028*"card" + 0.022*"software" + '
  '0.021*"thank" + 0.020*"computer" + 0.019*"use" + 0.019*"machine" + '
  '0.018*"bit" + 0.015*"color"')]

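Each entry in the output pairs a topic id with the ten highest-weighted words of that topic, and the number in front of each word is the word's weight in the topic. For example, topic 4 is dominated by "car", "drive" and "engine", while topic 11 ("game", "team", "player") is about sports. In this way, the LDA model (lda_model) we created above lets us view the topics found in the documents.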
