How to create an HDP topic model in Gensim

In this recipe, we will learn how to create an HDP (Hierarchical Dirichlet Process) topic model using Gensim in Python.

Recipe Objective: How to create an HDP topic model in Gensim?

Aside from LDA and LSI, HDP (Hierarchical Dirichlet Process) is another useful topic model available in Gensim. It is essentially a mixed-membership model for unsupervised data analysis. Unlike LDA, its finite counterpart, HDP infers the number of topics from the data. To build an HDP model in Gensim, we first need a trained dictionary and corpus, just as when implementing the LDA and LSI topic models. We will again apply the topic model to the 20 Newsgroups data; the steps are the same.

#importing required libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#importing the Stopwords to use them
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use','for'])

#downloading the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove email addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]  # collapse whitespace
data = [re.sub(r"\'", "", sent) for sent in data]  # remove single quotes

#cleaning the text
def tokenize(sentences):
  for sentence in sentences:
    yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
processed_data = list(tokenize(data))

#Building Bigram & Trigram Models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#function to filter out stopwords
def remove_stopwords(texts):
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

#function to create bigrams
def create_bigrams(texts):
  return [bigram_mod[doc] for doc in texts]

#function to create trigrams
def create_trigrams(texts):
  return [trigram_mod[bigram_mod[doc]] for doc in texts]

#function for lemmatization
def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB']):
  texts_op = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_op

#removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

#printing the lemmatized data
print(data_lemmatized[:3])

#creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

#building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

#printing the corpus we created above.
print(gensim_corpus[:3])

#we can print the words with their frequencies.
[[(gensim_dictionary[id], freq) for id, freq in cp] for cp in gensim_corpus[:4]]

#creating hdp model
hdp_model = gensim.models.hdpmodel.HdpModel(corpus=gensim_corpus, id2word=gensim_dictionary)

#viewing topics
pprint(hdp_model.print_topics())

Output:
[(0,
  '0.011*would + 0.010*line + 0.009*write + 0.006*say + 0.006*know + '
  '0.006*article + 0.006*people + 0.005*make + 0.005*get + 0.005*go'),
 (1,
  '0.013*line + 0.011*would + 0.011*write + 0.007*article + 0.006*know + '
  '0.006*say + 0.006*people + 0.006*think + 0.005*be + 0.005*make'),
 (2,
  '0.012*say + 0.010*line + 0.009*would + 0.009*go + 0.008*write + 0.007*know '
  '+ 0.006*people + 0.006*come + 0.006*see + 0.006*think'),
 (3,
  '0.014*line + 0.010*write + 0.009*would + 0.007*article + 0.006*know + '
  '0.005*say + 0.005*get + 0.005*make + 0.005*be + 0.004*go'),
 (4,
  '0.013*line + 0.009*write + 0.008*would + 0.006*article + 0.006*know + '
  '0.004*think + 0.004*say + 0.004*be + 0.004*good + 0.004*make'),
 (5,
  '0.227*ax + 0.182*max + 0.004*di_di + 0.001*bhjn + 0.001*wt + 0.001*part + '
  '0.001*would + 0.001*wwiz + 0.001*wm_wm + 0.001*pl_pl'),
 (6,
  '0.004*argument + 0.004*line + 0.003*fallacy + 0.003*would + 0.003*write + '
  '0.003*example + 0.003*true + 0.002*conclusion + 0.002*use + 0.002*say'),
 (7,
  '0.002*line + 0.002*kill + 0.002*greek + 0.002*write + 0.002*people + '
  '0.001*know + 0.001*would + 0.001*article + 0.001*say + 0.001*turkish'),
 (8,
  '0.003*line + 0.002*would + 0.002*write + 0.001*say + 0.001*thank + '
  '0.001*game + 0.001*playoff + 0.001*get + 0.001*article + 0.001*car'),
 (9,
  '0.003*line + 0.002*would + 0.002*point + 0.002*know + 0.002*write + '
  '0.001*new + 0.001*article + 0.001*say + 0.001*water + 0.001*find'),
 (10,
  '0.002*would + 0.002*line + 0.001*write + 0.001*people + 0.001*go + '
  '0.001*say + 0.001*eternal + 0.001*get + 0.001*day + 0.001*may'),
 (11,
  '0.002*people + 0.002*software + 0.002*write + 0.002*would + 0.001*line + '
  '0.001*level + 0.001*get + 0.001*think + 0.001*article + 0.001*process'),
 (12,
  '0.003*year + 0.002*be + 0.002*would + 0.002*go + 0.001*line + 0.001*car + '
  '0.001*rate + 0.001*say + 0.001*insurance + 0.001*game'),
 (13,
  '0.002*would + 0.001*line + 0.001*think + 0.001*write + 0.001*article + '
  '0.001*mean + 0.001*say + 0.001*time + 0.001*use + 0.001*nature'),
 (14,
  '0.003*would + 0.002*line + 0.002*write + 0.002*article + 0.001*see + '
  '0.001*say + 0.001*be + 0.001*make + 0.001*think + 0.001*group'),
 (15,
  '0.002*line + 0.002*period + 0.001*scorer_pt + 0.001*would + 0.001*write + '
  '0.001*second + 0.001*first + 0.001*lead + 0.001*know + 0.001*see'),
 (16,
  '0.001*would + 0.001*team + 0.001*mission + 0.001*water + 0.001*launch + '
  '0.001*write + 0.001*line + 0.001*know + 0.001*probe + 0.001*player'),
 (17,
  '0.002*line + 0.001*would + 0.001*write + 0.001*option + 0.001*power + '
  '0.001*use + 0.001*think + 0.001*thank + 0.001*know + 0.001*host'),
 (18,
  '0.004*good + 0.002*excellent + 0.002*miss + 0.001*cover + 0.001*line + '
  '0.001*include + 0.001*poster + 0.001*dragon + 0.001*uccxkvb + 0.001*fair'),
 (19,
  '0.001*belief + 0.001*line + 0.001*would + 0.001*exhaust + 0.001*think + '
  '0.001*write + 0.001*people + 0.001*article + 0.001*pressure + '
  '0.001*believe')]
