How to compute coherence score of an LDA model in Gensim

In this recipe, we will build an LDA model in Gensim, compute its coherence score, and see what that score tells us about the quality of the topics.

Recipe Objective: How to compute the coherence score of an LDA model in Gensim?

First, create or load an LDA model as we did in the previous recipe, following the steps below:

#importing required libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import nltk

#downloading the NLTK stopword list and loading the spaCy model (parser and NER are not needed)
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#importing the Stopwords to use them
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use','for'])

#downloading the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  #removing e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]  #collapsing runs of whitespace
data = [re.sub(r"'", "", sent) for sent in data]  #removing single quotes

#tokenizing the text (lowercasing, dropping punctuation and accents)
def tokeniz(sentences):
  for sentence in sentences:
    yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
processed_data = list(tokeniz(data))

#Building Bigram & Trigram Models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#function to filter out stopwords
def remove_stopwords(texts):
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

#function to create bigrams
def create_bigrams(texts):
  return [bigram_mod[doc] for doc in texts]

#function to create trigrams
def create_trigrams(texts):
  return [trigram_mod[bigram_mod[doc]] for doc in texts]

#function for lemmatization
def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB']):
  texts_op = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_op

#removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=[ 'NOUN', 'ADJ', 'VERB'])

#printing the lemmatized data
print(data_lemmatized[:3])

#creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

#building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

#printing the corpus we created above.
print(gensim_corpus[:3])

#printing the words with their frequencies for the first four documents
print([[(gensim_dictionary[token_id], freq) for token_id, freq in cp] for cp in gensim_corpus[:4]])

#creating the LDA model
lda_model = gensim.models.ldamodel.LdaModel(
  corpus=gensim_corpus, id2word=gensim_dictionary, num_topics=20, random_state=100,
  update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

#calculating and displaying the coherence score
coherence_model_lda = CoherenceModel(
  model=lda_model, texts=data_lemmatized, dictionary=gensim_dictionary, coherence='c_v'
)
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Output:
Coherence Score:  0.4706850590438568

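The coherence score is computed from the LDA model (lda_model) we created above, together with the lemmatized texts and the dictionary. For each topic, the measure scores how strongly the topic's top words co-occur in the texts; the per-topic scores are then averaged to give the single value printed above. With 'c_v', higher values generally indicate topics whose words belong together more clearly.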
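
To make the idea of averaging word-pair scores concrete, here is a small hand-rolled sketch using only the standard library. It is a simplified UMass-style calculation, not Gensim's 'c_v' implementation: the function name umass_style_coherence, the toy documents, and the two topics are all hypothetical, and every topic word is assumed to appear in at least one document.

```python
import math
from itertools import combinations

def umass_style_coherence(topic_words, documents):
    """Average of log((D(earlier, later) + 1) / D(earlier)) over all word pairs,
    where D(...) counts the documents that contain all the given words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    pair_scores = []
    # combinations() keeps the topic's word order, so `earlier` precedes `later`.
    for earlier, later in combinations(topic_words, 2):
        joint = doc_count(earlier, later)
        pair_scores.append(math.log((joint + 1) / doc_count(earlier)))
    return sum(pair_scores) / len(pair_scores)

# Hypothetical toy documents.
docs = [
    ["space", "nasa", "orbit"],
    ["space", "nasa", "launch"],
    ["hockey", "team", "game"],
    ["hockey", "team", "player"],
    ["space", "orbit", "launch"],
]

coherent_score = umass_style_coherence(["space", "nasa", "orbit"], docs)
mixed_score = umass_style_coherence(["space", "hockey", "orbit"], docs)

print("coherent topic:", coherent_score)
print("mixed topic:   ", mixed_score)
```

The topic whose words appear together in the same documents scores higher than the topic that mixes space and hockey words, which is the behavior a coherence measure is meant to capture.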
