How to compute coherence score of an LDA model in Gensim

In this recipe, we will create an LDA model, compute its coherence score, and learn what a good coherence score is.
Last Updated: 26 Dec 2022


Recipe Objective: How to compute the coherence score of an LDA model in Gensim?

First, create or load an LDA model as we did in the previous recipe by following the steps given below:

#importing required libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import nltk

nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#importing the stopwords to use them
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'for'])

#downloading the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data

#cleaning the text: remove e-mail addresses, collapse whitespace, drop single quotes
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
data = [re.sub(r'\s+', ' ', sent) for sent in data]
data = [re.sub(r"\'", "", sent) for sent in data]

#tokenizing the text
def tokeniz(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

processed_data = list(tokeniz(data))

#building bigram and trigram models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#function to filter out stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

#function to create bigrams
def create_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

#function to create trigrams (the original was missing the return statement)
def create_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

#function for lemmatization
def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB']):
    texts_op = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_op

#removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

#printing the lemmatized data
print(data_lemmatized[:3])

#creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized

#building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

#printing the corpus we created above
print(gensim_corpus[:3])

#we can print the words with their frequencies
[[(gensim_dictionary[word_id], freq) for word_id, freq in cp] for cp in gensim_corpus[:4]]

#creating the LDA model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=gensim_corpus,
    id2word=gensim_dictionary,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)

#calculating and displaying the coherence score
coherence_model_lda = CoherenceModel(
    model=lda_model,
    texts=data_lemmatized,
    dictionary=gensim_dictionary,
    coherence='c_v'
)
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Output:
Coherence Score:  0.4706850590438568

The coherence score is computed from the LDA model (lda_model) we created above. It is the average or median of the pairwise word-similarity scores of the words in a topic.
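To make the idea of "average or median of pairwise word-similarity scores" concrete, here is a simplified sketch in plain Python. It is not how Gensim computes c_v. The similarity used here is an assumption chosen for simplicity: the Jaccard overlap between the sets of documents in which two words appear. The toy documents and the helper names word_similarity and topic_score are likewise hypothetical.

```python
from itertools import combinations
from statistics import mean, median

# Toy documents as sets of words (hypothetical data, for illustration only).
docs = [
    {"car", "engine", "road"},
    {"car", "road", "driver"},
    {"space", "rocket", "orbit"},
    {"rocket", "orbit", "car"},
]

def word_similarity(w1, w2, documents):
    """Jaccard overlap of the document sets containing each word.

    This is a stand-in similarity, not the measure Gensim uses.
    """
    docs_w1 = {i for i, d in enumerate(documents) if w1 in d}
    docs_w2 = {i for i, d in enumerate(documents) if w2 in d}
    union = docs_w1 | docs_w2
    if not union:
        return 0.0
    return len(docs_w1 & docs_w2) / len(union)

def topic_score(top_words, documents, aggregate=mean):
    """Aggregate the similarity over every pair of a topic's top words."""
    scores = [word_similarity(a, b, documents) for a, b in combinations(top_words, 2)]
    return aggregate(scores)

coherent_topic = ["car", "road", "driver"]   # words that tend to occur together
mixed_topic = ["car", "rocket", "engine"]    # words from unrelated contexts

print("coherent topic (mean):  ", topic_score(coherent_topic, docs))
print("coherent topic (median):", topic_score(coherent_topic, docs, aggregate=median))
print("mixed topic (mean):     ", topic_score(mixed_topic, docs))
```

The topic whose words share documents gets the higher score, which is the intuition behind reading a higher coherence value as a sign of more interpretable topics.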
