How to create LSI topic model in Gensim

In this recipe, we will learn how to create an LSI (Latent Semantic Indexing) topic model using Gensim. LSI is an NLP technique that is particularly useful in distributional semantics.

Recipe Objective: How to create an LSI topic model in Gensim?

LSI is an NLP technique that is particularly useful in distributional semantics. It examines the relationships between a set of documents and the terms those documents contain. To do so, it builds a matrix of per-document word counts from a large chunk of text.

The LSI model then applies a matrix factorization known as singular value decomposition (SVD), which reduces the number of rows while preserving the similarity structure among the columns.


In this matrix, the rows represent distinct words and the columns represent the documents. The approach rests on the distributional hypothesis, which states that words with similar meanings tend to appear in similar kinds of text.
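As a minimal sketch of this decomposition (using NumPy on a tiny hypothetical count matrix, not the recipe's data), keeping only the top-k singular values yields a low-rank approximation of the term-document matrix:

```python
import numpy as np

# Hypothetical term-document count matrix: rows = words, columns = documents
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "space"
    [1.0, 0.0, 2.0, 0.0],   # "launch"
    [0.0, 3.0, 0.0, 1.0],   # "file"
    [0.0, 1.0, 0.0, 2.0],   # "image"
])

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the top k singular values (the "topics")
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_k, 2))  # rank-2 approximation of A
```

Gensim's LsiModel performs this truncated SVD incrementally, so the full matrix never has to be held in memory.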

#importing required libraries
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#importing the Stopwords to use them
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use','for'])

#downloading the data
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  #remove emails
data = [re.sub(r'\s+', ' ', sent) for sent in data]        #collapse whitespace
data = [re.sub(r"\'", "", sent) for sent in data]          #remove single quotes

#cleaning the text
def tokenize(sentences):
  for sentence in sentences:
    yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

processed_data = list(tokenize(data))

#Building Bigram & Trigram Models
bigram = gensim.models.Phrases(processed_data, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[processed_data], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

#function to filter out stopwords
def remove_stopwords(texts):
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

#function to create bigrams
def create_bigrams(texts):
  return [bigram_mod[doc] for doc in texts]

#function to create trigrams
def create_trigrams(texts):
  return [trigram_mod[bigram_mod[doc]] for doc in texts]

#function for lemmatization
def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB']):
  texts_op = []
  for sent in texts:
    doc = nlp(" ".join(sent))
    texts_op.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  return texts_op

#removing stopwords, creating bigrams and lemmatizing the text
data_wo_stopwords = remove_stopwords(processed_data)
data_bigrams = create_bigrams(data_wo_stopwords)
data_lemmatized = lemmatize(data_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

#printing the lemmatized data
print(data_lemmatized[:3])

#creating a dictionary
gensim_dictionary = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

#building a corpus for the topic model
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in texts]

#printing the corpus we created above.
print(gensim_corpus[:3])

#printing the words with their frequencies
print([[(gensim_dictionary[id], freq) for id, freq in cp] for cp in gensim_corpus[:4]])

#creating the LSI model
lsi_model = gensim.models.lsimodel.LsiModel(
  corpus=gensim_corpus, id2word=gensim_dictionary, num_topics=20, chunksize=100
)

#viewing topics
pprint(lsi_model.print_topics())

Output:
[(0,
  '0.768*"ax" + 0.641*"max" + 0.007*"di_di" + 0.004*"bhjn" + 0.003*"pl_pl" + '
  '0.003*"part" + 0.003*"would" + 0.002*"wt" + 0.002*"wwiz" + 0.002*"pne"'),
 (1,
  '0.254*"say" + 0.219*"would" + 0.203*"file" + 0.201*"go" + 0.177*"know" + '
  '0.176*"people" + 0.147*"make" + 0.142*"may" + 0.136*"see" + 0.131*"think"'),
 (2,
  '0.460*"file" + -0.321*"say" + -0.233*"go" + 0.172*"program" + 0.170*"image" '
  '+ -0.156*"know" + -0.151*"people" + -0.137*"would" + -0.135*"think" + '
  '0.135*"available"'),
 (3,
  '-0.607*"file" + -0.282*"entry" + 0.179*"system" + -0.172*"say" + '
  '0.147*"use" + 0.132*"available" + 0.104*"wire" + -0.103*"go" + '
  '-0.101*"output" + 0.092*"server"'),
 (4,
  '0.431*"image" + -0.182*"entry" + 0.172*"say" + 0.164*"color" + '
  '0.162*"available" + 0.153*"jpeg" + 0.146*"version" + -0.142*"wire" + '
  '0.139*"go" + 0.133*"format"'),
 (5,
  '0.338*"db_db" + 0.308*"wire" + -0.207*"internet" + 0.194*"entry" + '
  '-0.186*"privacy" + 0.155*"wiring" + 0.142*"circuit" + 0.141*"outlet" + '
  '0.131*"bit" + -0.131*"mail"'),
 (6,
  '-0.889*"db_db" + -0.155*"bit" + 0.142*"wire" + -0.082*"internet" + '
  '-0.081*"byte" + -0.079*"push" + -0.076*"privacy" + 0.071*"wiring" + '
  '-0.070*"mov" + 0.065*"circuit"'),
 (7,
  '-0.350*"file" + 0.342*"entry" + -0.297*"wire" + 0.227*"program" + '
  '-0.145*"wiring" + -0.143*"image" + -0.132*"outlet" + -0.132*"circuit" + '
  '0.122*"line" + 0.122*"build"'),
 (8,
  '-0.298*"would" + 0.226*"say" + -0.206*"image" + 0.194*"entry" + 0.163*"go" '
  '+ -0.162*"write" + -0.152*"article" + 0.133*"wire" + 0.129*"internet" + '
  '0.124*"information"'),
 (9,
  '-0.396*"image" + -0.262*"entry" + 0.251*"file" + 0.230*"widget" + '
  '0.194*"application" + -0.171*"jpeg" + 0.140*"resource" + 0.132*"window" + '
  '-0.125*"format" + 0.124*"value"'),
 (10,
  '-0.306*"launch" + -0.213*"satellite" + -0.192*"year" + 0.180*"atheist" + '
  '0.176*"people" + -0.171*"space" + -0.158*"go" + 0.118*"many" + '
  '0.117*"argument" + -0.111*"market"'),
 (11,
  '-0.565*"drive" + -0.204*"system" + -0.161*"feature" + -0.144*"scsi" + '
  '0.116*"available" + 0.115*"program" + -0.110*"speed" + -0.109*"bit" + '
  '0.109*"president" + 0.107*"package"'),
 (12,
  '0.306*"launch" + 0.211*"satellite" + -0.204*"drive" + 0.196*"space" + '
  '-0.182*"s" + -0.177*"think" + -0.165*"president" + -0.123*"work" + '
  '-0.117*"go" + -0.116*"be"'),
 (13,
  '0.237*"would" + 0.232*"team" + 0.219*"line" + 0.211*"write" + '
  '-0.208*"launch" + -0.198*"atheist" + 0.191*"game" + -0.144*"satellite" + '
  '0.136*"season" + 0.132*"get"'),
 (14,
  '0.230*"available" + 0.182*"include" + -0.176*"value" + 0.167*"support" + '
  '0.161*"atheist" + -0.153*"application" + -0.143*"return" + 0.131*"send" + '
  '-0.131*"widget" + -0.129*"resource"'),
 (15,
  '0.269*"people" + -0.222*"team" + -0.197*"drive" + -0.166*"game" + '
  '0.155*"program" + -0.153*"may" + -0.137*"say" + -0.135*"season" + '
  '0.134*"line" + 0.134*"write"'),
 (16,
  '-0.231*"write" + -0.224*"line" + -0.184*"output" + -0.173*"launch" + '
  '0.149*"people" + 0.145*"entry" + 0.144*"team" + 0.138*"work" + 0.129*"jpeg" '
  '+ -0.121*"satellite"'),
 (17,
  '-0.303*"drive" + 0.266*"bit" + -0.250*"people" + -0.205*"image" + '
  '0.164*"say" + 0.155*"key" + 0.134*"version" + 0.132*"scsi" + 0.126*"color" '
  '+ 0.121*"machine"'),
 (18,
  '-0.222*"server" + -0.217*"people" + -0.176*"launch" + -0.170*"anonymous" + '
  '-0.161*"post" + -0.152*"service" + 0.150*"image" + 0.147*"say" + '
  '0.142*"key" + -0.129*"get"'),
 (19,
  '-0.228*"would" + -0.192*"military" + -0.189*"ship" + -0.177*"secret" + '
  '-0.174*"war" + 0.169*"people" + -0.164*"argument" + 0.158*"atheist" + '
  '-0.145*"island" + -0.136*"attack"')]
