What are the various transformations in Gensim?

In this recipe, we will learn about several transformations available in Gensim. These include TF-IDF, LSI, LDA, HDP and RP.

Recipe Objective: What are the various transformations in Gensim?

Gensim implements several widely used Vector Space Model transformations. The following are a few of them:

Tf-Idf (Term Frequency-Inverse Document Frequency)

During initialization, the tf-idf model expects a training corpus of integer-valued vectors (such as Bag-of-Words counts). Once trained, it takes a vector in that representation and returns a re-weighted vector.


The output vector has the same dimensionality as the input vector, but features that were rare in the training corpus receive higher values. In this way, the model converts integer-valued vectors into real-valued ones. The Tf-idf transformation syntax is as follows:

# tf-idf transformation syntax
from gensim import models
tfidf_model = models.TfidfModel(corpus, normalize=True)
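To make the effect on rare features concrete, here is a minimal pure-Python sketch of the weighting. It is not Gensim's implementation. It assumes the scheme that TfidfModel applies by default, as far as we understand it: raw term frequency multiplied by log2(N / document frequency), followed by L2 normalization when normalize=True. The toy corpus and the helper name tfidf_transform are invented for illustration.

```python
import math

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
# The ids and counts are invented for illustration.
corpus = [
    [(0, 1), (1, 1), (2, 1)],  # token 0 appears in every document
    [(0, 2), (1, 1)],          # token 1 appears in two documents
    [(0, 1), (3, 1)],          # tokens 2 and 3 appear in one document each
]

# "Training": count how many documents contain each token.
doc_freq = {}
for doc in corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

num_docs = len(corpus)
idf = {t: math.log2(num_docs / df) for t, df in doc_freq.items()}


def tfidf_transform(doc):
    """Turn integer counts into L2-normalized, real-valued tf-idf weights."""
    weighted = [(t, count * idf[t]) for t, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    # Drop zero weights, as a sparse representation would.
    return [(t, w / norm) for t, w in weighted if w > 0.0]


first_doc = tfidf_transform(corpus[0])
print(first_doc)
```

Token 0 occurs in every document, so its weight drops to zero, while token 2, which occurs in only one document, ends up with the largest weight in the first document.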

LSI (Latent Semantic Indexing)

The LSI model transforms documents from either an integer-valued vector space (such as Bag-of-Words) or a Tf-Idf weighted space into a latent space. The output vector has a lower dimensionality than the input. The LSI transformation syntax is as follows:

# lsi transformation syntax
lsi_model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

LDA (Latent Dirichlet Allocation)

The LDA model is another approach that transforms documents from the Bag-of-Words space into a topic space. The output vector has a lower dimensionality than the input. The LDA transformation syntax is as follows:

# lda transformation syntax
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

Random Projections (RP)

RP reduces the dimensionality of the vector space very efficiently. It approximates the Tf-Idf distances between documents by projecting them through a random matrix, which is where the randomness comes in. The RP transformation syntax is as follows:

# rp transformation syntax
rp_model = models.RpModel(tfidf_corpus, num_topics=500)
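The idea can be illustrated without Gensim. The sketch below is not RpModel's implementation; it assumes a dense random matrix of +1 and -1 entries scaled by 1/sqrt(k), which is one common way to build a random projection. The document vectors, the sizes and the helper names project and distance are invented for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

num_features = 1000  # dimensionality of the original (for example tf-idf) space
num_topics = 200     # dimensionality after projection

# Random projection matrix with entries +1 or -1.
projection = [
    [random.choice((-1.0, 1.0)) for _ in range(num_features)]
    for _ in range(num_topics)
]


def project(vec):
    """Map a dense vector to the lower-dimensional space."""
    scale = 1.0 / math.sqrt(num_topics)
    return [scale * sum(p * x for p, x in zip(row, vec)) for row in projection]


def distance(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up dense document vectors.
doc_a = [random.random() for _ in range(num_features)]
doc_b = [random.random() for _ in range(num_features)]

original = distance(doc_a, doc_b)
approximate = distance(project(doc_a), project(doc_b))
print(round(original, 2), round(approximate, 2))
```

The two printed distances should be close but not identical: the projection keeps distances approximately while using a fifth of the dimensions. In Gensim, a trained RP model is applied like the other transformations, by indexing it with a document or a corpus.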

The Hierarchical Dirichlet Process (HDP)

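The HDP method is a non-parametric Bayesian method, which is why the call below takes no num_topics argument. It is a comparatively new addition to Gensim, so we should take care when employing it. The HDP transformation syntax is as follows: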

# hdp transformation syntax
hdp_model = models.HdpModel(corpus, id2word=dictionary)
