What is a model in Gensim

In this recipe, we will learn what a model is and how to transform a vectorized corpus using models. We will work through an example using the tf-idf model.

Recipe Objective: What is a model in Gensim?

Once our corpus has been vectorized, we can transform it using models. In Gensim, the term "model" refers to a transformation from one document representation to another. Because Gensim represents documents as vectors, a model can be thought of as a transformation between two vector spaces. The model learns the details of this transformation when it reads the training corpus.

tf-idf is a simple example of a model. The tf-idf model converts vectors from a bag-of-words representation to a vector space where frequency counts are weighted according to each word's relative rarity in the corpus.
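
To make the train-then-transform idea concrete, here is a minimal pure-Python sketch. It is not Gensim's implementation: the class name TinyTfidf and the toy corpus are hypothetical, and the sketch assumes a base-2 logarithm for the inverse document frequency and scaling to unit length, which match Gensim's default settings as far as we know.

```python
import math
from collections import Counter


class TinyTfidf:
    """Hypothetical stand-in for a tf-idf model: train once, then transform."""

    def __init__(self, bow_corpus):
        # "Training": count how many documents each token ID appears in.
        self.num_docs = len(bow_corpus)
        self.doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)

    def __getitem__(self, bow):
        # Weight each count by idf = log2(N / df); skip IDs unseen in training.
        weighted = [
            (token_id, count * math.log2(self.num_docs / self.doc_freq[token_id]))
            for token_id, count in bow
            if token_id in self.doc_freq
        ]
        # Scale the vector to unit length so document size does not dominate.
        norm = math.sqrt(sum(w * w for _, w in weighted))
        return [(token_id, w / norm) for token_id, w in weighted] if norm else []


# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
toy_corpus = [[(0, 1), (1, 1)], [(1, 1), (2, 1)], [(2, 1), (3, 1)], [(2, 1)]]
model = TinyTfidf(toy_corpus)

# Token 0 occurs in one document, token 1 in two, so token 0 is weighted higher.
result = model[[(0, 1), (1, 1)]]
print(result)
```

The constructor plays the role of training (it only records document frequencies), while indexing with square brackets maps a bag-of-words vector into the weighted space, mirroring how a trained Gensim model is applied to a document.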

Here's a simple illustration. Let's get started by training the tf-idf model on our corpus and transforming the string "sample corpus":


# Import the required libraries
from gensim import corpora, models

# Create a small sample corpus for demonstration purposes
txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus"]

# Create a set of frequent words (stopwords) to filter out
stoplist = set('for a of the and to in on are at'.split())

# Lowercase each document, split on whitespace, and drop the stopwords
processed_text = [[word for word in document.lower().split() if word not in stoplist]
                  for document in txt_corpus]

# Create a dictionary that maps each unique token to an integer ID
dictionary = corpora.Dictionary(processed_text)

# Use doc2bow to vectorize the entire corpus
bow_vec = [dictionary.doc2bow(text) for text in processed_text]

# Train the tf-idf model on the bag-of-words corpus
tfidf_model = models.TfidfModel(bow_vec)

# Transform the "sample corpus" string
words = "sample corpus".lower().split()
print(tfidf_model[dictionary.doc2bow(words)])

Output:
[(2, 0.9381453975456102), (5, 0.34624155305796134)]

The tf-idf model returns a list of tuples, with the token ID as the first element and the tf-idf weight as the second. Note that the ID for "corpus" (ID 5, which appears in two of the three documents) is weighted lower than the ID for "sample" (ID 2, which appears in only one).
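
The two weights in the output can be checked by hand. The sketch below assumes Gensim's default tf-idf settings (inverse document frequency computed as log2(N / df), followed by scaling the vector to unit length); the variable names are ours and the document frequencies are read off the three-document sample corpus.

```python
import math

num_docs = 3                            # documents in txt_corpus
doc_freq = {"sample": 1, "corpus": 2}   # number of documents containing each word

# idf = log2(N / df): the rarer word gets the larger raw weight.
# Each word occurs once in "sample corpus", so the term frequency is 1.
raw = {word: math.log2(num_docs / df) for word, df in doc_freq.items()}

# Divide by the vector's Euclidean length to obtain unit-length weights.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {word: w / norm for word, w in raw.items()}

print(weights)  # sample ≈ 0.938, corpus ≈ 0.346
```

Because "sample" occurs in one document and "corpus" in two, the raw weights are log2(3) and log2(1.5), which normalize to the values Gensim printed.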
