How to transform documents using TF-IDF in Gensim

In this recipe, we will learn how to transform documents step by step using TF-IDF with the help of the Gensim library in Python.

Recipe Objective: How to transform documents in Gensim?

We start by creating a corpus and preprocessing it by removing stop words and words that appear only once across the corpus. We then create a dictionary that maps each remaining word to an integer id, and use it to build the bag-of-words representation of our corpus.

#importing required libraries
import gensim
import pprint
from collections import defaultdict
from gensim import corpora

#a sample corpus
docs = ["Classification of animals plays an important role in science",
"Computer science is a great subject",
"You can pursue to be a data scientist after learning computer science",
"Machine learning is very important while learning data science concepts"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#calculating the frequency of each token across the corpus
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#displaying the dictionary
print(gensim_dictionary)

#creating a sparse bag-of-words vector of (token id, count) pairs for each document
bow_corpus = [gensim_dictionary.doc2bow(text) for text in txts]

#displaying the vector
pprint.pprint(bow_corpus)

Output:
Dictionary(6 unique tokens: ['important', 'science', 'computer', 'is', 'data']...)
[[(0, 1), (1, 1)],
 [(1, 1), (2, 1), (3, 1)],
 [(1, 1), (2, 1), (4, 1), (5, 1)],
 [(0, 1), (1, 1), (3, 1), (4, 1), (5, 2)]]
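
To make the (token id, count) pairs above easier to read, here is a minimal plain-Python sketch of what doc2bow does for the fourth document. The token2id mapping is an assumption inferred from the output above (you can check it against gensim_dictionary.token2id), and doc2bow_sketch is a hypothetical helper written for illustration, not part of Gensim.

```python
from collections import Counter

# Assumed id assignment, inferred from the output above;
# verify it against gensim_dictionary.token2id in your own session.
token2id = {"important": 0, "science": 1, "computer": 2,
            "is": 3, "data": 4, "learning": 5}

def doc2bow_sketch(tokens, token2id):
    """Hypothetical stand-in for doc2bow: count known tokens by id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    # Return (token id, count) pairs sorted by id; unknown tokens are ignored.
    return sorted(counts.items())

# Tokens of the fourth document after lowercasing and stop word removal
doc4 = ["machine", "learning", "is", "very", "important",
        "while", "learning", "data", "science", "concepts"]

bow = doc2bow_sketch(doc4, token2id)
print(bow)
```

The pair (5, 2) records that "learning" (id 5) occurs twice in that document, while every other retained word occurs once.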

We will now train a TF-IDF model on our corpus, i.e., bow_corpus, and use it to transform vectors from one representation to another. Once the tfidf model is initialized, it is treated as a read-only object. Indexing this tfidf object with a vector in the bag-of-words representation (old representation) returns the same document as TF-IDF real-valued weights (new representation).

#importing required library
from gensim import models

#initializing the tfidf model
tfidf = models.TfidfModel(bow_corpus)

#applying the transformation to a single sample bag-of-words vector
doc_bow = [(1, 2), (2, 3)]
print(tfidf[doc_bow])

#applying the transformation to the entire corpus
corpus_tfidf = tfidf[bow_corpus]
for doc in corpus_tfidf:
    print(doc)

Output:
[(2, 1.0)]
[(0, 1.0)]
[(2, 0.7071067811865475), (3, 0.7071067811865475)]
[(2, 0.5773502691896258), (4, 0.5773502691896258), (5, 0.5773502691896258)]
[(0, 0.3779644730092272), (3, 0.3779644730092272), (4, 0.3779644730092272), (5, 0.7559289460184544)]
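
The weights above can be reproduced by hand, which also explains why token id 1 ("science") never appears in the output. The sketch below is plain Python and assumes Gensim's default TfidfModel settings as we understand them: raw counts as term frequency, idf = log2(N / document frequency), and L2 normalization of each vector. tfidf_by_hand is a hypothetical helper for illustration only.

```python
import math

# Bag-of-words corpus copied from the output earlier in this recipe
bow_corpus = [
    [(0, 1), (1, 1)],
    [(1, 1), (2, 1), (3, 1)],
    [(1, 1), (2, 1), (4, 1), (5, 1)],
    [(0, 1), (1, 1), (3, 1), (4, 1), (5, 2)],
]

num_docs = len(bow_corpus)

# Document frequency: in how many documents each token id occurs
doc_freq = {}
for doc in bow_corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

# Assumed default idf: log base 2 of (number of documents / document frequency)
idf = {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}

def tfidf_by_hand(doc):
    """Hypothetical helper: weight counts by idf, drop zeros, L2-normalize."""
    weights = [(t, count * idf.get(t, 0.0)) for t, count in doc]
    weights = [(t, w) for t, w in weights if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []

manual_tfidf = [tfidf_by_hand(doc) for doc in bow_corpus]
for doc in manual_tfidf:
    print(doc)
```

Token id 1 occurs in all four documents, so its idf is log2(4/4) = 0 and it is dropped from every vector. That is why the sample vector [(1, 2), (2, 3)] is reduced to the single entry for id 2, which becomes 1.0 after normalization.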

This is how we can transform documents from the bag-of-words representation to TF-IDF weights using Gensim.
