How to convert string to vectors in Gensim

In this recipe, you will learn step by step how to convert a string to a vector using the Gensim library in Python.

Recipe Objective: How to convert string to vector in Gensim?

You can convert a string to a vector using Gensim by following the steps below:

#importing required libraries
from gensim import corpora
from collections import defaultdict

#a sample corpus
docs = ["This is sample document",
"Collection of documents make a corpus",
"You can vectorize your corpus for a mathematically convenient representation of a document"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#calculating the frequency of each token across the corpus
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

print(gensim_dictionary.token2id)

#now we convert tokenized documents to vectors

#creating a new document
doc_new = "corpus consists document"

#creating a vector
vec = gensim_dictionary.doc2bow(doc_new.lower().split())

#displaying the vector
print("\n",vec)

Output:
{'document': 0, 'corpus': 1}

 [(0, 1), (1, 1)]

The function doc2bow() returns a sparse bag-of-words vector: a list of (token ID, count) pairs. The word "consists" was not present in our original corpus, so it is not in the dictionary; doc2bow() ignores it, and the sparse vector has no entry for it. You can vectorize the entire corpus as follows:

#vectorizing the original corpus
gensim_corpus = [gensim_dictionary.doc2bow(text) for text in txts]

#displaying the vectors
print(gensim_corpus)

Output:
[[(0, 1)], [(1, 1)], [(0, 1), (1, 1)]]

The output can be read as follows: the word "document" (ID 0) appears once in the first and third documents, and the word "corpus" (ID 1) appears once in the second and third documents.
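
To make that reading concrete, here is a minimal plain-Python sketch that expands each sparse vector into a fixed-length dense one, with one slot per token ID. The helper name to_dense is hypothetical and not part of Gensim (Gensim's matutils module offers its own conversion helpers, such as sparse2full); the sparse vectors are copied from the output above.

```python
def to_dense(bow, num_terms):
    """Expand a list of (token_id, count) pairs into a fixed-length list."""
    dense = [0] * num_terms
    for token_id, count in bow:
        dense[token_id] = count
    return dense

# Sparse vectors from the output above; the dictionary holds 2 tokens.
sparse_corpus = [[(0, 1)], [(1, 1)], [(0, 1), (1, 1)]]
num_terms = 2

dense_corpus = [to_dense(bow, num_terms) for bow in sparse_corpus]
print(dense_corpus)  # [[1, 0], [0, 1], [1, 1]]
```

Each row is one document and each column is one token ID, so the zeros that the sparse form leaves out become explicit.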
