What are vectors in Gensim

This recipe explains what a vector is and shows how to vectorize text using the Gensim library in Python.

Recipe Objective: What are vectors in Gensim?

To infer the latent structure in our corpus, we need a way to represent documents that can be manipulated mathematically. One approach represents each document as a vector of features. A single feature can be thought of as a question-answer pair:

   How many times in the document does the word "good" appear? One.
   What is the total number of paragraphs in the document? Two.

Because each question is typically represented only by its integer id, this document is described as a series of pairs: (1, 1.0), (2, 2.0). This type of representation is known as a dense vector. Why is it dense? Because it contains an explicit answer to every one of the questions above.

If we know all of the questions ahead of time, we can leave them implicit and represent the document simply as (1.0, 2.0). This sequence of answers is the vector for our document.
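The question-and-answer idea above can be sketched in plain Python. This is only an illustration: the sample text and the helper names `count_word` and `count_paragraphs` are assumptions made for this example, not part of Gensim.

```python
# Hypothetical sample text: "good" appears once and there are two paragraphs.
document = "This is a good start.\n\nThe second paragraph ends the document."

def count_word(text, word):
    """Count occurrences of `word`, ignoring case and trailing punctuation."""
    return sum(1 for tok in text.lower().split() if tok.strip('.,!?"') == word)

def count_paragraphs(text):
    """Count non-empty blocks separated by blank lines."""
    return sum(1 for block in text.split("\n\n") if block.strip())

# Each feature is a question, keyed by an integer id as in the text.
features = {
    1: lambda t: float(count_word(t, "good")),
    2: lambda t: float(count_paragraphs(t)),
}

# (id, answer) pairs, then the answers alone once the questions are implicit.
pairs = [(fid, ask(document)) for fid, ask in sorted(features.items())]
dense = [answer for _, answer in pairs]

print(pairs)  # [(1, 1.0), (2, 2.0)]
print(dense)  # [1.0, 2.0]
```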

The bag-of-words (BoW) model is another widely used representation. In this technique, each document is represented by a vector holding the frequency count of every word in the dictionary.

Let's say we have a dictionary containing the terms ['yellow', 'violet', 'blue', 'green']. The vector [0, 2, 1, 1] would then represent a document containing the string "violet green violet blue". The vector's entries are, in dictionary order, the number of occurrences of the words "yellow", "violet", "blue", and "green".
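The colour example can be reproduced by hand with the standard library. This is a minimal sketch that does not use Gensim; the names `vocabulary`, `dense_bow`, and `sparse_bow` are chosen here for illustration.

```python
from collections import Counter

# Fixed dictionary order, as in the example above.
vocabulary = ["yellow", "violet", "blue", "green"]
document = "violet green violet blue"

# Count every token in the document once.
counts = Counter(document.split())

# Dense bag-of-words: one count per dictionary word, zeros included.
dense_bow = [counts[word] for word in vocabulary]
print(dense_bow)  # [0, 2, 1, 1]

# Sparse form: keep only (position, count) pairs with a non-zero count.
sparse_bow = [(i, c) for i, c in enumerate(dense_bow) if c > 0]
print(sparse_bow)  # [(1, 2), (2, 1), (3, 1)]
```

The sparse form drops the zero entry for "yellow", which is the same idea behind the (id, count) pairs that Gensim produces.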


#importing required libraries
import gensim
from gensim import corpora

#creating a sample corpus for demonstration purpose
txt_corpus = [
"Find end-to-end projects at ProjectPro",
"Stop wasting time on different online forums to get your project solutions"]

# Creating a set of frequent words (stop words) to filter out
stoplist = set('for a of the and to in on are at'.split())

# Lowercasing each document, splitting on white space and filtering out the stop words
processed_text = [[word for word in document.lower().split() if word not in stoplist] for document in txt_corpus]

#creating a dictionary
dictionary = corpora.Dictionary(processed_text)

#displaying the token-to-id mapping; with 14 unique tokens, each document maps to a 14-dimensional vector
print(dictionary.token2id)

Output:
{'end-to-end': 0, 'find': 1, 'projectpro': 2, 'projects': 3, 'different': 4, 'forums': 5, 'get': 6, 'online': 7, 'project': 8, 'solutions': 9, 'stop': 10, 'time': 11, 'wasting': 12, 'your': 13}

Suppose we want to vectorize the phrase "ProjectPro has many end-to-end projects". We can create the bag-of-words representation of a document using the dictionary's doc2bow method, which returns a sparse representation of the word counts, as follows.

#the phrase to be vectorised
doc = "ProjectPro has many end-to-end projects"

#using doc2bow for vectorization
vec = dictionary.doc2bow(doc.lower().split())

#displaying the sparse vector
print(vec)

Output:
[(0, 1), (2, 1), (3, 1)]

The first element of each tuple is the token's ID in the dictionary, and the second is the token's count in the document.
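To see how these sparse pairs relate to a full 14-dimensional vector, they can be expanded by hand. This is a small sketch in plain Python; `to_dense` is a helper written for this example, not a Gensim function.

```python
def to_dense(sparse_vec, num_terms):
    """Expand (token_id, count) pairs into a list of length num_terms."""
    dense = [0] * num_terms
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense

# Sparse vector printed above, and the dictionary size (14 tokens).
vec = [(0, 1), (2, 1), (3, 1)]
dense_vec = to_dense(vec, 14)

print(dense_vec)  # [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Most entries are zero, which is why storing only the non-zero pairs saves memory.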
Notice that "has" and "many" are missing from the vector: they do not occur in the original corpus, so the dictionary has no ID for them. We can convert the entire original corpus to a list of vectors using doc2bow as follows:

#using doc2bow for vectorization of the entire corpus
bow_vec = [dictionary.doc2bow(text) for text in processed_text]

#displaying the list of vectors
print(bow_vec)

Output:
[[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]

This list is held entirely in memory, which is fine for a small example, but most applications need a more scalable approach. Gensim also lets you stream a corpus: it can be any iterable that returns a single document vector at a time.
