What are vectors in Gensim

This recipe explains what a vector is and walks through the steps to vectorize text using the Gensim library in Python.
Last Updated: 28 Jul 2022


Recipe Objective: What are vectors in Gensim?

To infer the latent structure in a corpus, we need a way to represent documents that can be manipulated mathematically. One approach represents each document as a vector of features. A single feature can be thought of as a question-answer pair:

   How many times in the document does the word "good" appear? One.
   What is the total number of paragraphs in the document? Two.

Because each question is typically represented only by its integer id, this document is described as a series of pairs: (1, 1.0), (2, 2.0). This type of representation is known as a dense vector. It is dense because it contains an explicit answer to every one of the questions above.

If the questions and their order are known ahead of time, the ids can be left implicit and the representation can be as simple as (1.0, 2.0). The vector for our document is then just the sequence of answers.
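The two forms can be sketched in plain Python using the two example questions above. The names `answers`, `dense_pairs` and `dense_values` are illustrative only and are not part of Gensim.

```python
# Answers to the example questions, keyed by question id (illustrative values):
# 1 -> how many times "good" appears, 2 -> number of paragraphs.
answers = {1: 1.0, 2: 2.0}

# Dense vector written as explicit (question_id, answer) pairs.
dense_pairs = sorted(answers.items())
print(dense_pairs)   # [(1, 1.0), (2, 2.0)]

# If the question order is fixed and known, the ids can be dropped.
dense_values = [value for _, value in dense_pairs]
print(dense_values)  # [1.0, 2.0]
```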

The bag-of-words (BoW) model is another widely used representation. In this model, each document is represented by a vector holding the frequency count of every word in the dictionary.

Let's say we have a dictionary containing the terms ['yellow', 'violet', 'blue', 'green']. The vector [0, 2, 1, 1] would then represent a document containing the string "violet green violet blue". The entries of the vector are the counts of the words "yellow", "violet", "blue" and "green", in that order.
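The colour example can be checked with a few lines of plain Python. This is a minimal sketch that does not use Gensim; `bow_vector` is a hypothetical helper written only for this illustration.

```python
from collections import Counter

# Fixed vocabulary; the position of each word defines its index in the vector.
vocabulary = ["yellow", "violet", "blue", "green"]

def bow_vector(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

vector = bow_vector("violet green violet blue", vocabulary)
print(vector)  # [0, 2, 1, 1]
```

Note that the word order of the original string is lost: only the counts survive, which is why the model is called a "bag" of words.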


    # importing required libraries
    import gensim
    from gensim import corpora

    # creating a sample corpus for demonstration purposes
    txt_corpus = [
        "Find end-to-end projects at ProjectPro",
        "Stop wasting time on different online forums to get your project solutions"]

    # creating a set of frequent words (stopwords)
    stoplist = set('for a of the and to in on of to are at'.split(' '))

    # lowercasing each document, splitting on white space and filtering out the stopwords
    processed_text = [[word for word in document.lower().split() if word not in stoplist]
                      for document in txt_corpus]

    # creating a dictionary that maps each unique token to an integer id
    dictionary = corpora.Dictionary(processed_text)

    # displaying the token-to-id mapping (14 unique tokens, so vectors are 14-dimensional)
    print(dictionary.token2id)

Output:
{'end-to-end': 0, 'find': 1, 'projectpro': 2, 'projects': 3, 'different': 4, 'forums': 5, 'get': 6, 'online': 7, 'project': 8, 'solutions': 9, 'stop': 10, 'time': 11, 'wasting': 12, 'your': 13}

Suppose we want to vectorize the phrase "ProjectPro has many end-to-end projects". We can create the bag-of-words representation of this document using the doc2bow method of the dictionary, which returns a sparse representation of the word counts:

    # the phrase to be vectorised
    doc = "ProjectPro has many end-to-end projects"

    # using doc2bow for vectorization
    vec = dictionary.doc2bow(doc.lower().split())

    # displaying the sparse vector
    print(vec)

Output:
[(0, 1), (2, 1), (3, 1)]

The first element of each tuple is the token's id in the dictionary, and the second is the token's count in the document.
Notice that "has" and "many" are not included in the vector because they did not occur in the original corpus and therefore have no id in the dictionary. We can convert our entire initial corpus to a list of vectors using doc2bow as follows:

    # using doc2bow for vectorization of the entire corpus
    bow_vec = [dictionary.doc2bow(text) for text in processed_text]

    # displaying the list of vectors
    print(bow_vec)

Output:
[[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]
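To relate these sparse vectors to the dense form described earlier, each list of (id, count) pairs can be expanded into a full-length list with zeros for the missing ids. The sketch below does this in plain Python; `sparse_to_dense` is a hypothetical helper, and the sparse vector is copied from the "ProjectPro has many end-to-end projects" output shown earlier. Gensim provides `gensim.matutils.sparse2full` for a similar conversion that returns a NumPy array.

```python
def sparse_to_dense(sparse_vec, num_terms):
    """Expand (token_id, count) pairs into a list of length num_terms."""
    dense = [0] * num_terms
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense

# Sparse vector from the recipe; the dictionary built earlier has 14 tokens.
sparse_vec = [(0, 1), (2, 1), (3, 1)]
dense_vec = sparse_to_dense(sparse_vec, num_terms=14)
print(dense_vec)
# [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With a realistic dictionary of tens of thousands of tokens, almost every entry of the dense form would be zero, which is why the sparse form is preferred.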

This list lives entirely in memory, which is fine for a small example, but most applications need a more scalable solution. Gensim also accepts a corpus as an iterable that returns a single document vector at a time, so the documents can be streamed rather than loaded all at once.
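The streaming idea can be sketched as a class whose `__iter__` method yields one vector per document. To keep the example runnable without Gensim, `token2id` is a small hand-written mapping and `to_bow` is a hypothetical stand-in for the dictionary's doc2bow method; in the recipe's code you would call `dictionary.doc2bow(tokens)` instead. The class name `StreamedCorpus` is also illustrative.

```python
from collections import Counter

# Hand-written stand-in for dictionary.token2id (assumption for this sketch).
token2id = {"end-to-end": 0, "find": 1, "projectpro": 2, "projects": 3}
stoplist = {"at", "to", "on"}

def to_bow(tokens):
    """Stand-in for doc2bow: sorted (id, count) pairs, unknown tokens skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

class StreamedCorpus:
    """Yield one bag-of-words vector per document instead of building a full list."""

    def __init__(self, documents):
        # Any re-iterable source of strings, e.g. a list or a file opened per pass.
        self.documents = documents

    def __iter__(self):
        for document in self.documents:
            tokens = [w for w in document.lower().split() if w not in stoplist]
            yield to_bow(tokens)

corpus = StreamedCorpus([
    "Find end-to-end projects at ProjectPro",
    "ProjectPro has many end-to-end projects",
])

# Only one vector is held in memory at a time during iteration.
for vector in corpus:
    print(vector)
```

Because the vectors are produced on demand, the same pattern works when the documents come from a file that is too large to load at once.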
