Explain Doc2Vec model in Gensim

In this recipe, we'll learn to create a Doc2Vec model in Gensim, which generates a vector representation of a group of words taken as a whole.

Recipe Objective: Explain the Doc2Vec model in Gensim

The Doc2Vec model generates a vector representation of a group of words taken as a whole, such as a sentence, paragraph, or document. It does more than just average the vectors of the words in the text. We will use the text8 dataset, which can be downloaded through gensim.downloader, to build document vectors with Doc2Vec as follows:


#importing required libraries
import gensim
import gensim.downloader as api

#downloading the Dataset
dataset = api.load("text8")
data = [d for d in dataset]

#creating tagged documents using models.doc2vec.TaggedDocument()
#each document gets its integer position in the corpus as its tag
def tagged_doc(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
training_data = list(tagged_doc(data))

#printing the first tagged document of the training data
print(training_data[:1])

#initialising the model
dv_model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

#building the vocabulary
dv_model.build_vocab(training_data)

#training the Doc2Vec model
dv_model.train(training_data, total_examples=dv_model.corpus_count, epochs=dv_model.epochs)

#inferring a vector for a new list of tokens
print(dv_model.infer_vector(['describe', 'modern', 'era', 'revolution', 'repudiated']))

Output:
[-2.1419777e-01 -3.4295085e-01 -3.1674471e-01  7.9905950e-02
  1.1792209e-01 -5.5660107e-03  7.0156835e-02 -8.0916628e-02
 -3.0582789e-01 -2.4863353e-01  9.2477903e-02 -3.0935228e-02
 -5.2634442e-01 -3.7851343e-01 -7.9936698e-02  1.3879079e-01
  3.0395445e-01  4.3877283e-01 -4.4444799e-01  2.6140922e-01
 -1.3938751e-02  2.5438294e-01  6.6719547e-02  3.8132364e-01
 -1.8118909e-01 -2.3382125e-02 -3.1091588e-02 -2.3327848e-01
 -1.6785687e-01 -3.4823459e-01  9.0288207e-02 -1.7410168e-02
 -2.2582319e-01 -1.3211270e-01 -4.8467633e-01 -1.8533233e-01
  2.6937298e-02 -3.9798447e-01 -9.2203647e-02  2.9851799e-07]

By feeding a list of words to the trained model, we can infer a vector for any piece of text. The infer_vector() function returns this vector, which can then be compared to other vectors using cosine similarity.
It's worth noting that infer_vector() expects a list of string tokens, which should already have been tokenized in the same way as the words property of the original training document objects.
Because the underlying training and inference algorithms are iterative approximations with inherent randomization, repeated inferences of the same text will yield slightly different vectors.
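
The comparison described above can be sketched in plain Python. This is only an illustration, not part of the recipe: the helper name cosine_similarity and the three short vectors are hypothetical stand-ins for real infer_vector() results, which would have 40 dimensions with the settings used here.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical values: two slightly different inferences of the same text,
# plus a vector standing in for an unrelated text.
first_run = [0.21, -0.34, 0.08, 0.12]
second_run = [0.20, -0.35, 0.09, 0.11]
unrelated = [-0.30, 0.10, 0.40, -0.25]

print(cosine_similarity(first_run, second_run))  # close to 1
print(cosine_similarity(first_run, unrelated))   # much lower
```

Two inferences of the same text should score close to 1 even though their individual components differ, which is why cosine similarity, rather than an exact element-wise match, is the usual way to compare inferred vectors.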
