How can we add more tokens to an existing dictionary in Gensim?

In this recipe, we will learn how to add more tokens to an existing dictionary with the help of the Gensim library in Python.

Recipe Objective: How can we add more tokens to an existing dictionary in Gensim?

Let's look at how we can use a new document to add more tokens to an existing dictionary. The script below builds a dictionary from a small corpus and then extends it with add_documents():

#importing required libraries
import gensim
from gensim import corpora
from pprint import pprint

#creating a sample corpus for demonstration purpose
txt_corpus = ["This is sample document",
"Collection of documents make a corpus",
"You can vectorize your corpus for a mathematically convenient representation of a document"]
#tokenization (splitting each sentence on whitespace)
tokens = [[token for token in sentence.split()] for sentence in txt_corpus]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(tokens)

#displaying contents of the dictionary
print("The dictionary had: " +str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)

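#new document whose tokens we want to add to the dictionary
text = ["Model is an algorithm for transforming vectors from one representation to another"]

#tokenizing the new document the same way as the original corpus
tokens2 = [[token for token in sentence.split()] for sentence in text]
#updating the existing dictionary in place with the new tokens
gensim_dictionary.add_documents(tokens2)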

print("\nThe dictionary now has: " + str(len(gensim_dictionary)) + " tokens after adding new documents")
print(gensim_dictionary.token2id)

Output:
The dictionary had: 18 tokens
{'This': 0, 'document': 1, 'is': 2, 'sample': 3, 'Collection': 4, 'a': 5, 'corpus': 6, 'documents': 7, 'make': 8, 'of': 9, 'You': 10, 'can': 11, 'convenient': 12, 'for': 13, 'mathematically': 14, 'representation': 15, 'vectorize': 16, 'your': 17}

The dictionary now has: 27 tokens after adding new documents
{'This': 0, 'document': 1, 'is': 2, 'sample': 3, 'Collection': 4, 'a': 5, 'corpus': 6, 'documents': 7, 'make': 8, 'of': 9, 'You': 10, 'can': 11, 'convenient': 12, 'for': 13, 'mathematically': 14, 'representation': 15, 'vectorize': 16, 'your': 17, 'Model': 18, 'algorithm': 19, 'an': 20, 'another': 21, 'from': 22, 'one': 23, 'to': 24, 'transforming': 25, 'vectors': 26}
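The id pattern in the output above (existing ids unchanged, new tokens numbered from 18 onward in sorted order) can be illustrated with a simplified pure-Python sketch. The helper add_documents_sketch is hypothetical and only mimics the behavior visible in the output; it is not Gensim's actual implementation, and it ignores the frequency statistics a real Dictionary also tracks.

```python
def add_documents_sketch(token2id, documents):
    """Hypothetical helper: extend a token -> id mapping in place.

    Assumption: unseen tokens of each document receive the next free
    integer ids in sorted order, which matches the output shown above.
    """
    for document in documents:
        # Tokens of this document that have no id yet, in sorted order
        new_tokens = sorted(set(document) - set(token2id))
        for token in new_tokens:
            # The next free id equals the current size of the mapping
            token2id[token] = len(token2id)
    return token2id


token2id = {}
add_documents_sketch(token2id, [["This", "is", "sample", "document"]])
print(token2id)

# "is" is already known, so only the three unseen tokens get new ids
add_documents_sketch(token2id, [["Model", "is", "an", "algorithm"]])
print(token2id)
```

Because the ids of existing tokens never change in this scheme, vectors built before the update remain valid afterwards.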

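This is how we can add more tokens to an existing dictionary in Gensim: tokenize the new documents and pass them to add_documents(), which updates the dictionary in place, keeps the ids of existing tokens, and assigns new ids to tokens it has not seen before.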
