How to create a dictionary from a corpus using Gensim

This recipe walks through the steps to create a dictionary from a text corpus using Gensim.

Recipe Objective: How to create a dictionary from a corpus using Gensim?

We want to assign a unique integer ID to each word in the corpus. The gensim.corpora.Dictionary class does this by mapping every distinct token to an integer ID. The resulting dictionary defines the full vocabulary that later processing steps recognize.
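To make the idea of a word-to-ID mapping concrete, here is a minimal pure-Python sketch. The helper `build_token_ids` and the tiny `docs` list are hypothetical and exist only for illustration; this is not how Gensim implements its dictionary, and Gensim may assign IDs in a different order.

```python
# Hypothetical helper (not part of Gensim): gives each distinct token
# the next free integer ID, in the order tokens are first seen.
def build_token_ids(tokenized_docs):
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

# Two tiny, already tokenized documents (illustrative data only)
docs = [["find", "projects"], ["projects", "solve", "problem"]]
token_ids = build_token_ids(docs)
print(token_ids)
# {'find': 0, 'projects': 1, 'solve': 2, 'problem': 3}
```

The repeated word "projects" receives a single ID, which is the property the dictionary is meant to guarantee.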

# Importing the required module (the bare "import gensim" is not needed here)
from gensim import corpora

# Creating a sample corpus for demonstration purposes
txt_corpus = [
    "Find end-to-end projects at ProjectPro",
    "Stop wasting time on different online forums to get your project solutions",
    "Each of our projects solve a real business problem from start to finish",
    "All projects come with downloadable solution code and explanatory videos",
    "All our projects are designed modularly so you can rapidly learn and reuse modules",
]

# Creating a set of common words (stop words) to filter out
stoplist = set('for a of the and to in on are at'.split())

# Lowercasing each document, splitting on whitespace, and filtering out the stop words
processed_text = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in txt_corpus
]

#creating a dictionary
dictionary = corpora.Dictionary(processed_text)

#displaying the dictionary
print(dictionary)

Output:
Dictionary(40 unique tokens: ['end-to-end', 'find', 'projectpro', 'projects', 'different']...)

Because our corpus is small, this gensim.corpora.Dictionary contains only 40 unique tokens. For larger corpora, dictionaries with hundreds of thousands of tokens are common.
