How to create a dictionary from a single text file using Gensim

In this recipe, we will learn to make a dictionary out of a single text file. A Gensim dictionary maps each unique token (word) to a unique integer id.

Recipe Objective: How to create a dictionary from a single text file using Gensim?

In this example, we build a dictionary from a single text file. The same approach also works for many text files, such as a directory of files.

We do this by saving the document used in the previous example in a text file called document.txt. Python's file iterator reads the file one line at a time, and Gensim's simple_preprocess function tokenizes each line individually. Because the lines are streamed into the dictionary, the entire file never has to be loaded into memory at once.

#importing required libraries
from gensim import corpora
from gensim.utils import simple_preprocess

#creating a dictionary from a single text file
#the file is streamed line by line and closed once the dictionary is built
with open('document.txt', encoding='utf-8') as f:
  gensim_dictionary = corpora.Dictionary(simple_preprocess(line, deacc=False) for line in f)

#displaying contents of the dictionary
print(f"The dictionary has: {len(gensim_dictionary)} tokens")
for k, v in gensim_dictionary.token2id.items():
  print(f'{k:15} {v:10}')

Output:
The dictionary has: 17 tokens
document                 0
is                       1
sample                   2
this                     3
collection               4
corpus                   5
documents                6
make                     7
of                       8
can                      9
convenient              10
for                     11
mathematically          12
representation          13
vectorize               14
you                     15
your                    16
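The ids in the output above come in alphabetical runs (document to this, collection to of, can to your) rather than in one alphabetical sequence. That pattern is consistent with new tokens being numbered in sorted order, one document at a time. The pure-Python sketch below mimics that behaviour under this assumption; build_token2id is a hypothetical helper written for illustration and is not Gensim's implementation.

```python
def build_token2id(documents):
    """Toy stand-in for a token-to-id mapping (not Gensim's actual code)."""
    token2id = {}
    for tokens in documents:
        # Assumption: unseen tokens of each document are numbered in sorted order
        for token in sorted(set(tokens)):
            if token not in token2id:
                # The next free id is simply the number of ids handed out so far
                token2id[token] = len(token2id)
    return token2id

# Hypothetical tokenized lines, not the contents of document.txt
docs = [["this", "is", "sample"], ["sample", "of", "corpus"]]
token2id = build_token2id(docs)
print(token2id)  # {'is': 0, 'sample': 1, 'this': 2, 'corpus': 3, 'of': 4}
```

A token that was already seen, such as "sample" in the second list, keeps its original id.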

We read the text file "document.txt" line by line and pass each line to the simple_preprocess function, which returns a list of tokens for that line. The dictionary is then built from these token lists.
