How to create a dictionary from multiple text files using Gensim

In this recipe, we will learn to make a dictionary out of multiple text files. A Gensim dictionary maps each unique token (word) to a unique integer id.

Recipe Objective: How to create a dictionary from multiple text files using Gensim?

Let's make a dictionary out of multiple text files placed in the same directory. For this example, we created three text files, doc1.txt, doc2.txt, and doc3.txt, each containing the three lines from the previous example's text file (document.txt). All three files are saved in one shared directory, XYZ. We need a class whose __iter__ method cycles through every text file in the XYZ directory and yields a processed list of word tokens for each line. Take a look at the script below:

#importing required libraries
from gensim.utils import simple_preprocess
from gensim import corpora
import os

#creating a class for reading multiple files
class read_multiplefiles(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        #sorted() keeps the file order the same on every run
        for filename in sorted(os.listdir(self.dir_path)):
            file_path = os.path.join(self.dir_path, filename)
            #the with block closes each file once it has been read
            with open(file_path, encoding='utf-8') as text_file:
                for line in text_file:
                    #yield one list of tokens per line
                    yield simple_preprocess(line)

#providing the path of the directory
directory_path = r"C:\XYZ"

#creating a dictionary from multiple files in the directory provided
gensim_dictionary = corpora.Dictionary(read_multiplefiles(directory_path))

#displaying the tokens in the dictionary
print(gensim_dictionary.token2id)

Output:
{'document': 0, 'is': 1, 'sample': 2, 'this': 3, 'collection': 4, 'corpus': 5, 'documents': 6, 'make': 7, 'of': 8, 'can': 9, 'convenient': 10, 'for': 11, 'mathematically': 12, 'representation': 13, 'vectorize': 14, 'you': 15, 'your': 16}

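The __iter__ method loops over every file in the directory and reads each file line by line. For each line, the simple_preprocess function lowercases the text and splits it into tokens. The "yield" keyword hands the token list for that line back to the caller, one line at a time, so the files never have to be loaded into memory all at once. corpora.Dictionary consumes these token lists, treating each line as one document, and assigns an id to every unique token. Because all three files contain the same three lines, the dictionary holds only the 17 unique tokens shown above.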
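To see the streaming pattern on its own, here is a minimal sketch that runs with only the Python standard library. It is not Gensim code: tokenize is a rough stand-in for simple_preprocess, and build_token2id is a hypothetical helper that imitates the id order seen in the output above (new tokens of each line sorted alphabetically, then numbered in sequence), which is an assumption based on that output. The file names and sample sentences are made up for illustration.

```python
import os
import re
import tempfile


class ReadMultipleFiles:
    """Stream one token list per line from every .txt file in a directory."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # sorted() makes the file order deterministic
        for filename in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, filename)
            if not (os.path.isfile(path) and filename.endswith(".txt")):
                continue
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield tokenize(line)


def tokenize(line):
    # Rough stand-in for simple_preprocess (assumption): lowercase,
    # letters only, keep tokens of 2 to 15 characters
    return [t for t in re.findall(r"[a-z]+", line.lower()) if 2 <= len(t) <= 15]


def build_token2id(token_lists):
    # Hypothetical stand-in for corpora.Dictionary(...).token2id
    token2id = {}
    for tokens in token_lists:
        new_tokens = sorted({t for t in tokens if t not in token2id})
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id


def demo():
    # Made-up sample files written to a temporary directory
    samples = {"a.txt": "Red apples\n", "b.txt": "Green apples and pears\n"}
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in samples.items():
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        reader = ReadMultipleFiles(tmp)
        first_pass = list(reader)
        second_pass = list(reader)  # __iter__ restarts, so a second pass works
        return build_token2id(reader), first_pass == second_pass


token2id, repeatable = demo()
print(token2id)
# {'apples': 0, 'red': 1, 'and': 2, 'green': 3, 'pears': 4}
```

Wrapping the loop in a class with __iter__, rather than a plain generator function, means the files can be read again from the start each time the object is iterated, which helps when a later step needs more than one pass over the same text.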
