How to create a dictionary from multiple text files using Gensim

In this recipe, we will learn to make a dictionary out of multiple text files. A Gensim dictionary maps each unique token (word) to a unique integer id.
Last Updated: 23 Jun 2022


Recipe Objective: How to create a dictionary from multiple text files using Gensim?

Let's make a dictionary out of multiple text files placed in the same directory. For this example, we created three text files, doc1.txt, doc2.txt, and doc3.txt, each containing the three lines from the previous example's text file (document.txt). All three files are saved in one shared directory, XYZ. We need a class with a method that cycles through every file in the XYZ directory and returns the processed list of word tokens for each line. Take a look at the script below:

# importing required libraries
from gensim.utils import simple_preprocess
from gensim import corpora
import os

# creating a class for reading multiple files
class read_multiplefiles(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for filename in os.listdir(self.dir_path):
            # the with block closes each file once it has been read
            with open(os.path.join(self.dir_path, filename), encoding='utf-8') as f:
                for line in f:
                    yield simple_preprocess(line)

# providing the path of the directory
directory_path = r"C:\XYZ"

# creating a dictionary from multiple files in the directory provided
gensim_dictionary = corpora.Dictionary(read_multiplefiles(directory_path))

# displaying the tokens in the dictionary
print(gensim_dictionary.token2id)

Output:
{'document': 0, 'is': 1, 'sample': 2, 'this': 3, 'collection': 4, 'corpus': 5, 'documents': 6, 'make': 7, 'of': 8, 'can': 9, 'convenient': 10, 'for': 11, 'mathematically': 12, 'representation': 13, 'vectorize': 14, 'you': 15, 'your': 16}

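Within the __iter__ method, we iterate through all of the files in the directory and read each file line by line. For each line, the simple_preprocess function lowercases the text and splits it into a list of tokens. The "yield" keyword hands the tokens for each line back to the caller one line at a time, so no file has to be loaded into memory in full. corpora.Dictionary consumes these token lists and assigns an integer id to each unique token. Because all three files contain the same lines, each token appears only once in the output.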
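The streaming pattern can also be tried without Gensim installed. The sketch below is only an illustration under stated assumptions: ReadTxtFiles and build_token2id are hypothetical names, the regular-expression tokenizer is a rough stand-in for simple_preprocess, build_token2id is a simplified stand-in for corpora.Dictionary rather than its actual implementation, and the sample sentences are made up. It also sorts the file names and skips anything that is not a .txt file, since os.listdir returns every entry in the directory in arbitrary order.

```python
import os
import re
import tempfile


class ReadTxtFiles:
    """Stream one token list per line from every .txt file in a directory."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # sorted() makes the file order deterministic; os.listdir() does not.
        for filename in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, filename)
            # Skip subdirectories and files that are not .txt files.
            if not (os.path.isfile(path) and filename.endswith(".txt")):
                continue
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    # Stand-in tokenizer (assumption): lowercase runs of two
                    # or more letters; only approximates simple_preprocess.
                    yield re.findall(r"[a-z]{2,}", line.lower())


def build_token2id(documents):
    """Give each unseen token the next free integer id (illustration only)."""
    token2id = {}
    for tokens in documents:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Hypothetical sample data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("doc1.txt", "doc2.txt"):
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write("This is a sample document\n")
            f.write("A corpus is a collection of documents\n")
    # This file is ignored by the reader because it is not a .txt file.
    with open(os.path.join(tmp_dir, "notes.md"), "w", encoding="utf-8") as f:
        f.write("ignored content\n")

    reader = ReadTxtFiles(tmp_dir)
    token2id = build_token2id(reader)
    # __iter__ starts a fresh pass each time, so the reader can be reused.
    line_count = sum(1 for _ in reader)

print(token2id)
print(line_count)
```

Because __iter__ opens the files afresh on every pass, the same reader object can be handed to more than one consumer, which a one-shot generator function would not allow.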
