Creating Bag of Words Corpus from multiple text files in Gensim

This recipe explains what a bag of words and a word corpus are, and shows how to create a bag-of-words corpus from multiple text files using Gensim in Python.
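As a quick illustration of the idea before turning to Gensim: a bag of words records how often each word occurs in a document and ignores word order. The sketch below uses only the Python standard library. The sample sentence and the `bag_of_words` helper are made up for illustration and are not part of Gensim.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


# Hypothetical sample sentence; word order is discarded, only counts remain.
bow = bag_of_words("the cat sat on the mat")
print(sorted(bow.items()))
# [('cat', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 2)]
```

Gensim stores the same information more compactly: each word is replaced by an integer id from a dictionary, so a document becomes a list of (id, count) pairs.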

Recipe Objective: How to create a Bag of Words Corpus from multiple text files using Gensim?

You can create a bag-of-words corpus from multiple text files as follows:

#importing required libraries
from gensim.utils import simple_preprocess
from gensim import corpora
import os

#creating a class for reading multiple files
class read_multiplefiles(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for filename in os.listdir(self.dir_path):
            #opening each file and yielding one list of tokens per line
            with open(os.path.join(self.dir_path, filename), encoding='utf-8') as text_file:
                for line in text_file:
                    yield simple_preprocess(line)

#providing the path of the directory
directory_path = r"C:\abc"

#creating a dictionary
gensim_dictionary = corpora.Dictionary()

#creating a bag-of-words corpus from multiple files in the directory provided
#allow_update=True adds tokens not seen before to the dictionary
gensim_corpus = [gensim_dictionary.doc2bow(tokens, allow_update=True) for tokens in read_multiplefiles(directory_path)]

#displaying the contents in readable format
word_frequencies = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in document] for document in gensim_corpus]
print(word_frequencies)

Output (the exact result depends on the text files in your directory):
[[('document', 1), ('is', 1), ('sample', 1), ('this', 1)], [('collection', 1), ('corpus', 1), ('documents', 1), ('make', 1), ('of', 1)], [('document', 1), ('corpus', 1), ('of', 1), ('can', 1), ('convenient', 1), ('for', 1), ('mathematically', 1), ('representation', 1), ('vectorize', 1), ('you', 1), ('your', 1)]]

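The __iter__ method iterates over all the files in the directory and reads each file line by line. For each line, the simple_preprocess function generates a list of lowercase tokens, and the "yield" keyword hands that list back to the caller one line at a time, so a whole file never has to be held in memory. Each line is therefore treated as a separate document. The bag-of-words corpus is then built by passing each token list to doc2bow; because allow_update=True is set, tokens that have not been seen before are added to the dictionary along the way.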
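To make these mechanics easier to follow without installing Gensim, here is a minimal sketch that uses only the standard library. It is not Gensim's implementation. The `LineTokenStream` class, the `to_bow` helper, the regex tokenizer and the two sample files are all hypothetical stand-ins for `read_multiplefiles`, `doc2bow(..., allow_update=True)`, `simple_preprocess` and your own directory.

```python
import os
import re
import tempfile
from collections import Counter


class LineTokenStream:
    """Yield one token list per line for every file in a directory."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # sorted() makes the file order deterministic; os.listdir() alone does not.
        for filename in sorted(os.listdir(self.dir_path)):
            path = os.path.join(self.dir_path, filename)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    # Crude stand-in for simple_preprocess: lowercase alphabetic runs.
                    yield re.findall(r"[a-z]+", line.lower())


def to_bow(tokens, token2id):
    """Stand-in for doc2bow with allow_update=True: unseen tokens get new ids."""
    counts = Counter(tokens)
    for token in sorted(counts):
        token2id.setdefault(token, len(token2id))
    return sorted((token2id[token], n) for token, n in counts.items())


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two hypothetical sample files, invented for this example.
    samples = {
        "a.txt": "This is the first sample document\n",
        "b.txt": "The corpus is the collection of documents\n",
    }
    for name, text in samples.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    token2id = {}
    corpus = [to_bow(tokens, token2id) for tokens in LineTokenStream(tmp_dir)]

print(corpus)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
#  [(2, 1), (4, 2), (6, 1), (7, 1), (8, 1), (9, 1)]]
```

The second document reuses the ids already assigned to "is" (2) and "the" (4), and "the" appears with a count of 2. This is the same (id, count) structure that the Gensim corpus above contains.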
