How to create a Bag of Words Corpus from a text file in Gensim

In this recipe, we will learn how to create a bag-of-words corpus from a text file with the help of the Gensim library in Python.

Recipe Objective: How to create a Bag of Words Corpus from a text file in Gensim?

In Gensim, a bag-of-words corpus is built on top of a dictionary. Each document is represented as a list of (word ID, frequency) pairs, where the ID comes from the dictionary and the frequency is the number of times that word occurs in the document.
Just as with dictionaries, we can generate a bag-of-words corpus by reading a text file. Take a look at the code below:

#importing required libraries
from gensim.utils import simple_preprocess
from gensim import corpora

#tokenization: each line of the file is treated as one document
with open(r'document.txt', encoding='utf-8') as text_file:
    tokens = [simple_preprocess(sentence, deacc=True) for sentence in text_file]

#creating an empty dictionary; doc2bow fills it in below
gensim_dictionary = corpora.Dictionary()

#creating a bag-of-words corpus; allow_update=True adds unseen words to the dictionary
gensim_corpus = [gensim_dictionary.doc2bow(doc_tokens, allow_update=True) for doc_tokens in tokens]

#displaying the contents in readable format (word IDs replaced by the words themselves)
word_freq = [[(gensim_dictionary[word_id], frequency) for word_id, frequency in document] for document in gensim_corpus]
print(word_freq)

Output:
[[('document', 1), ('is', 1), ('sample', 1), ('this', 1)], [('collection', 1), ('corpus', 1), ('documents', 1), ('make', 1), ('of', 1)], [('document', 1), ('corpus', 1), ('of', 1), ('can', 1), ('convenient', 1), ('for', 1), ('mathematically', 1), ('representation', 1), ('vectorize', 1), ('you', 1), ('your', 1)]]

We have successfully created a bag-of-words corpus from "document.txt". Each inner list in the output corresponds to one line of the file and shows how many times each word occurs in that line.
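
To make the (word ID, frequency) structure more concrete, here is a minimal pure-Python sketch of the same idea. It is not Gensim's implementation: the `tokenize` and `build_bow` helpers and the sample lines are hypothetical, the regex tokenizer is only a rough stand-in for simple_preprocess, and new words are given IDs in alphabetical order within each document to mimic the ordering seen in the output above.

```python
import re
from collections import Counter


def tokenize(line):
    # Rough stand-in for simple_preprocess (an assumption, not Gensim's code):
    # lowercase alphabetic tokens that are 2 to 15 characters long.
    return [t for t in re.findall(r"[a-z]+", line.lower()) if 2 <= len(t) <= 15]


def build_bow(lines):
    # token2id plays the role of the dictionary: word -> integer ID.
    token2id = {}
    corpus = []
    for line in lines:
        counts = Counter(tokenize(line))
        # Give each unseen word the next free ID, in alphabetical order.
        for word in sorted(counts):
            if word not in token2id:
                token2id[word] = len(token2id)
        # One document = a list of (word ID, frequency) pairs sorted by ID.
        corpus.append(sorted((token2id[w], c) for w, c in counts.items()))
    return token2id, corpus


# Hypothetical sample input: one document per line.
sample_lines = [
    "This is a sample document",
    "A document is a document",
]

token2id, corpus = build_bow(sample_lines)
print(token2id)
print(corpus)
```

The corpus itself stores only integer IDs and counts. The words are recovered through the dictionary, which is what the `word_freq` step in the recipe does with `gensim_dictionary`.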
