How to create a BoW corpus from In-Memory Objects in Gensim

In this recipe, we will learn how to create a Bag of Words Corpus from In-Memory Objects using the Gensim library.

Recipe Objective: How to create Bag of Words Corpus from In-Memory Objects in Gensim?

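In Gensim, a bag-of-words corpus is built on top of a dictionary: each document is represented as a list of (word ID, frequency) pairs, where the ID comes from the dictionary and the frequency is the number of times the word occurs in that document.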

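In the script below, we split each text into tokens and then use the corpora module to create a Dictionary object, which maps every unique token to an integer ID. The object has a method called doc2bow that converts a tokenized document into a list of (ID, frequency) pairs. It effectively does two things:

   It iterates through all of the words in the document, incrementing the frequency count each time a word has already been seen in that document.
   Otherwise, the word is recorded with a frequency count of one. If the word is not yet in the dictionary, it is added with a new ID only when allow_update=True is passed; by default, unknown words are ignored.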
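
The two steps above can be sketched in plain Python. This is a simplified stand-in meant only to illustrate the counting logic, not Gensim's actual implementation; the names `doc2bow_sketch` and `token2id` are hypothetical and chosen for this example.

```python
from collections import Counter


def doc2bow_sketch(document, token2id, allow_update=False):
    """Convert a list of tokens into sorted (token_id, count) pairs.

    Simplified illustration only; not Gensim's actual implementation.
    """
    # Count how many times each token occurs in this document.
    counts = Counter(document)
    if allow_update:
        # Give the next free ID to every token the mapping has not seen yet.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
    # Tokens that are still missing from the mapping are skipped.
    return sorted((token2id[t], c) for t, c in counts.items() if t in token2id)


token2id = {}
docs = [["This", "is", "sample", "document"],
        ["a", "corpus", "of", "a", "document"]]
bows = [doc2bow_sketch(doc, token2id, allow_update=True) for doc in docs]
print(bows)

# Without allow_update, a word the mapping does not know is dropped.
unseen_bow = doc2bow_sketch(["unseen", "document"], token2id)
print(unseen_bow)
```

In this sketch the word "a" occurs twice in the second document, so its pair carries a count of 2, while "unseen" never receives an ID because the mapping is not updated on the last call.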

#importing required libraries
from gensim import corpora

#creating a sample corpus for demonstration purpose
txt_corpus = ["This is sample document",
              "Collection of documents make a corpus",
              "You can vectorize your corpus for a mathematically convenient representation of a document"]

#tokenisation: split each sentence on whitespace
tokens = [sentence.split() for sentence in txt_corpus]

#creating a dictionary that maps each unique token to an integer ID
gensim_dictionary = corpora.Dictionary(tokens)

#creating a bag-of-words corpus: one list of (ID, frequency) pairs per document
#(allow_update=True is redundant here because the dictionary already contains every token)
gensim_corpus = [gensim_dictionary.doc2bow(doc, allow_update=True) for doc in tokens]

#displaying the contents
print("Output:\n",gensim_corpus)

#displaying the contents in readable format
word_freq = [[(gensim_dictionary[token_id], frequency) for token_id, frequency in doc] for doc in gensim_corpus]
print("\nOutput in a readable format:\n",word_freq)

Output:
Output:
 [[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(1, 1), (5, 2), (6, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]]

Output in a readable format:
 [[('This', 1), ('document', 1), ('is', 1), ('sample', 1)], [('Collection', 1), ('a', 1), ('corpus', 1), ('documents', 1), ('make', 1), ('of', 1)], [('document', 1), ('a', 2), ('corpus', 1), ('of', 1), ('You', 1), ('can', 1), ('convenient', 1), ('for', 1), ('mathematically', 1), ('representation', 1), ('vectorize', 1), ('your', 1)]]

The first tuple (0, 1) in the above output means that the word with ID 0 ('This') occurred once in the first document. Similarly, (5, 2) in the third list means that the word with ID 5 ('a') appeared twice in the third document. The second output replaces each ID with its word to make the counts easier to read.
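
To make the reading of these pairs concrete, the sketch below hard-codes the ID-to-word mapping and the third document's vector from the output above, looks up which word is repeated, and expands the sparse pairs into a dense count vector with one slot per word ID. The names `id2token`, `third_doc_bow`, and `to_dense` are illustrative and not part of Gensim.

```python
# ID-to-word mapping copied from the readable output above (hard-coded for illustration).
id2token = {
    0: "This", 1: "document", 2: "is", 3: "sample",
    4: "Collection", 5: "a", 6: "corpus", 7: "documents", 8: "make", 9: "of",
    10: "You", 11: "can", 12: "convenient", 13: "for", 14: "mathematically",
    15: "representation", 16: "vectorize", 17: "your",
}

# Bag-of-words vector of the third document, copied from the output above.
third_doc_bow = [(1, 1), (5, 2), (6, 1), (9, 1), (10, 1), (11, 1), (12, 1),
                 (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]


def to_dense(bow, vocab_size):
    """Expand sparse (id, count) pairs into a list with one slot per word ID."""
    dense = [0] * vocab_size
    for token_id, count in bow:
        dense[token_id] = count
    return dense


# Words that occur more than once in the third document.
repeated = [(id2token[token_id], count) for token_id, count in third_doc_bow if count > 1]
print(repeated)  # [('a', 2)]

# Dense vector: position i holds the count of the word with ID i.
dense_vector = to_dense(third_doc_bow, len(id2token))
print(dense_vector)
```

The counts in the dense vector add up to 13, the number of tokens in the third sentence, which is a quick sanity check that no word was lost in the conversion.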
