How to create BoW from In Memory Objects in Gensim

In this recipe, we will learn how to create a Bag of Words Corpus from In-Memory Objects using the Gensim library.
Last Updated: 28 Jul 2022

Recipe Objective: How to create Bag of Words Corpus from In-Memory Objects in Gensim?

A bag-of-words corpus in Gensim is built on a Dictionary, which maps each unique word to an integer ID. Each document is then represented as a list of (word ID, frequency) pairs.

In the script below, we split each sentence of the sample text into tokens. We then use the corpora module to create a Dictionary object. The object has a method called doc2bow that effectively does two things for each document:

It iterates through all of the words in the document, incrementing the frequency count for a word each time it occurs.
If a word is not yet in the dictionary and allow_update=True is passed, the word is added to the dictionary under a new ID, and its frequency count starts at one.
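The counting behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not Gensim's actual implementation: `doc_to_bow` and its `token2id` argument are hypothetical names chosen for this example.

```python
from collections import Counter


def doc_to_bow(tokens, token2id, allow_update=False):
    """Hypothetical helper that mimics the bag-of-words counting idea."""
    # Count how often each token occurs in this document.
    counts = Counter(tokens)
    if allow_update:
        # Assign the next free integer ID to every token not seen before.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
    # Keep only known tokens and return (id, count) pairs sorted by ID.
    return sorted((token2id[t], c) for t, c in counts.items() if t in token2id)


token2id = {}
bow = doc_to_bow("a corpus of a document".split(), token2id, allow_update=True)
print(token2id)  # {'a': 0, 'corpus': 1, 'document': 2, 'of': 3}
print(bow)       # [(0, 2), (1, 1), (2, 1), (3, 1)]
```

Without `allow_update=True`, this sketch skips any token that has no ID yet, so the mapping stays unchanged.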

# importing required libraries
import gensim
from gensim import corpora
from pprint import pprint

# creating a sample corpus for demonstration purpose
txt_corpus = [
    "This is sample document",
    "Collection of documents make a corpus",
    "You can vectorize your corpus for a mathematically convenient representation of a document",
]

# tokenisation
tokens = [[token for token in sentence.split()] for sentence in txt_corpus]

# creating a dictionary
gensim_dictionary = corpora.Dictionary(tokens)

# creating a bag-of-words corpus
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

# displaying the contents
print("Output:\n", gensim_corpus)

# displaying the contents in readable format
word_freq = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]
print("\nOutput in a readable format:\n", word_freq)

Output:
Output:
 [[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], [(1, 1), (5, 2), (6, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]]

Output in a readable format:
 [[('This', 1), ('document', 1), ('is', 1), ('sample', 1)], [('Collection', 1), ('a', 1), ('corpus', 1), ('documents', 1), ('make', 1), ('of', 1)], [('document', 1), ('a', 2), ('corpus', 1), ('of', 1), ('You', 1), ('can', 1), ('convenient', 1), ('for', 1), ('mathematically', 1), ('representation', 1), ('vectorize', 1), ('your', 1)]]

The first tuple, (0, 1), in the above output means that the word with ID 0 ('This') occurred once in the first document. Similarly, (5, 2) means that the word with ID 5 ('a') appeared twice in the third document. The second output replaces each ID with its word so that the counts are easier to read.
