Explain corpus streaming in Gensim

In this recipe, we will learn what corpus streaming is, how it is done, and how it gives the Gensim library its edge.

Recipe Objective: Explain corpus streaming in Gensim

Assume that a corpus contains millions of documents; it is not possible to hold all of them in RAM. Suppose instead that the documents are stored in a file on disk, one document per line. Gensim's only requirement is that the corpus return one document vector at a time when iterated over.

The fact that a corpus does not have to be a list, a NumPy array, or a Pandas dataframe is what gives Gensim its full power: Gensim accepts any object that, when iterated over, produces documents in sequential order.
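To make this concrete, here is a minimal plain-Python sketch (no Gensim required) of what "produces documents when iterated over" means. The stream_tokens generator and the sample sentences below are illustrative, not part of the Gensim API:

```python
#a minimal sketch of the streaming idea: any object that yields one
#document at a time can serve as a corpus

def stream_tokens(lines):
    """Yield one tokenized document at a time instead of building a full list."""
    for line in lines:
        yield line.lower().split()

raw = ["This is sample document",
       "Collection of documents make a corpus"]

streamed = stream_tokens(raw)  #nothing is processed yet
first = next(streamed)         #only now is the first line tokenized
print(first)                   #['this', 'is', 'sample', 'document']
```

Because the generator is lazy, only one document ever occupies memory at a time, regardless of how many lines the source holds.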


Here's an example of corpus streaming in Gensim-

#importing required libraries
from smart_open import open
from collections import defaultdict
from gensim import corpora

#a sample corpus
docs = ["This is sample document",
        "Collection of documents make a corpus",
        "You can vectorize your corpus for a mathematically convenient representation of a document"]

#creating a set of stop words
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#counting the frequency of each token
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#defining a class that reads the corpus one document (one line) at a time
class read_corpus:
    def __iter__(self):
        for line in open(r'C:\Users\Lenovo\Documents\document.txt', encoding='utf-8'):
            yield gensim_dictionary.doc2bow(line.lower().split())

#instantiating the class does not load the corpus into memory
memory_friendly_corpus = read_corpus()
print(memory_friendly_corpus)

print("\n")

#loading one vector into memory at a time
for vector in memory_friendly_corpus:
    print(vector)

Output:
<__main__.read_corpus object at 0x000002AFFFEC1F10>


[(0, 1)]
[(1, 1)]
[(0, 1), (1, 1)]

This versatility enables you to design your own corpus classes that stream documents from disk, the network, a database, a dataframe, and so on. Gensim's models are written so that they never require all vectors to be held in RAM at the same time.
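One design choice in the example above is worth noting: the corpus is a class with an __iter__ method rather than a bare generator. Gensim models often make several passes over a corpus, and a generator is exhausted after one pass, while the class restarts from the beginning of the file on every iteration. A plain-Python sketch of this difference, using a temporary file in place of the document.txt path from the recipe:

```python
import os
import tempfile

#write a tiny two-line "corpus" to a temporary file for the demo
path = os.path.join(tempfile.mkdtemp(), "document.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("sample document\ncorpus document\n")

class restartable_corpus:
    """Restartable corpus: each iteration reopens the file from the start."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

corpus = restartable_corpus(path)
passes = [list(corpus), list(corpus)]  #two full passes over the same file
print(passes[0] == passes[1])          #True: a bare generator could not repeat the pass
```

A generator object handed to a multi-pass model would silently yield nothing on the second pass; the __iter__ class avoids that pitfall.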

