Explain corpus streaming in Gensim

In this recipe, we will learn what corpus streaming is, how it works, and how it gives the Gensim library its edge.

Recipe Objective: Explain corpus streaming in Gensim

Assume a corpus contains millions of documents; it's not possible to hold all of them in RAM. Suppose instead the documents are saved in a file on disk, one document per line. Gensim's only requirement is that a corpus return one document vector at a time when iterated over.

The fact that a corpus doesn't have to be a list, a NumPy array, or a Pandas DataFrame gives Gensim its full power. Gensim accepts any object that, when iterated over, yields documents in sequence.
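As a minimal sketch, even a plain generator function qualifies as a corpus: each document is just a list of (token_id, count) pairs. The vectors here are made up for illustration. Note that a generator is exhausted after one pass, which is why the fuller example below uses a class whose `__iter__` can be restarted for multi-pass models.

```python
# A corpus is just an iterable of bag-of-words vectors; these
# (token_id, count) pairs are invented purely for illustration.
def tiny_corpus():
    yield [(0, 1)]
    yield [(1, 1), (2, 2)]
    yield [(0, 1), (2, 1)]

# Gensim-style consumption: one document vector in memory at a time.
for vector in tiny_corpus():
    print(vector)
```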


Here's an example of corpus streaming in Gensim-

#importing required libraries
from smart_open import open
from collections import defaultdict
from gensim import corpora

#a sample corpus
docs = ["This is sample document",
"Collection of documents make a corpus",
"You can vectorize your corpus for a mathematically convenient representation of a document"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist]
        for document in docs]

#counting the frequency of each word
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#creating a class that streams the corpus one document (line) at a time
class read_corpus:
    def __iter__(self):
        for line in open(r'C:\Users\Lenovo\Documents\document.txt', encoding='utf-8'):
            yield gensim_dictionary.doc2bow(line.lower().split())

#instantiating the class doesn't load the corpus into memory
memory_friendly_corpus = read_corpus()
print(memory_friendly_corpus)

print("\n")

#loading one vector into memory at a time
for vector in memory_friendly_corpus:
    print(vector)

Output:
<__main__.read_corpus object at 0x000002AFFFEC1F10>


[(0, 1)]
[(1, 1)]
[(0, 1), (1, 1)]

This versatility lets you design your own corpus classes that stream documents from disk, the network, a database, a DataFrame, and so on. Gensim's models are written so that they never require all vectors to be held in RAM at the same time.
