Explain corpus streaming in Gensim

In this recipe, we will learn what corpus streaming is, how it works, and how it gives the Gensim library its edge.

Recipe Objective: Explain corpus streaming in Gensim

Assume a corpus contains millions of documents; it's not possible to hold all of them in RAM. Suppose instead the documents are saved in a file on disk, one document per line. Gensim's only requirement is that a corpus return one document vector at a time when iterated over.

The fact that a corpus doesn't have to be a list, a NumPy array, or a Pandas DataFrame gives Gensim its full power. Gensim accepts any object that, when iterated over, yields documents in sequence.
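As a minimal sketch, even a plain generator function qualifies as a corpus: each document is just a list of (token_id, count) pairs. The vectors here are made up for illustration. Note that a generator is exhausted after one pass, which is why the fuller example below uses a class whose `__iter__` can be restarted for multi-pass models.

```python
# A corpus is just an iterable of bag-of-words vectors; these
# (token_id, count) pairs are invented purely for illustration.
def tiny_corpus():
    yield [(0, 1)]
    yield [(1, 1), (2, 2)]
    yield [(0, 1), (2, 1)]

# Gensim-style consumption: one document vector in memory at a time.
for vector in tiny_corpus():
    print(vector)
```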


Here's an example of corpus streaming in Gensim-

#importing required libraries
from smart_open import open
from collections import defaultdict
from gensim import corpora

#a sample corpus
docs = ["This is sample document",
"Collection of documents make a corpus",
"You can vectorize your corpus for a mathematically convenient representation of a document"]

#creating a list of stopwords
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist]
        for document in docs]

#counting the frequency of each word
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#creating a class that streams the corpus one document (line) at a time
class read_corpus:
    def __iter__(self):
        for line in open(r'C:\Users\Lenovo\Documents\document.txt', encoding='utf-8'):
            yield gensim_dictionary.doc2bow(line.lower().split())

#instantiating the class doesn't load the corpus into memory
memory_friendly_corpus = read_corpus()
print(memory_friendly_corpus)

print("\n")

#loading one vector into memory at a time
for vector in memory_friendly_corpus:
    print(vector)

Output:
<__main__.read_corpus object at 0x000002AFFFEC1F10>


[(0, 1)]
[(1, 1)]
[(0, 1), (1, 1)]

This versatility lets you design your own corpus classes that stream documents from disk, the network, a database, a DataFrame, and so on. Gensim's models are written so that they never require all vectors to be held in RAM at the same time.
