Explain corpus streaming in Gensim

In this recipe, we will learn what corpus streaming is, how it is done, and how it gives the Gensim library its edge.

Recipe Objective: Explain corpus streaming in Gensim

Assume that a corpus contains millions of documents; it is not possible to hold all of them in RAM. Suppose instead that the documents are stored in a file on disk, one document per line. Gensim's only requirement is that the corpus return one document vector at a time when iterated over.

The fact that a corpus does not have to be a list, a NumPy array, or a Pandas dataframe is what gives Gensim its full power: Gensim accepts any object that, when iterated over, produces documents in sequential order.
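To make this concrete, here is a minimal plain-Python sketch (no Gensim required) of what "produces documents when iterated over" means. The stream_tokens generator and the sample sentences below are illustrative, not part of the Gensim API:

```python
#a minimal sketch of the streaming idea: any object that yields one
#document at a time can serve as a corpus

def stream_tokens(lines):
    """Yield one tokenized document at a time instead of building a full list."""
    for line in lines:
        yield line.lower().split()

raw = ["This is sample document",
       "Collection of documents make a corpus"]

streamed = stream_tokens(raw)  #nothing is processed yet
first = next(streamed)         #only now is the first line tokenized
print(first)                   #['this', 'is', 'sample', 'document']
```

Because the generator is lazy, only one document ever occupies memory at a time, regardless of how many lines the source holds.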


Here's an example of corpus streaming in Gensim-

#importing required libraries
from smart_open import open
from collections import defaultdict
from gensim import corpora

#a sample corpus
docs = ["This is sample document",
        "Collection of documents make a corpus",
        "You can vectorize your corpus for a mathematically convenient representation of a document"]

#creating a set of stop words
stoplist = set('for a of the and to in'.split())

#removing the stop words
txts = [[word for word in document.lower().split() if word not in stoplist] for document in docs]

#counting the frequency of each token
frequency = defaultdict(int)
for text in txts:
    for token in text:
        frequency[token] += 1

#removing words that appear only once
txts = [[token for token in text if frequency[token] > 1] for text in txts]

#creating a dictionary
gensim_dictionary = corpora.Dictionary(txts)

#defining a class that reads the corpus one document (one line) at a time
class read_corpus:
    def __iter__(self):
        for line in open(r'C:\Users\Lenovo\Documents\document.txt', encoding='utf-8'):
            yield gensim_dictionary.doc2bow(line.lower().split())

#instantiating the class does not load the corpus into memory
memory_friendly_corpus = read_corpus()
print(memory_friendly_corpus)

print("\n")

#loading one vector into memory at a time
for vector in memory_friendly_corpus:
    print(vector)

Output:
<__main__.read_corpus object at 0x000002AFFFEC1F10>


[(0, 1)]
[(1, 1)]
[(0, 1), (1, 1)]

This versatility enables you to design your own corpus classes that stream documents from disk, the network, a database, a dataframe, and so on. Gensim's models are written so that they never require all vectors to be held in RAM at the same time.
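One design choice in the example above is worth noting: the corpus is a class with an __iter__ method rather than a bare generator. Gensim models often make several passes over a corpus, and a generator is exhausted after one pass, while the class restarts from the beginning of the file on every iteration. A plain-Python sketch of this difference, using a temporary file in place of the document.txt path from the recipe:

```python
import os
import tempfile

#write a tiny two-line "corpus" to a temporary file for the demo
path = os.path.join(tempfile.mkdtemp(), "document.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("sample document\ncorpus document\n")

class restartable_corpus:
    """Restartable corpus: each iteration reopens the file from the start."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()

corpus = restartable_corpus(path)
passes = [list(corpus), list(corpus)]  #two full passes over the same file
print(passes[0] == passes[1])          #True: a bare generator could not repeat the pass
```

A generator object handed to a multi-pass model would silently yield nothing on the second pass; the __iter__ class avoids that pitfall.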

