How to save and load a Gensim corpus

In this recipe, we will learn how to save a Gensim corpus to disk and how to load a previously saved corpus.

Recipe Objective: How to save and load a Gensim corpus?

We can save a corpus using the following script:

#importing required libraries
from gensim.utils import simple_preprocess
from gensim import corpora
import os

#creating a class for reading multiple files
class read_multiplefiles(object):
  def __init__(self, dir_path):
    self.dir_path = dir_path

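  def __iter__(self):
    for filename in os.listdir(self.dir_path):
      #opening each file in a with block so that it is closed after reading
      with open(os.path.join(self.dir_path, filename), encoding='utf-8') as file:
        for line in file:
          #yielding each line as a list of lowercase tokens
          yield simple_preprocess(line)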

#providing the path of the directory that contains the text files (replace with your own path)
directory_path = r"C:\abc"

#creating a dictionary
gensim_dictionary = corpora.Dictionary()

#creating a bag-of-words corpus, treating each line of each file in the directory as one document
gensim_corpus = [gensim_dictionary.doc2bow(tokens, allow_update=True) for tokens in read_multiplefiles(directory_path)]

#saving the corpus in Matrix Market format at the given path
corpora.MmCorpus.serialize('Desktop/gensim_corpus.mm', gensim_corpus)

We can load the saved corpus using the following script:

#loading the saved corpus
load_corpus = corpora.MmCorpus('Desktop/gensim_corpus.mm')

#displaying contents of the corpus, one document per line
for document in load_corpus:
  print(document)

Output:
[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]
[(4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)]
[(0, 1.0), (5, 1.0), (8, 1.0), (9, 1.0), (10, 1.0), (11, 1.0), (12, 1.0), (13, 1.0), (14, 1.0), (15, 1.0), (16, 1.0)]
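Each line of this output is one document, and each pair is a token id followed by the number of times that token occurs. The counts come back as floats because the corpus was read from a Matrix Market file. As a rough sketch of how such a document could be made readable, the snippet below maps ids to words and casts the counts to integers. The id_to_token mapping and the to_readable helper are hypothetical; in practice the mapping would come from the Gensim dictionary used to build the corpus.

```python
# One document as it comes back from the loaded corpus (first line of the output above)
loaded_document = [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]

# Hypothetical id-to-token mapping, standing in for the Gensim dictionary
id_to_token = {0: "alpha", 1: "beta", 2: "gamma", 3: "delta"}


def to_readable(document, id_to_token):
    """Replace token ids with tokens and cast the float counts to ints."""
    return [(id_to_token[token_id], int(count)) for token_id, count in document]


readable_document = to_readable(loaded_document, id_to_token)
print(readable_document)
```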
