What are the different corpus formats in Gensim?

A vector space corpus can be serialized to disk in several file formats. This recipe walks through the ones Gensim supports and shows how to save and load a corpus in each.

Recipe Objective: What are the different corpus formats in Gensim?

Gensim implements these file formats through corpus streaming: documents are read from (or written to) disk lazily, one at a time, rather than the entire corpus being loaded into main memory at once.
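The idea of streaming can be illustrated without Gensim at all. The sketch below is only an illustration under stated assumptions: the `StreamedCorpus` class and its one-document-per-line `term_id:weight` layout are hypothetical and are not one of Gensim's formats. It shows how an iterable can hand out one document at a time while keeping only the current line in memory.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one sparse document at a time from a text file.

    Hypothetical line layout (not a Gensim format): space-separated
    "term_id:weight" pairs, one document per line. An empty line is an
    empty document.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be
        # iterated repeatedly while holding only one line in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                pairs = (item.split(":") for item in line.split())
                yield [(int(term_id), float(weight)) for term_id, weight in pairs]


# Write a tiny three-document file to demonstrate the streaming loop.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "toy_corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("0:0.5\n1:0.23\n\n")

    streamed_docs = []
    for document in StreamedCorpus(corpus_path):
        streamed_docs.append(document)

print(streamed_docs)
```

Collecting the documents into a list here is only for display; in real use each document would be processed and discarded inside the loop.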

The Matrix Market file format is one of the best known. Gensim also supports Joachims’ SVMlight format, Blei’s LDA-C format, and the GibbsLDA++ format. To save a corpus in each of these formats, follow these steps-

#importing required library
from gensim import corpora

#creating a sample corpus to save
sample_corpus = [[(0, 0.5)],[(1, 0.23)], []]

#saving in Matrix Market format
corpora.MmCorpus.serialize('sample_corpus.mm', sample_corpus)

#saving in Joachims’ SVMlight format
corpora.SvmLightCorpus.serialize('sample_corpus.svmlight', sample_corpus)

#saving in Blei’s LDA-C format
corpora.BleiCorpus.serialize('sample_corpus.lda-c', sample_corpus)

#saving in GibbsLDA++ format.
corpora.LowCorpus.serialize('sample_corpus.low', sample_corpus)
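To make the first of these formats concrete, here is a hand-rolled sketch of the Matrix Market coordinate layout for the same sample corpus. It is not Gensim's writer, and the helper name `matrix_market_lines` is an assumption made for illustration. The file that `MmCorpus.serialize` produces may differ in details such as whitespace or number formatting. What the sketch shows is the general structure: a header line, a `documents terms non-zeros` line, then one 1-based `document term weight` entry per non-zero value.

```python
def matrix_market_lines(corpus):
    """Return Matrix Market coordinate-format lines for a small corpus.

    Illustrative helper only: it materializes the corpus to count its
    dimensions, which a real streaming writer would avoid.
    """
    corpus = list(corpus)
    num_docs = len(corpus)
    # Term ids are 0-based, so the term count is the largest id plus one.
    num_terms = 1 + max(
        (term_id for doc in corpus for term_id, _ in doc), default=-1
    )
    num_nonzero = sum(len(doc) for doc in corpus)

    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {num_nonzero}",
    ]
    for doc_no, doc in enumerate(corpus, start=1):
        for term_id, weight in sorted(doc):
            # Matrix Market indices are 1-based: rows are documents,
            # columns are terms.
            lines.append(f"{doc_no} {term_id + 1} {weight}")
    return lines


sample_corpus = [[(0, 0.5)], [(1, 0.23)], []]
mm_lines = matrix_market_lines(sample_corpus)
print("\n".join(mm_lines))
```

The empty third document contributes no entry lines. It is recorded only through the document count in the second line.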

To load a corpus iterator from a Matrix Market file, follow the script written below-

#loading a corpus iterator from Matrix Market format
load_corpus = corpora.MmCorpus('sample_corpus.mm')

#displaying the corpus
print(load_corpus)

Corpus objects are streams, so printing one directly shows only a short summary of the object, not its documents. To see the documents themselves, use either of the following ways-

print("first way")

#loading it entirely into memory
print(list(load_corpus)) #calling list() converts any iterable to a plain Python list

print("\nsecond way")

#printing one document at a time by making use of the streaming interface
for docu in load_corpus:
    print(docu)

 
Output:
first way
[[(0, 0.5)], [(1, 0.23)], []]

second way
[(0, 0.5)]
[(1, 0.23)]
[]

You can load a document stream in one format and immediately save it in another. The script below takes the corpus we loaded from the Matrix Market file and saves it in Joachims’ SVMlight format.

corpora.SvmLightCorpus.serialize('load_corpus.svmlight', load_corpus)

Gensim can therefore be used as a memory-efficient tool for converting between corpus file formats.
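Why such a conversion stays memory-efficient can be sketched in plain Python. The example below does not use Gensim's own classes. The function `svmlight_to_ldac` is hypothetical, and it assumes SVMlight-style lines without `qid` fields, 1-based SVMlight feature ids, and 0-based LDA-C term ids. Each input line is converted and written out before the next one is read, so only one document is held in memory at a time.

```python
import io


def svmlight_to_ldac(lines):
    """Lazily convert SVMlight-style lines to LDA-C-style lines.

    Assumed input:  "<label> <feature_id>:<value> ..." (ids start at 1)
    Assumed output: "<num_terms> <term_id>:<count> ..." (ids start at 0)
    """
    for line in lines:
        # Drop any trailing "# comment", then split into fields.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the SVMlight label; LDA-C has no slot for it.
        pairs = [item.split(":") for item in fields[1:]]
        converted = [f"{int(fid) - 1}:{float(val):g}" for fid, val in pairs]
        yield " ".join([str(len(converted))] + converted)


# In-memory streams stand in for the input and output files.
source = io.StringIO("0 1:2 3:1\n0 2:4\n0\n")
target = io.StringIO()
for ldac_line in svmlight_to_ldac(source):
    target.write(ldac_line + "\n")

converted_text = target.getvalue()
print(converted_text)
```

Because `svmlight_to_ldac` is a generator, nothing is read from `source` until the loop asks for the next line. This is the same lazy pattern the recipe relies on when one corpus object is passed to another format's `serialize` call.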
