What are the different corpus formats in Gensim

A Vector Space corpus can be serialized to disk in several file formats. This recipe walks through the formats Gensim supports.
Last Updated: 28 Jul 2022


Recipe Objective: What are the different corpus formats in Gensim?

A Vector Space corpus can be serialized to disk in several file formats. Gensim implements them through corpus streaming: documents are read from (or written to) disk lazily, one at a time, so the entire corpus never has to be loaded into main memory at once.
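The streaming idea can be illustrated without Gensim at all. The sketch below is not Gensim's implementation; the `StreamedCorpus` class and its one-document-per-line `term_id:weight` text layout are hypothetical, chosen only to show how an iterable can yield one sparse document at a time instead of holding the whole corpus in memory.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical helper: yields one sparse document at a time from a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Each line holds space-separated "term_id:weight" pairs (assumed layout).
                yield [
                    (int(term_id), float(weight))
                    for term_id, weight in (pair.split(":") for pair in line.split())
                ]


# Write a tiny three-document corpus; the last document is empty.
path = os.path.join(tempfile.mkdtemp(), "sample_corpus.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("0:0.5\n1:0.23\n\n")

streamed = StreamedCorpus(path)
documents = list(streamed)  # materialized here only to inspect the result
print(documents)
```

Only the line currently being parsed is held in memory during iteration, which is the property that makes streamed corpora practical for collections larger than RAM.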

The Matrix Market file format is one of the best known. Gensim also supports Joachims' SVMlight format, Blei's LDA-C format, and the GibbsLDA++ format. To save a corpus in each of these formats, call the matching serialize method:

# importing the required library
from gensim import corpora

# creating a sample corpus to save
sample_corpus = [[(0, 0.5)], [(1, 0.23)], []]

# saving in Matrix Market format
corpora.MmCorpus.serialize('sample_corpus.mm', sample_corpus)

# saving in Joachims' SVMlight format
corpora.SvmLightCorpus.serialize('sample_corpus.svmlight', sample_corpus)

# saving in Blei's LDA-C format
corpora.BleiCorpus.serialize('sample_corpus.lda-c', sample_corpus)

# saving in GibbsLDA++ format
corpora.LowCorpus.serialize('sample_corpus.low', sample_corpus)
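To get a feel for what a Matrix Market file holds, the sketch below renders the same sample corpus as coordinate-format lines: a header, a size line (documents, terms, non-zero entries), then one `document term weight` triple per entry with 1-based indices. The `to_matrix_market_lines` helper is hypothetical and written only for illustration; the file Gensim actually writes may differ in details such as whitespace padding or number formatting.

```python
def to_matrix_market_lines(corpus):
    """Hypothetical helper: render a sparse corpus as Matrix Market coordinate lines."""
    entries = []
    num_terms = 0
    for doc_no, document in enumerate(corpus, start=1):
        for term_id, weight in document:
            # Matrix Market indices are 1-based: row = document, column = term.
            entries.append(f"{doc_no} {term_id + 1} {weight}")
            num_terms = max(num_terms, term_id + 1)
    header = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return [header, size_line] + entries


sample_corpus = [[(0, 0.5)], [(1, 0.23)], []]
mm_lines = to_matrix_market_lines(sample_corpus)
print("\n".join(mm_lines))
```

The empty third document contributes no entry lines; it is visible only through the document count on the size line.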

To load a corpus iterator from a Matrix Market file, use the script below:

# loading a corpus iterator from Matrix Market format
load_corpus = corpora.MmCorpus('sample_corpus.mm')

# displaying the corpus
print(load_corpus)

Printing the corpus object directly does not show its documents, because a corpus is a stream; the output is only a short description of the object. To see the contents, use either of the following approaches:

print("first way")
# loading it entirely into memory
print(list(load_corpus))  # calling list() converts any sequence to a plain Python list

print("\nsecond way")
# printing one document at a time by making use of the streaming interface
for docu in load_corpus:
    print(docu)

 
Output:
first way
[[(0, 0.5)], [(1, 0.23)], []]

second way
[(0, 0.5)]
[(1, 0.23)]
[]

You can load a document stream in one format and immediately save it in another. The script below takes the corpus we loaded from the Matrix Market file and saves it in Joachims' SVMlight format.

corpora.SvmLightCorpus.serialize('load_corpus.svmlight', load_corpus)

Gensim can hence be used as a memory-efficient I/O format conversion tool.
