Explain effective preprocessing of a corpus in Gensim

This recipe explains how to preprocess a corpus effectively using the gensim library in Python.

Recipe Objective: Explain effective preprocessing of a corpus in Gensim

As mentioned in the previous recipe, gensim provides the function gensim.utils.simple_preprocess() for more effective preprocessing of a corpus. It converts a document into a list of lowercase tokens and discards tokens that are too short or too long.

Syntax: gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
  Parameters:
   doc -> The source document (a string) to preprocess.
   deacc -> If True, accent marks are removed from tokens with the help of deaccent(). Defaults to False.
   min_len -> The minimum length of a token. Tokens shorter than this are discarded. Defaults to 2.
   max_len -> The maximum length of a token. Tokens longer than this are discarded. Defaults to 15.
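
To make the effect of these parameters concrete, here is a minimal standard-library sketch of the same steps: lowercase, optionally strip accents, split into alphabetic tokens, then filter by length. It is an approximation for illustration only, not gensim's implementation. The names approx_preprocess, strip_accents and WORD are hypothetical, and the real function may differ in edge cases.

```python
import re
import unicodedata

# Assumption: a token is a run of word characters that are not digits.
WORD = re.compile(r"(?:(?!\d)\w)+")


def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def approx_preprocess(doc, deacc=False, min_len=2, max_len=15):
    """Hypothetical stand-in for simple_preprocess (illustration only)."""
    text = doc.lower()
    if deacc:
        text = strip_accents(text)
    tokens = WORD.findall(text)
    # Keep only tokens whose length lies within [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]


sample = "A caf\u00e9 in 2024 serves extraordinarily good coffee"

default_tokens = approx_preprocess(sample)
deaccented_tokens = approx_preprocess(sample, deacc=True)
short_tokens = approx_preprocess(sample, max_len=6)

print(default_tokens)
print(deaccented_tokens)
print(short_tokens)
```

In this sketch the one-letter token and the number are dropped in every call, deacc=True changes only the accented word, and max_len=6 removes the 15-letter word.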

The function returns the list of tokens extracted from the input document. Here's an example:

# importing required libraries
import pprint
import gensim

# sample document
document = "This is a sample text for preprocessing."

# calling the function; min_len=2 discards the one-letter token "a"
preprocessed_text = gensim.utils.simple_preprocess(document, deacc=False, min_len=2, max_len=15)

# displaying final tokens
pprint.pprint(preprocessed_text)

Output:
['this', 'is', 'sample', 'text', 'for', 'preprocessing']
