Explain effective preprocessing of corpus in Gensim

This recipe explains how to preprocess a corpus effectively using the gensim library in Python.
Last Updated: 17 Aug 2022


Recipe Objective: Explain effective preprocessing of a corpus in Gensim

As mentioned in the previous recipe, gensim provides the function gensim.utils.simple_preprocess() for more effective preprocessing of a corpus. It converts a document into a list of lowercase tokens and discards tokens that are too short or too long.


Syntax: gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)
Parameters:
   doc -> The source document (a string) on which preprocessing should be performed.
   deacc -> If True, accent marks are removed from tokens with the help of deaccent(). Defaults to False.
   min_len -> The minimum length of a token. Tokens shorter than this length are discarded. Defaults to 2.
   max_len -> The maximum length of a token. Tokens longer than this length are discarded. Defaults to 15.
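
To make the effect of each parameter concrete, here is a minimal standard-library sketch that mimics the behaviour described above. It is not gensim's implementation: the names ALPHABETIC, strip_accents and sketch_preprocess are hypothetical, and the tokenizing pattern (runs of non-digit word characters, with tokens starting with an underscore dropped) is an assumption about how the library tokenizes.

```python
import re
import unicodedata

# Runs of word characters that are not digits. This pattern is an assumption
# modelled on gensim's alphabetic tokenizer, not copied from the library.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")


def strip_accents(text):
    """Remove combining accent marks, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def sketch_preprocess(doc, deacc=False, min_len=2, max_len=15):
    """Hypothetical stand-in that mimics the documented parameters."""
    text = doc.lower()
    if deacc:
        text = strip_accents(text)
    return [
        token
        for token in ALPHABETIC.findall(text)
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


sample = "A naïve café owner sold 12 croissants"

# deacc=True strips the accents; 'a' is too short and '12' is not alphabetic.
accents_removed = sketch_preprocess(sample, deacc=True)
# ['naive', 'cafe', 'owner', 'sold', 'croissants']

# Tightening the length window keeps only tokens of 5 to 8 characters.
length_filtered = sketch_preprocess(sample, deacc=True, min_len=5, max_len=8)
# ['naive', 'owner']
```

In real code you would call gensim.utils.simple_preprocess() directly; the sketch only shows which tokens each parameter keeps or rejects.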

The function returns the list of tokens extracted from the input document. Here's an example:

# importing required libraries
import pprint
import gensim

# sample document
document = "This is a sample text for preprocessing."

# calling the function
preprocessed_text = gensim.utils.simple_preprocess(document, deacc=False, min_len=2, max_len=15)

# displaying final tokens
pprint.pprint(preprocessed_text)

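Output:
['this', 'is', 'sample', 'text', 'for', 'preprocessing']

The single-character token 'a' is missing from the output because it is shorter than min_len=2, and the trailing period is discarded during tokenization.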
