How to add custom stopwords and then remove them from text in nltk

This recipe helps you add custom stopwords and then remove them from text in nltk
Last Updated: 19 Jan 2023


Recipe Objective

Some words in a sentence or text contribute little to its meaning, and it is often useful to remove them. NLTK ships a stopwords corpus that contains the most commonly used words of this kind. This recipe shows how to extend that list with our own custom words and then remove all of them from a piece of text.

Stopwords are words that add little meaning to a sentence or text and can usually be removed without changing what it says. Examples include "the", "is", "have", and "has".


Step 1 - Import nltk and download stopwords, and then import stopwords from NLTK

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

Step 2 - Let's see the stopword list present in the NLTK library, without adding our custom list

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Step 3 - Create a simple sentence

simple_text = "the city is beautiful, but due to traffic noice polution is increasing on daily basis which is hurting all the people"

Step 4 - Create our custom stopword list to add

new_stopwords = ["all", "due", "to", "on", "daily"]

Step 5 - Add the custom list to the NLTK stopword list

stpwrd = nltk.corpus.stopwords.words('english')
stpwrd.extend(new_stopwords)
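Because `stpwrd` is a plain list, `extend` appends a duplicate whenever a custom word is already in NLTK's list, and every membership test scans the whole list. A minimal sketch of a set-based alternative follows. The helper name `merge_stopwords` is hypothetical, and the short fallback list is only an assumption used when NLTK or its stopwords corpus is not installed.

```python
def merge_stopwords(base_words, custom_words):
    """Return a set containing the base stopwords plus lowercase custom words."""
    return set(base_words) | {word.lower() for word in custom_words}


try:
    from nltk.corpus import stopwords
    base_words = stopwords.words("english")
except (ImportError, LookupError):
    # Assumption: NLTK or its stopwords corpus is unavailable, so fall back to
    # a small hand-picked subset of the English list printed in Step 2.
    base_words = ["the", "is", "but", "which", "all", "to", "on"]

new_stopwords = ["all", "due", "to", "on", "daily"]
stopword_set = merge_stopwords(base_words, new_stopwords)

# Words from the custom list that the base list did not already contain.
genuinely_new = sorted(set(new_stopwords) - set(base_words))
print(genuinely_new)
```

Against the list printed in Step 2, only "due" and "daily" are new; "all", "to", and "on" are already present, so a set avoids storing them twice.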

Step 6 - Download and import the tokenizer from nltk

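# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt';
# download that resource if word_tokenize raises a LookupError.
nltk.download('punkt')
from nltk.tokenize import word_tokenize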

Step 7 - Tokenize the simple text using the word tokenizer

text_tokens = word_tokenize(simple_text)
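`word_tokenize` returns punctuation marks as separate tokens, which is why the comma shows up as its own item in the final result. For a quick check without the tokenizer data, a rough standard-library approximation can be sketched as below. The regex and the name `simple_tokenize` are assumptions for illustration, not NLTK's tokenizer, and they will treat contractions, abbreviations, and similar cases differently.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


simple_text = "the city is beautiful, but due to traffic noice polution is increasing on daily basis which is hurting all the people"
tokens = simple_tokenize(simple_text)

# The comma after "beautiful" becomes a token of its own.
print(tokens[:6])  # ['the', 'city', 'is', 'beautiful', ',', 'but']
```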


Step 8 - Remove the stopwords and print the result

removing_custom_words = [word for word in text_tokens if word not in stpwrd]
print(removing_custom_words)

['city', 'beautiful', ',', 'traffic', 'noice', 'polution', 'increasing', 'basis', 'hurting', 'people']

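As we can see, every word in the combined stopword list, including the custom words we added, has been removed from the text. Note that "all", "to", and "on" were already in NLTK's default list, so only "due" and "daily" were new additions. The comma remains because it is a punctuation token, not a stopword.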
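The filter in Step 8 compares tokens exactly as written, so a capitalized "The" would not match the lowercase entry "the", and punctuation tokens such as "," pass through. A minimal sketch of a stricter filter follows, assuming that dropping every non-alphabetic token is acceptable for the task. The function name `filter_tokens` is hypothetical, and the small stopword set is a stand-in for the combined list built in Step 5.

```python
def filter_tokens(tokens, stopword_set):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [
        token
        for token in tokens
        if token.isalpha() and token.lower() not in stopword_set
    ]


# Stand-in for the combined stopword list (assumption: a small subset).
stopword_set = {"the", "is", "but", "due", "to"}

tokens = ["The", "city", "is", "beautiful", ",", "but", "due", "to", "traffic"]
print(filter_tokens(tokens, stopword_set))  # ['city', 'beautiful', 'traffic']
```

Keep in mind that `str.isalpha()` also discards tokens that contain digits or apostrophes, so this check may be too aggressive for text where numbers or contractions matter.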
