How to find ngrams from text?
MACHINE LEARNING RECIPES DATA CLEANING PYTHON DATA MUNGING PANDAS CHEATSHEET     ALL TAGS

How to find ngrams from text?

How to find ngrams from text?

This recipe helps you find ngrams from text

Recipe Objective

How to find n-grams from text?. first of all lets understand what is Ngram so it means the sequence of N words, for e.g "A mango" is a 2-gram, "the cat is dancing" is 4-gram and many more.

Step 1 - Import the necessary packages

import nltk from nltk.util import ngrams

Step 2 - Define a function for ngrams

def extract_ngrams(data, num): n_grams = ngrams(nltk.word_tokenize(data), num) return [ ' '.join(grams) for grams in n_grams]

Here we have defined a function called extract_ngrams which will generate ngrams from sentences.

Step 3 - Take a sample text

My_text = 'Jack is very good in mathematics but he is not that much good in science'

Step 4 - Print the ngrams of the sentence

print("1-gram of the sample text: ", extract_ngrams(My_text, 1), '\n') print("2-gram of the sample text: ", extract_ngrams(My_text, 2), '\n') print("3-gram of the sample text: ", extract_ngrams(My_text, 3), '\n') print("4-gram of the sample text: ", extract_ngrams(My_text, 4), '\n')
1-gram of the sample text:  ['Jack', 'is', 'very', 'good', 'in', 'mathematics', 'but', 'he', 'is', 'not', 'that', 'much', 'good', 'in', 'science'] 

2-gram of the sample text:  ['Jack is', 'is very', 'very good', 'good in', 'in mathematics', 'mathematics but', 'but he', 'he is', 'is not', 'not that', 'that much', 'much good', 'good in', 'in science'] 

3-gram of the sample text:  ['Jack is very', 'is very good', 'very good in', 'good in mathematics', 'in mathematics but', 'mathematics but he', 'but he is', 'he is not', 'is not that', 'not that much', 'that much good', 'much good in', 'good in science'] 

4-gram of the sample text:  ['Jack is very good', 'is very good in', 'very good in mathematics', 'good in mathematics but', 'in mathematics but he', 'mathematics but he is', 'but he is not', 'he is not that', 'is not that much', 'not that much good', 'that much good in', 'much good in science']

The above code will give the output for the number of ngrams that we have specified their, It can be increased or decreased as per our requirement.

Relevant Projects

Churn Prediction in Telecom using Machine Learning in R
Estimating churners before they discontinue using a product or service is extremely important. In this ML project, you will develop a churn prediction model in telecom to predict customers who are most likely subject to churn.

Customer Churn Prediction Analysis using Ensemble Techniques
In this machine learning churn project, we implement a churn prediction model in python using ensemble techniques.

Natural language processing Chatbot application using NLTK for text classification
In this NLP AI application, we build the core conversational engine for a chatbot. We use the popular NLTK text classification library to achieve this.

Identifying Product Bundles from Sales Data Using R Language
In this data science project in R, we are going to talk about subjective segmentation which is a clustering technique to find out product bundles in sales data.

Image Segmentation using Mask R-CNN with Tensorflow
In this Deep Learning Project on Image Segmentation Python, you will learn how to implement the Mask R-CNN model for early fire detection.

Locality Sensitive Hashing Python Code for Look-Alike Modelling
In this deep learning project, you will find similar images (lookalikes) using deep learning and locality sensitive hashing to find customers who are most likely to click on an ad.

Loan Eligibility Prediction in Python using H2O.ai
In this loan prediction project you will build predictive models in Python using H2O.ai to predict if an applicant is able to repay the loan or not.

Resume parsing with Machine learning - NLP with Python OCR and Spacy
In this machine learning resume parser example we use the popular Spacy NLP python library for OCR and text classification.

Topic modelling using Kmeans clustering to group customer reviews
In this Kmeans clustering machine learning project, you will perform topic modelling in order to group customer reviews based on recurring patterns.

Demand prediction of driver availability using multistep time series analysis
In this supervised learning machine learning project, you will predict the availability of a driver in a specific area by using multi step time series analysis.