How to perform chunking on a paragraph in nlp

This recipe helps you perform chunking on a paragraph in nlp

Recipe Objective

How to perform chunking on a paragraph? Chunking, also known as shallow parsing, follows part-of-speech (POS) tagging to add more structure to a sentence. The resulting words or groups of words are called chunks. One can also define a pattern of words that cannot be part of a chunk; such words are known as chinks. The primary job of chunking is to group words into "noun phrases". Part-of-speech tags are combined with regular expressions. For example, if we want to capture nouns, verbs, and adjectives from a sentence, we can use a rule such as: chunk: {<NN.?>*<VBD.?>*<JJ.?>*?}. We can combine tags according to our needs and requirements, as there are no predefined rules.

Step 1 - Import the necessary libraries

from nltk import pos_tag
from nltk import RegexpParser

Step 2 - Take a sample text and split it

Sample_text = '''Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science.'''
Sample_text = Sample_text.split()
print(Sample_text)

['Albert', 'Einstein', 'was', 'a', 'German-born', 'theoretical', 'physicist', 'who', 'developed', 'the', 'theory', 'of', 'relativity,', 'one', 'of', 'the', 'two', 'pillars', 'of', 'modern', 'physics.', 'His', 'work', 'is', 'also', 'known', 'for', 'its', 'influence', 'on', 'the', 'philosophy', 'of', 'science.']
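Note that str.split() leaves punctuation attached to words ('relativity,', 'science.'), which is why some tags in the next step look odd. A quick regex tokenizer (a stdlib sketch, not part of the original recipe; nltk.word_tokenize would do this more thoroughly) separates punctuation into its own tokens:

```python
import re

text = "His work is also known for its influence on the philosophy of science."

# Match either words (keeping hyphenated forms like 'German-born' intact)
# or single punctuation marks -- a rough approximation of a real tokenizer.
tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
print(tokens)  # [..., 'philosophy', 'of', 'science', '.']
```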

Step 3 - Apply POS tagging

tagging = pos_tag(Sample_text)
print(tagging)

[('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('a', 'DT'), ('German-born', 'JJ'), ('theoretical', 'JJ'), ('physicist', 'NN'), ('who', 'WP'), ('developed', 'VBD'), ('the', 'DT'), ('theory', 'NN'), ('of', 'IN'), ('relativity,', 'JJ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('two', 'CD'), ('pillars', 'NNS'), ('of', 'IN'), ('modern', 'JJ'), ('physics.', 'FW'), ('His', 'PRP$'), ('work', 'NN'), ('is', 'VBZ'), ('also', 'RB'), ('known', 'VBN'), ('for', 'IN'), ('its', 'PRP$'), ('influence', 'NN'), ('on', 'IN'), ('the', 'DT'), ('philosophy', 'NN'), ('of', 'IN'), ('science.', 'NN')]
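Once tagged, the (word, tag) tuples are easy to filter with plain Python. For instance, pulling out just the nouns from a slice of the output above (the hardcoded list here is copied from that output so the snippet runs standalone):

```python
# First eleven (word, tag) pairs from the tagging step
tagging = [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'),
           ('a', 'DT'), ('German-born', 'JJ'), ('theoretical', 'JJ'),
           ('physicist', 'NN'), ('who', 'WP'), ('developed', 'VBD'),
           ('the', 'DT'), ('theory', 'NN')]

# NN, NNP, NNS, NNPS all start with 'NN'
nouns = [word for word, tag in tagging if tag.startswith('NN')]
print(nouns)  # ['Albert', 'Einstein', 'physicist', 'theory']
```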

Step 4 - Define the chunk patterns

chunk_patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*?}"""

Step 5 - Parse the chunk patterns using RegexpParser

parsing = RegexpParser(chunk_patterns)
print(parsing)

chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*?'>

Step 6 - Apply parser on tagging and print the results

Result = parsing.parse(tagging)
print("The Final Result Should look like this:", Result)

The Final Result Should look like this: (S
  (mychunk Albert/NNP Einstein/NNP was/VBD)
  a/DT
  (mychunk German-born/JJ theoretical/JJ)
  (mychunk physicist/NN)
  who/WP
  (mychunk developed/VBD)
  the/DT
  (mychunk theory/NN)
  of/IN
  (mychunk relativity,/JJ)
  one/CD
  of/IN
  the/DT
  two/CD
  (mychunk pillars/NNS)
  of/IN
  (mychunk modern/JJ)
  physics./FW
  His/PRP$
  (mychunk work/NN)
  is/VBZ
  also/RB
  known/VBN
  for/IN
  its/PRP$
  (mychunk influence/NN)
  on/IN
  the/DT
  (mychunk philosophy/NN)
  of/IN
  (mychunk science./NN))

As we can see, many words that do not match our rule, for example "known", "for", and "its", are left outside any chunk, while the words that do match the rule are grouped and tagged as "mychunk".
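The parse result is an nltk Tree, so the chunks can also be extracted programmatically by walking its subtrees. The snippet below rebuilds a small fragment of the result above by hand (so it runs standalone, without the tagger) and collects the chunked phrases:

```python
from nltk.tree import Tree

# A hand-built fragment of the recipe's final result tree
result = Tree('S', [
    Tree('mychunk', [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD')]),
    ('a', 'DT'),
    Tree('mychunk', [('German-born', 'JJ'), ('theoretical', 'JJ')]),
])

# Join the words inside every 'mychunk' subtree into a phrase
phrases = [' '.join(word for word, tag in sub.leaves())
           for sub in result.subtrees()
           if sub.label() == 'mychunk']
print(phrases)  # ['Albert Einstein was', 'German-born theoretical']
```

The same loop applied to the full Result object would list every chunk the grammar found.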

