How to perform chunking on a paragraph in NLP

This recipe helps you perform chunking on a paragraph in NLP
Last Updated: 17 Feb 2022

Recipe Objective

How do you perform chunking on a paragraph? Chunking, also known as shallow parsing, follows part-of-speech (POS) tagging and adds more structure to a sentence by grouping the tagged words. The resulting words or groups of words are called chunks. You can also define a pattern of words that must not be part of a chunk; such words are known as chinks. The primary use of chunking is to group words into noun phrases. Chunk rules combine POS tags with regular-expression syntax. For example, to group nouns, past-tense verbs, and adjectives in a sentence, we can use the following rule:
chunk : {<NN.?>*<VBD.?>*<JJ.?>*?}
There are no predefined rules, so you can combine tags according to your needs and requirements.
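To make the idea of chinks concrete, here is a minimal sketch. It assumes NLTK is installed and uses a hand-tagged fragment of the sample sentence so that no tagger data is needed; the label NP and the grammar itself are illustrative choices, not part of the recipe. The first rule chunks every token, and the second rule chinks the past-tense verb back out of the chunk.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed tags, copied from the recipe's tagger output).
tagged = [
    ("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
    ("a", "DT"), ("German-born", "JJ"), ("theoretical", "JJ"),
    ("physicist", "NN"),
]

# Illustrative grammar with one chunk rule followed by one chink rule.
grammar = r"""
NP:
    {<.*>+}     # chunk every token into one group
    }<VBD>+{    # chink past-tense verbs out of that group
"""

tree = RegexpParser(grammar).parse(tagged)

# Collect the words of each NP chunk that remains after chinking.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(chunks)
```

The verb "was" ends up outside any chunk, which splits the sentence fragment into two groups around it.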

Step 1 - Import the necessary libraries

from nltk import pos_tag
from nltk import RegexpParser

Step 2 - Take a sample text and split it

Sample_text = '''Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science.'''
Sample_text = Sample_text.split()
print(Sample_text)

['Albert', 'Einstein', 'was', 'a', 'German-born', 'theoretical', 'physicist', 'who', 'developed', 'the', 'theory', 'of', 'relativity,', 'one', 'of', 'the', 'two', 'pillars', 'of', 'modern', 'physics.', 'His', 'work', 'is', 'also', 'known', 'for', 'its', 'influence', 'on', 'the', 'philosophy', 'of', 'science.']
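Note that split() only breaks on whitespace, so punctuation stays attached to tokens such as 'relativity,' and 'physics.'. If you would rather separate punctuation before tagging, one possible approach is a small regular expression, sketched below with only the standard library. This is an optional alternative and not what the recipe uses; the pattern is an assumption chosen to keep hyphenated words like German-born together.

```python
import re

text = ("Albert Einstein was a German-born theoretical physicist who "
        "developed the theory of relativity, one of the two pillars of "
        "modern physics.")

# Words (optionally hyphenated) become one token; each punctuation mark
# becomes its own token.
tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
print(tokens)
```

NLTK also ships its own tokenizers, such as word_tokenize, which need an additional data download.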

Step 3 - Apply POS tagging

tagging = pos_tag(Sample_text)
print(tagging)

[('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('a', 'DT'), ('German-born', 'JJ'), ('theoretical', 'JJ'), ('physicist', 'NN'), ('who', 'WP'), ('developed', 'VBD'), ('the', 'DT'), ('theory', 'NN'), ('of', 'IN'), ('relativity,', 'JJ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('two', 'CD'), ('pillars', 'NNS'), ('of', 'IN'), ('modern', 'JJ'), ('physics.', 'FW'), ('His', 'PRP$'), ('work', 'NN'), ('is', 'VBZ'), ('also', 'RB'), ('known', 'VBN'), ('for', 'IN'), ('its', 'PRP$'), ('influence', 'NN'), ('on', 'IN'), ('the', 'DT'), ('philosophy', 'NN'), ('of', 'IN'), ('science.', 'NN')]
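The tags returned by pos_tag follow the Penn Treebank tagset. As a reading aid, the sketch below maps the tags that appear in the output above to short descriptions. The dictionary name PENN_TAGS and the handful of sample pairs are illustrative; the lookup is plain Python and does not call NLTK.

```python
# Short descriptions for the Penn Treebank tags seen in the tagger output.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
    "VBN": "verb, past participle",
    "JJ": "adjective",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
    "WP": "wh-pronoun",
    "RB": "adverb",
    "PRP$": "possessive pronoun",
    "FW": "foreign word",
}

# A few (word, tag) pairs copied from the output above.
sample = [("Albert", "NNP"), ("was", "VBD"), ("a", "DT"),
          ("physicist", "NN"), ("His", "PRP$")]

for word, tag in sample:
    print(f"{word:<10} {tag:<5} {PENN_TAGS.get(tag, 'unknown tag')}")
```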

Step 4 - Define the chunk patterns

chunk_patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*?}"""
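Each angle-bracketed item in the pattern is a small regular expression that has to cover a whole tag. The sketch below is a simplified stand-in that uses Python's re module to show which tags each item would cover; it does not reproduce NLTK's internal machinery, and the helper name tag_matches is hypothetical.

```python
import re

def tag_matches(tag_pattern, tag):
    """Return True if a single tag pattern such as 'NN.?' covers the whole tag."""
    return re.fullmatch(tag_pattern, tag) is not None

candidate_tags = ["NN", "NNS", "NNP", "NNPS", "VBD", "VBN", "VBZ",
                  "JJ", "JJR", "DT", "IN"]

# The three tag patterns used inside the mychunk rule.
for pattern in ("NN.?", "VBD.?", "JJ.?"):
    covered = [tag for tag in candidate_tags if tag_matches(pattern, tag)]
    print(pattern, "->", covered)
```

Because .? allows at most one extra character, NN.? covers NN, NNS, and NNP but not NNPS, and VBD.? does not cover VBN or VBZ, so words with those tags stay outside the chunks.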

Step 5 - Parse the chunk patterns using RegexpParser

parsing = RegexpParser(chunk_patterns)
print(parsing)

chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*?'>

Step 6 - Apply the parser to the tagged words and print the results

Result = parsing.parse(tagging)
print("The Final Result Should look like this:", Result)

The Final Result Should look like this: (S
  (mychunk Albert/NNP Einstein/NNP was/VBD)
  a/DT
  (mychunk German-born/JJ theoretical/JJ)
  (mychunk physicist/NN)
  who/WP
  (mychunk developed/VBD)
  the/DT
  (mychunk theory/NN)
  of/IN
  (mychunk relativity,/JJ)
  one/CD
  of/IN
  the/DT
  two/CD
  (mychunk pillars/NNS)
  of/IN
  (mychunk modern/JJ)
  physics./FW
  His/PRP$
  (mychunk work/NN)
  is/VBZ
  also/RB
  known/VBN
  for/IN
  its/PRP$
  (mychunk influence/NN)
  on/IN
  the/DT
  (mychunk philosophy/NN)
  of/IN
  (mychunk science./NN))

As we can see, words whose tags are not covered by our rule, for example known (VBN), for (IN), and its (PRP$), are left outside any chunk. The words whose tags match the rule are grouped under the label "mychunk".
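The parser returns an NLTK Tree, so the chunks can also be collected programmatically instead of being read off the printed output. The sketch below assumes NLTK is installed and rebuilds the first few nodes of the result by hand, so it runs without the tagger; in the recipe itself you would walk Result directly.

```python
from nltk import Tree

# Hand-built copy of the first few nodes of the printed result.
result = Tree("S", [
    Tree("mychunk", [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD")]),
    ("a", "DT"),
    Tree("mychunk", [("German-born", "JJ"), ("theoretical", "JJ")]),
    Tree("mychunk", [("physicist", "NN")]),
    ("who", "WP"),
])

# Keep only subtrees labelled "mychunk" and join their words into phrases.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees(filter=lambda t: t.label() == "mychunk")
]
print(chunks)
```

Tokens such as a/DT and who/WP sit directly under the root S node, so they never appear in the collected list.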
