Explain the difference between word, character and sentence tokenizers in NLP

This recipe explains the difference between word, character and sentence tokenizers in NLP.
Last Updated: 14 Jun 2022

Recipe Objective

This recipe explains the difference between a word tokenizer, a character tokenizer and a sentence tokenizer. As discussed earlier, a tokenizer chops text into smaller pieces called tokens. Depending on the tokenizer, the tokens can be words, characters, subwords or sentences. The differences between word, character and sentence tokenizers are:

Word tokenizer

A word tokenizer splits a sentence into words. This process is called word tokenization.

Example: "Jon is playing football"
Solution: ["Jon", "is", "playing", "football"]

Character tokenizer

A character tokenizer splits a piece of text into its individual characters. This process is called character tokenization. Spaces are left out of the solution here for readability; a plain character split, as in Step 5, keeps them.

Example: "Jon is playing football"
Solution: ["J", "o", "n", "i", "s", "p", "l", "a", "y", "i", "n", "g", "f", "o", "o", "t", "b", "a", "l", "l"]

Sentence tokenizer

A sentence tokenizer splits a paragraph into sentences. This process is called sentence tokenization. The split happens at sentence boundaries such as full stops, not at commas.

Example: "Jon is playing football, he loves to play football in evening. His favourite player is Cristiano Ronaldo, he want to become like him."
Solution: ["Jon is playing football, he loves to play football in evening.", "His favourite player is Cristiano Ronaldo, he want to become like him."]

Step 1 - Import the necessary libraries

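import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer data (newer NLTK releases may need "punkt_tab" instead).
nltk.download("punkt")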

Step 2 - Take a sample text

Sample_text = "Jon is playing football, he loves to play football in evening. His favourite player is Cristiano Ronaldo, he want to become like him."

Step 3 - Word tokenization

print(word_tokenize(Sample_text))

['Jon', 'is', 'playing', 'football', ',', 'he', 'loves', 'to', 'play', 'football', 'in', 'evening', '.', 'His', 'favourite', 'player', 'is', 'Cristiano', 'Ronaldo', ',', 'he', 'want', 'to', 'become', 'like', 'him', '.']
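Note that the output treats punctuation marks as tokens of their own. To see why that matters, the sketch below compares a plain whitespace split with a simple regular expression that separates punctuation. It is only an approximation for illustration, not NLTK's actual algorithm, and the variable names are made up for this example.

```python
import re

sample_text = "Jon is playing football, he loves to play football in evening."

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
whitespace_tokens = sample_text.split()

# Runs of word characters, or single non-space punctuation marks, become tokens.
# This is a rough stand-in for a word tokenizer, not what NLTK does internally.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample_text)

print(whitespace_tokens)
print(regex_tokens)
```

With the whitespace split, "football," stays one token, while the regex version yields "football" and "," separately, which is closer to the word_tokenize output above.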

Step 4 - Sentence tokenization

print(sent_tokenize(Sample_text))

['Jon is playing football, he loves to play football in evening.', 'His favourite player is Cristiano Ronaldo, he want to become like him.']
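For comparison, a naive sentence splitter can be sketched with a regular expression that breaks after ".", "!" or "?" when whitespace follows. This is an illustrative assumption about how one might do it by hand, not how sent_tokenize works. It handles the sample text but stumbles on abbreviations, which is the kind of case a trained tokenizer such as Punkt is designed to handle.

```python
import re

sample_text = (
    "Jon is playing football, he loves to play football in evening. "
    "His favourite player is Cristiano Ronaldo, he want to become like him."
)

# Split wherever sentence-ending punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample_text)
print(naive_sentences)

# The same rule wrongly splits after the abbreviation "Dr.".
abbreviation_split = re.split(r"(?<=[.!?])\s+", "Dr. Smith plays football. He is good.")
print(abbreviation_split)
```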

Step 5 - Character tokenization

# Lowercase each character; spaces and punctuation are kept as tokens.
Sample2 = [s.lower() for s in Sample_text]
print(Sample2)

['j', 'o', 'n', ' ', 'i', 's', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 'f', 'o', 'o', 't', 'b', 'a', 'l', 'l', ',', ' ', 'h', 'e', ' ', 'l', 'o', 'v', 'e', 's', ' ', 't', 'o', ' ', 'p', 'l', 'a', 'y', ' ', 'f', 'o', 'o', 't', 'b', 'a', 'l', 'l', ' ', 'i', 'n', ' ', 'e', 'v', 'e', 'n', 'i', 'n', 'g', '.', ' ', 'h', 'i', 's', ' ', 'f', 'a', 'v', 'o', 'u', 'r', 'i', 't', 'e', ' ', 'p', 'l', 'a', 'y', 'e', 'r', ' ', 'i', 's', ' ', 'c', 'r', 'i', 's', 't', 'i', 'a', 'n', 'o', ' ', 'r', 'o', 'n', 'a', 'l', 'd', 'o', ',', ' ', 'h', 'e', ' ', 'w', 'a', 'n', 't', ' ', 't', 'o', ' ', 'b', 'e', 'c', 'o', 'm', 'e', ' ', 'l', 'i', 'k', 'e', ' ', 'h', 'i', 'm', '.']
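The output above is lowercased and includes spaces as tokens. If the original case should be kept, or whitespace dropped as in the character tokenizer example earlier, a minimal sketch could look like this (the variable names are chosen for illustration only):

```python
sample_text = "Jon is playing football"

# list() yields every character, spaces included, with case preserved.
all_chars = list(sample_text)

# Filtering out whitespace keeps only the visible characters.
no_space_chars = [ch for ch in sample_text if not ch.isspace()]

print(all_chars)
print(no_space_chars)
```

Whether to keep spaces depends on the task: dropping them loses the word boundaries, so the original text can no longer be rebuilt from the tokens.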
