What is tokenization?

This recipe explains what tokenization is.

Recipe Objective

What is tokenization? Tokenization is the task of chopping text into smaller pieces called tokens. The tokens can be words, characters, or subwords. Different tokenizers offer different functionality; let's go through them one by one.
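
To make the three token granularities concrete, here is a minimal pure-Python sketch. It does not use NLTK. The sample text, the `TOY_VOCAB` set, and the `toy_subword_tokenize` helper are all made up for illustration; real subword tokenizers learn their vocabulary from data rather than using a hand-written one.

```python
text = "Tokenization helps"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: one token per character (whitespace dropped).
char_tokens = [ch for ch in text if not ch.isspace()]

# Hypothetical toy vocabulary, chosen by hand for this example only.
TOY_VOCAB = {"token", "ization", "help", "s"}

def toy_subword_tokenize(word, vocab):
    """Greedy longest-match split of one word into vocabulary pieces."""
    word = word.lower()
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Subword-level tokens for every word in the text.
subword_tokens = [p for w in word_tokens for p in toy_subword_tokenize(w, TOY_VOCAB)]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```

The same text yields 2 word tokens, 4 subword tokens, and 17 character tokens, which shows the trade-off between vocabulary size and sequence length.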

Step 1 - Sentence tokenization: import sent_tokenize

from nltk.tokenize import sent_tokenize

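This tokenizer splits a paragraph into sentences. It relies on NLTK's pretrained Punkt model, so the Punkt data may need to be downloaded once with nltk.download before the first call.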

Step 2 - Take a simple text and apply sentence tokenization to it

My_text = "Hello everyone, Welcome to the session. Now your going to study about tokenization !!"
sent_tokenize(My_text)

['Hello everyone, Welcome to the session.', 'Now your going to study about tokenization !', '!']

Here we can see that the paragraph has been split into sentences. Note that the trailing '!!' is also split, so the last '!' appears as its own item.
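
For intuition about what a sentence tokenizer has to decide, the following is a rough sketch of a naive splitter that uses only the standard library. It is not how sent_tokenize works internally; `naive_sent_split` is a hypothetical helper that simply splits after '.', '!' or '?' when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample_text = ("Hello everyone, Welcome to the session. "
               "Now your going to study about tokenization !!")
sentences = naive_sent_split(sample_text)
print(sentences)

# The naive rule breaks on abbreviations, which is one reason
# trained sentence tokenizers exist.
print(naive_sent_split("Dr. Smith arrived."))
```

Unlike the NLTK output above, this rule keeps the trailing '!!' attached to the second sentence, and it wrongly splits after abbreviations such as "Dr.".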

Step 3 - Word tokenization: import word_tokenize

from nltk.tokenize import word_tokenize

Step 4 - Apply word tokenization to the same text

word_tokenize(My_text)
['Hello',
 'everyone',
 ',',
 'Welcome',
 'to',
 'the',
 'session',
 '.',
 'Now',
 'your',
 'going',
 'to',
 'study',
 'about',
 'tokenization',
 '!',
 '!']

From the output above we can see that the text has been split into individual words, with each punctuation mark returned as a separate token.
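
The idea of separating words from punctuation can be sketched with a single regular expression. This is an approximation for illustration, not the algorithm behind word_tokenize; `simple_word_tokenize` is a hypothetical helper, and it will differ from NLTK on inputs such as contractions.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample_text = ("Hello everyone, Welcome to the session. "
               "Now your going to study about tokenization !!")
tokens = simple_word_tokenize(sample_text)
print(tokens)
```

For this particular sentence the sketch produces the same 17 tokens as the word_tokenize output shown above.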
