How to tokenize non-English language text?

This recipe helps you tokenize non-English language text.

Recipe Objective

How to tokenize non-English language text?

As discussed earlier, tokenization is the task of chopping text into smaller pieces called tokens. The tokens can be words, characters, or subwords. Different tokenizers serve different purposes:

Sentence tokenizer - Splits a paragraph into sentences.

Word tokenizer - Splits the text into words.

Tokenizer for another language - By loading a pickle file trained for a language other than English, we can tokenize text in that language into sentences or words.
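To make the token levels above concrete, here is a minimal sketch using only Python's standard `re` module. The splitting rules are deliberately naive assumptions for illustration; they are not how NLTK's tokenizers work internally, and the sample text is made up for this example.

```python
import re

text = "Hola a todos. Hoy aprendemos tokenización."

# Sentence level: split after '.', '!' or '?' when followed by whitespace (naive rule).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sentences[0])

# Character level: every character of a word becomes its own token.
characters = list("Hola")

print(sentences)
print(words)
print(characters)
```

A rule like this breaks on abbreviations such as "Sr." or "etc.", which is one reason to prefer a tokenizer trained for the language, as in the steps that follow.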

Step 1 - Import the library

import nltk.data

Step 2 - Load the tokenizer

tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

Here we load the pre-trained Spanish tokenizer and store it in a variable. This pickle is a Punkt sentence tokenizer, so it splits text into sentences rather than words.
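The same path pattern should apply to the other languages shipped in NLTK's punkt data package. The sketch below only builds the resource string; `punkt_resource` is a hypothetical helper written for this example, not an NLTK function, and it assumes the pickle-based layout used in the step above (newer NLTK releases may organize this data differently).

```python
def punkt_resource(language):
    """Build the resource path for a language's Punkt pickle (Python 3 layout).

    Hypothetical helper: it only formats a string and does not check
    that the file is actually installed.
    """
    return f"tokenizers/punkt/PY3/{language.lower()}.pickle"


# Print the paths for a few languages.
for lang in ("spanish", "french", "german"):
    print(punkt_resource(lang))
```

The resulting string can be passed to nltk.data.load in the same way as the Spanish path, provided the punkt data has been downloaded beforehand, for example with nltk.download('punkt').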

Step 3 - Take a sample text

Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."

Here we have taken a sample text in Spanish; its English translation is roughly "Hello everyone, you are learning tokenization of different languages."

Step 4 - Apply Tokenization

tokenize_spanish.tokenize(Sample_text)
['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']
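The result above is a one-element list because the loaded model is a sentence tokenizer and the sample contains only one sentence. The sketch below runs a three-sentence Spanish text (made up for this example) through `nltk.tokenize.sent_tokenize` with `language="spanish"`, which loads the Spanish Punkt model for you. The regex fallback is an assumption of this sketch so that it still runs when NLTK or its Punkt data is not installed; it is not part of NLTK.

```python
import re


def split_spanish_sentences(text):
    """Split Spanish text into sentences, preferring NLTK's Punkt model."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text, language="spanish")
    except (ImportError, LookupError):
        # Fallback for this sketch only: NLTK or its Punkt data is missing,
        # so split naively after sentence-ending punctuation.
        return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Hola a todos. Hoy aprendemos tokenización. ¿Tienes alguna pregunta?"
sentences = split_spanish_sentences(sample)

# Print one sentence per line.
for sentence in sentences:
    print(sentence)
```

With more than one sentence in the input, the returned list has one entry per sentence instead of a single string.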
