How to tokenize non-English language text in NLP

This recipe helps you tokenize non-English language text in NLP.

Recipe Objective

How to tokenize non-English language text?

As discussed earlier, tokenization is the task of chopping text into smaller pieces called tokens, which can be words, characters, or subwords. Different tokenizers serve different purposes:

Sentence tokenizer - Splits a paragraph into sentences.

Word tokenizer - Splits the text into words.

Tokenizing sentences or words in a different language - By loading a pickle file for a language other than English, we can split text in that language into sentences or words.
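To make the difference between the first two tokenizer types concrete, here is a minimal sketch that uses only Python's standard `re` module. The helper names `split_sentences` and `split_words` are hypothetical, and the regular expressions are deliberately naive stand-ins, not how NLTK's tokenizers work internally.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(text):
    """Naive word splitter: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

paragraph = "Hola a todos. Hoy aprendemos tokenización."
sentences = split_sentences(paragraph)   # sentence-level tokens
words = split_words(sentences[0])        # word-level tokens of the first sentence

print(sentences)
print(words)
```

A sentence tokenizer returns whole sentences as tokens, while a word tokenizer breaks each sentence further into words and punctuation marks.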

Step 1 - Import the library

import nltk.data

Step 2 - Load the tokenizer

tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

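Here we load the Spanish tokenizer and store it in a variable. The pickle file holds a Punkt model, which is a sentence tokenizer trained on Spanish text. The Punkt data must already be downloaded (for example with nltk.download('punkt')) before this call succeeds. Newer NLTK releases replace the pickled models with the punkt_tab resource, so this path may not load there.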
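Switching to another language comes down to changing the file name in the resource path. The sketch below builds such paths with a hypothetical helper, `punkt_resource_path`; it assumes other languages follow the same naming pattern as the Spanish path used above, and which languages are actually available depends on the Punkt data installed on your machine.

```python
def punkt_resource_path(language):
    """Build a resource path in the same pattern as the Spanish example.

    Assumption: other language models are stored as '<language>.pickle'
    next to 'spanish.pickle'. This helper is illustrative, not part of NLTK.
    """
    return f"tokenizers/punkt/PY3/{language.lower()}.pickle"

for lang in ("spanish", "german", "french"):
    print(punkt_resource_path(lang))
```

The resulting string can be passed to nltk.data.load in the same way as the Spanish path, provided the corresponding model file exists.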

Step 3 - Take a sample text

Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."

Here we have taken a sample text in Spanish. Its English translation is roughly "Hello everyone, your learning of tokenization of different languages."

Step 4 - Apply tokenization

tokenize_spanish.tokenize(Sample_text)

['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']
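The result is a list with a single item because the sample text contains only one sentence. A language-specific model matters more once the text has several sentences and abbreviations. The following sketch illustrates the idea in plain Python, with a small hand-written abbreviation set (`SPANISH_ABBREVIATIONS`, a hypothetical list chosen for this example). Punkt learns such abbreviations from training data instead, so treat this as a simplified illustration rather than NLTK's algorithm.

```python
# Hypothetical, hand-picked abbreviations; a trained model would learn these from data.
SPANISH_ABBREVIATIONS = {"sr.", "sra.", "dr.", "dra.", "ud.", "uds."}

def split_sentences(text, abbreviations=frozenset()):
    """Split on tokens ending in '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in abbreviations
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

text = "El Sr. García llegó tarde. La clase ya había empezado."

naive = split_sentences(text)                         # splits wrongly after "Sr."
aware = split_sentences(text, SPANISH_ABBREVIATIONS)  # keeps "El Sr. García ..." together

print(naive)
print(aware)
```

Without knowledge of Spanish abbreviations the text is cut into three pieces; with it, the two real sentences are recovered.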
