How to tokenize non-English language text?
As discussed earlier, tokenization is the task of chopping text into smaller pieces called tokens; a token can be a word, a character, or a subword. There are different tokenizers with different functionality:
Sentence tokenizer - Splits a paragraph of text into sentences.
Word tokenizer - Splits the text into words.
Tokenizing sentences or words in another language - By loading a pickle file for a language other than English, we can tokenize text in that language into sentences or words.
tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
Here we load the Spanish-language sentence tokenizer and store it in a variable.
Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."
Here we have taken a sample text in Spanish; its English translation is roughly "Hello everyone, you are learning tokenization of different languages".
tokenize_spanish.tokenize(Sample_text)
['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']