How to tokenize non english language text in nlp

This recipe helps you tokenize non english language text in nlp

Recipe Objective

How to tokenize non english language text?

As we have discussed earlier about tokenization it is the task of chopping the text into smaller peices which are called tokens, here the tokens can be either words, characters or subwords. There are different tokenizers with different functionality:

Sentence tokenizer - Split the text into sentences from a paragraph.

word tokenizer - Split the text into words.

tokenize sentence or word of different language - using the different pickle file other than English we can tokenize the text in sentences or words.

Step 1 - Import the library

import nltk.data

Step 2 - load the tokenizer

tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

Here we are loading the spanish language tokenizer, and storing it in a variable

Step 3 - Take a sample text

Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."

Here we have taken a sample text in spanish language and its english conversion is "Hello everyone your learning tokenization of different language".

Step 4 - Apply Tokenization

tokenize_spanish.tokenize(Sample_text)

['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']

What Users are saying..

profile image

Anand Kumpatla

Sr Data Scientist @ Doubleslash Software Solutions Pvt Ltd
linkedin profile url

ProjectPro is a unique platform and helps many people in the industry to solve real-life problems with a step-by-step walkthrough of projects. A platform with some fantastic resources to gain... Read More

Relevant Projects

Build ARCH and GARCH Models in Time Series using Python
In this Project we will build an ARCH and a GARCH model using Python

NLP Project to Build a Resume Parser in Python using Spacy
Use the popular Spacy NLP python library for OCR and text classification to build a Resume Parser in Python.

AWS MLOps Project to Deploy Multiple Linear Regression Model
Build and Deploy a Multiple Linear Regression Model in Python on AWS

MLOps Project on GCP using Kubeflow for Model Deployment
MLOps using Kubeflow on GCP - Build and deploy a deep learning model on Google Cloud Platform using Kubeflow pipelines in Python

NLP Project for Multi Class Text Classification using BERT Model
In this NLP Project, you will learn how to build a multi-class text classification model using using the pre-trained BERT model.

Hands-On Approach to Causal Inference in Machine Learning
In this Machine Learning Project, you will learn to implement various causal inference techniques in Python to determine, how effective the sprinkler is in making the grass wet.

Build a Multi-Class Classification Model in Python on Saturn Cloud
In this machine learning classification project, you will build a multi-class classification model in Python on Saturn Cloud to predict the license status of a business.

Medical Image Segmentation Deep Learning Project
In this deep learning project, you will learn to implement Unet++ models for medical image segmentation to detect and classify colorectal polyps.

Classification Projects on Machine Learning for Beginners - 1
Classification ML Project for Beginners - A Hands-On Approach to Implementing Different Types of Classification Algorithms in Machine Learning for Predictive Modelling

Build a Customer Churn Prediction Model using Decision Trees
Develop a customer churn prediction model using decision tree machine learning algorithms and data science on streaming service data.