How to tokenize non-English language text?

This recipe helps you tokenize non-English language text with NLTK.

Recipe Objective

As discussed in earlier recipes, tokenization is the task of splitting text into smaller pieces called tokens, which can be words, characters, or subwords. Different tokenizers serve different purposes:

Sentence tokenizer - Splits a paragraph into sentences.

Word tokenizer - Splits the text into words.

Sentence or word tokenizer for another language - Loading the pickle file for a language other than English lets us tokenize text in that language. The Punkt pickle files used in this recipe split text into sentences.
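
To make these granularities concrete, here is a minimal sketch that uses only Python's standard library. The regular expressions are simplified stand-ins chosen for illustration, not NLTK's tokenizers, and the function names are hypothetical. Subword tokenization is left out because it needs a trained vocabulary.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_words(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def split_characters(text):
    # Every character, including punctuation, is a token.
    return list(text)

print(split_sentences("Tokens are small. Words are tokens too!"))
print(split_words("Words are tokens too!"))
print(split_characters("too!"))
```

Real tokenizers handle cases this sketch ignores, such as abbreviations that end in a period.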

Step 1 - Import the library

import nltk.data

Step 2 - Load the tokenizer

tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

Here we load the Spanish tokenizer and store it in a variable.
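
Other languages can be loaded the same way by changing the file name in the resource path. The sketch below builds that path from a language name. The helper names `punkt_resource` and `load_sentence_tokenizer` are hypothetical, and the code assumes the Punkt models are already downloaded (for example with `nltk.download('punkt')`). Recent NLTK releases ship Punkt in a different format (`punkt_tab`) and may refuse to load `.pickle` resources, so treat this as a sketch for NLTK versions where the call in this step works.

```python
def punkt_resource(language):
    # Follows the path pattern used in this step; language names are
    # lower-case, e.g. "spanish" or "german".
    return f"tokenizers/punkt/PY3/{language.lower()}.pickle"

def load_sentence_tokenizer(language):
    # nltk is imported here so the path helper stays usable without it.
    import nltk.data
    return nltk.data.load(punkt_resource(language))

print(punkt_resource("spanish"))
print(punkt_resource("German"))
```

Calling `load_sentence_tokenizer("german")` would then return a sentence tokenizer for German, provided the model file is installed.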

Step 3 - Take a sample text

Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."

Here we have taken a sample text in Spanish; in English it reads roughly "Hello everyone, you are learning tokenization of different languages".

Step 4 - Apply Tokenization

tokenize_spanish.tokenize(Sample_text)
['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']
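
The result is a list with a single element because the loaded tokenizer splits text into sentences and the sample contains only one sentence. The sketch below uses a two-sentence text, made up for illustration, to show the split. It calls `sent_tokenize` with the `language` argument. The fallback to a simple regular expression when NLTK or its Punkt data is missing is an assumption added so the snippet runs anywhere; it is not part of NLTK.

```python
import re

def spanish_sentences(text):
    try:
        # Uses NLTK's Punkt model for Spanish when it is installed.
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text, language="spanish")
    except (ImportError, LookupError):
        # Assumed fallback: split after '.', '!' or '?' followed by whitespace.
        return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Hola a todos. Hoy aprendemos la tokenización de diferentes idiomas."
sentences = spanish_sentences(text)
print(len(sentences))
print(sentences)
```

Each sentence comes back as a separate list item, and the accented characters are preserved.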
