How to tokenize non-English language text?
As discussed earlier, tokenization is the task of chopping text into smaller pieces called tokens; a token can be a word, a character, or a subword. There are different tokenizers with different functionality:
Sentence tokenizer - Splits a paragraph of text into sentences.
Word tokenizer - Splits the text into words.
Tokenizing sentences or words in another language - By loading a pickle file for a language other than English, we can tokenize text in that language into sentences or words.
tokenize_spanish = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
Here we load the Spanish-language sentence tokenizer and store it in a variable.
Sample_text = "Hola a todos, su aprendizaje de tokenización de diferentes idiomas."
Here we have taken a sample text in Spanish; its English translation is roughly "Hello everyone, you are learning tokenization of different languages".
tokenize_spanish.tokenize(Sample_text)
['Hola a todos, su aprendizaje de tokenización de diferentes idiomas.']