How does XLNetTokenizer work in transformers?

This recipe explains how does XLNetTokenizer work in transformers.

Recipe Objective - How does XLNetTokenizer work in transformers?

Subword tokenization methods work on the idea that common words should not be broken down into smaller subwords, but rare words should be broken down into meaningful subwords.

For more related projects -

/projects/data-science-projects/keras-deep-learning-projects
/projects/data-science-projects/deep-learning-projects

Example of XLNetTokenizer:

# Importing XLNetTokenizer
from transformers import XLNetTokenizerFast
xlnet_tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")

# Passing input
xlnet_tokenizer.tokenize("Welcome to transformers tutorials!!!")

Output - 
['_Welcome', '_to', '_transform', 'ers', '_tutorial', 's', '!!!']

We can see that the words ["Welcome", "to", "tutorial"] are present in the tokenizer’s vocabulary, but the word "transformers" is not. Consequently, the tokenizer splits "transformers" into known subwords: ["transform" and "ers"]. The symbol "_" indicates the space.

In this way, we can perform XLNetTokenizer in transformers.

What Users are saying..

profile image

Savvy Sahai

Data Science Intern, Capgemini
linkedin profile url

As a student looking to break into the field of data engineering and data science, one can get really confused as to which path to take. Very few ways to do it are Google, YouTube, etc. I was one of... Read More

Relevant Projects

Loan Eligibility Prediction using Gradient Boosting Classifier
This data science in python project predicts if a loan should be given to an applicant or not. We predict if the customer is eligible for loan based on several factors like credit score and past history.

End-to-End Snowflake Healthcare Analytics Project on AWS-2
In this AWS Snowflake project, you will build an end to end retraining pipeline by checking Data and Model Drift and learn how to redeploy the model if needed

Predict Churn for a Telecom company using Logistic Regression
Machine Learning Project in R- Predict the customer churn of telecom sector and find out the key drivers that lead to churn. Learn how the logistic regression model using R can be used to identify the customer churn in telecom dataset.

Build an optimal End-to-End MLOps Pipeline and Deploy on GCP
Learn how to build and deploy an end-to-end optimal MLOps Pipeline for Loan Eligibility Prediction Model in Python on GCP

Time Series Analysis with Facebook Prophet Python and Cesium
Time Series Analysis Project - Use the Facebook Prophet and Cesium Open Source Library for Time Series Forecasting in Python

Build Portfolio Optimization Machine Learning Models in R
Machine Learning Project for Financial Risk Modelling and Portfolio Optimization with R- Build a machine learning model in R to develop a strategy for building a portfolio for maximized returns.

Predictive Analytics Project for Working Capital Optimization
In this Predictive Analytics Project, you will build a model to accurately forecast the timing of customer and supplier payments for optimizing working capital.

PyTorch Project to Build a GAN Model on MNIST Dataset
In this deep learning project, you will learn how to build a GAN Model on MNIST Dataset for generating new images of handwritten digits.

Word2Vec and FastText Word Embedding with Gensim in Python
In this NLP Project, you will learn how to use the popular topic modelling library Gensim for implementing two state-of-the-art word embedding methods Word2Vec and FastText models.

Skip Gram Model Python Implementation for Word Embeddings
Skip-Gram Model word2vec Example -Learn how to implement the skip gram algorithm in NLP for word embeddings on a set of documents.