What is a Tokenizer in transformers?

This recipe explains what a tokenizer is in the transformers library.

Recipe Objective - What is a Tokenizer in transformers?

The tokenizer is responsible for preparing input for the model: it converts raw text into the token IDs the model expects. The library contains tokenizers for all of its models. Most tokenizers come in two versions: a full Python implementation and a "fast" implementation backed by the Rust library Tokenizers. The "fast" implementation provides:

1. a significant speedup, especially for batch tokenization, and
2. additional methods for mapping between the original string (characters and words) and the token space (for example, getting the index of the token that contains a given character, or the span of characters that corresponds to a given token).

In older releases of the library, SentencePiece-based tokenizers (used by the T5, ALBERT, CamemBERT, XLM-RoBERTa, and XLNet models) had no "fast" implementation; more recent releases provide fast versions for these models as well.
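The mapping methods mentioned above can be sketched as follows. To stay runnable without downloading a model, this example assumes a toy three-entry vocabulary (hypothetical, not taken from any pretrained checkpoint) built with the tokenizers library and wrapped in PreTrainedTokenizerFast. The char_to_token and token_to_chars methods belong to the object returned by the tokenizer and are only available with "fast" tokenizers.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy vocabulary, assumed purely for illustration
vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# Rust-backed tokenizer that splits on whitespace and looks up whole words
backend = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# Wrap the backend so it exposes the transformers "fast" tokenizer API
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

text = "hello world"
encoding = fast_tokenizer(text)

# Tokens produced for the input string
tokens = encoding.tokens()

# Index of the token that contains character 6 (the "w" of "world")
token_index = encoding.char_to_token(6)

# Character span in the original string covered by that token
span = encoding.token_to_chars(token_index)

print(tokens)
print(token_index)
print(text[span.start:span.end])
```

With a pretrained "fast" tokenizer the same calls should apply unchanged; only the vocabulary and the splitting rules differ.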

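Main tokenizer classes:

1. PreTrainedTokenizer - base class for the Python ("slow") tokenizers
2. PreTrainedTokenizerFast - base class for the Rust-backed ("fast") tokenizers
3. BatchEncoding - not a tokenizer itself, but the dictionary-like object returned by a tokenizer's encoding methods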
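As a rough illustration of how these classes relate, the sketch below tokenizes a small batch and inspects the result. It again assumes a toy vocabulary (hypothetical, chosen only so the example runs offline); the name toy_tokenizer is likewise just for this example.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import BatchEncoding, PreTrainedTokenizerFast

# Toy vocabulary with a padding token, assumed purely for illustration
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
backend = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

toy_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend, unk_token="[UNK]", pad_token="[PAD]"
)

# Batch tokenization: shorter sequences are padded to the longest one
batch = toy_tokenizer(["hello world", "hello"], padding=True)

print(type(toy_tokenizer).__name__, toy_tokenizer.is_fast)
print(isinstance(batch, BatchEncoding))
print(batch["input_ids"])        # token IDs, one list per input sentence
print(batch["attention_mask"])   # 1 for real tokens, 0 for padding
```

The returned BatchEncoding can be indexed like a dictionary, which is why its entries can be passed to a model as keyword arguments.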

Example -

Let's see how to create a tokenizer in transformers:

# Importing libraries
from transformers import BertTokenizer

# Creating the tokenizer from the pretrained 'bert-base-uncased' checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Printing the tokenizer to inspect its configuration
print(tokenizer)

Output (the exact representation may vary with the library version) -
PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In this way, we can create a tokenizer in transformers by loading a pretrained one.
