How does BertTokenizer work in transformers?

This recipe explains how BertTokenizer works in transformers.

Recipe Objective - How does BertTokenizer work in transformers?

Subword tokenization methods work on the idea that frequently used words should not be broken down into smaller subwords, while rare words should be decomposed into meaningful subwords. BertTokenizer implements WordPiece, the subword tokenization algorithm used by BERT.
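As a quick sketch of this idea (assuming the bert-base-uncased checkpoint; the exact vocabulary contents depend on the pretrained model), you can check whether a word exists as a whole entry in the tokenizer's WordPiece vocabulary:

# Checking which words are whole entries in the WordPiece vocabulary
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("welcome" in bert_tokenizer.vocab)     # common word: present as a single token
print("tutorials" in bert_tokenizer.vocab)   # rarer word: absent, so it gets split into subwords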



Example of BertTokenizer:

# Importing BertTokenizer
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing input
bert_tokenizer.tokenize("Welcome to Transformers tutorials!!!")

Output - 
['welcome', 'to', 'transformers', 'tutor', '##ials', '!', '!', '!']

The sentence is lowercased first because we are using the uncased model. We can see that the words ["welcome", "to", "transformers"] are present in the tokenizer's vocabulary, but the word "tutorials" is not. Consequently, the tokenizer splits "tutorials" into known subwords: ["tutor", "##ials"]. The "##" prefix indicates that the token should be attached to the previous one without a space when the tokenization is reversed (decoded).
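To see how these subword pieces are put back together, here is a minimal sketch (assuming the same bert_tokenizer object as above) that encodes the sentence to vocabulary ids and decodes it again; decode() merges the "##" pieces back into whole words:

# Encoding maps the text to vocabulary ids and adds the special tokens [CLS] and [SEP]
ids = bert_tokenizer.encode("Welcome to Transformers tutorials!!!")
print(ids)

# Decoding merges the "##" subwords back, recovering the lowercased sentence
print(bert_tokenizer.decode(ids, skip_special_tokens=True))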

In this way, we can use BertTokenizer to perform subword tokenization in transformers.

