How does BertTokenizer work in transformers?

This recipe explains how BertTokenizer works in transformers.

Recipe Objective - How does BertTokenizer work in transformers?

BertTokenizer is based on WordPiece, a subword tokenization method. Subword tokenization methods work on the idea that frequently used words should be kept whole, while rare words should be split into smaller, meaningful subwords.
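To make the idea concrete, here is a minimal sketch of a greedy longest-match-first split, which is the general approach WordPiece-style tokenizers take for a single word. It is not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration (the real bert-base-uncased vocabulary has about 30,000 entries). Continuation pieces carry a "##" prefix.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subwords (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces after the first carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for this example.
toy_vocab = {"welcome", "to", "transformers", "tutor", "##ials", "!"}

print(wordpiece_tokenize("welcome", toy_vocab))    # word is in the vocabulary, kept whole
print(wordpiece_tokenize("tutorials", toy_vocab))  # word is split into known subwords
print(wordpiece_tokenize("xyz", toy_vocab))        # nothing matches, falls back to [UNK]
```

With this toy vocabulary, a word that appears in the vocabulary is returned unchanged, while a word that does not is broken into the longest pieces that do.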

Example of BertTokenizer:

# Importing BertTokenizer
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing input
bert_tokenizer.tokenize("Welcome to Transformers tutorials!!!")

Output - 
['welcome', 'to', 'transformers', 'tutor', '##ials', '!', '!', '!']

The sentence is lowercased first because we are using the uncased model. The words "welcome", "to", and "transformers" are present in the tokenizer’s vocabulary, but the word "tutorials" is not. Consequently, the tokenizer splits "tutorials" into the known subwords "tutor" and "##ials". The "##" prefix marks a token as a continuation of the previous token, so when decoding (reversing the tokenization) it is attached to that token without a space. Each "!" is also split off as its own token.
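The role of "##" during decoding can be sketched in plain Python. The helper name `merge_wordpieces` is hypothetical and is not part of the transformers library. It only shows the joining rule described above and does not try to restore the original spacing around punctuation or the original casing.

```python
def merge_wordpieces(tokens):
    """Attach each '##' continuation piece to the token before it (illustrative sketch)."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the "##" marker and join without a space
        else:
            words.append(token)
    return words


tokens = ['welcome', 'to', 'transformers', 'tutor', '##ials', '!', '!', '!']
print(merge_wordpieces(tokens))
# ['welcome', 'to', 'transformers', 'tutorials', '!', '!', '!']
```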

In this way, we can use BertTokenizer to tokenize text in transformers.
