How to Preprocess pairs of sentences using transformers?

This recipe shows how to preprocess pairs of sentences with the Hugging Face transformers library.

Recipe Objective - How to Preprocess pairs of sentences in transformers?

A tokenizer is the main tool for preprocessing text for a transformer model. You can create one from the tokenizer class associated with the model you want to use, or directly with the AutoTokenizer class. The tokenizer splits a given text into tokens (words, parts of words, punctuation symbols, etc.) and then converts those tokens into numbers, which can be built into a tensor and fed to the model. It also adds any other inputs the model needs to work properly. When you pass two sentences to a BERT tokenizer such as the one used below, it joins them into a single sequence and returns token_type_ids that mark which tokens belong to the first sentence and which to the second.
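
To make the "text to tokens to numbers" step concrete, here is a minimal pure-Python sketch of greedy longest-match sub-word splitting, similar in spirit to what a WordPiece tokenizer does. The vocabulary `TOY_VOCAB` and the helpers `split_word` and `toy_encode` are hypothetical names invented for illustration; a real pretrained tokenizer uses a vocabulary of many thousands of entries and extra normalization rules.

```python
import re

# Hypothetical toy vocabulary: each token string maps to an integer ID.
# A leading "##" marks a piece that continues the previous piece.
TOY_VOCAB = {"[UNK]": 0, "Hello": 1, "world": 2, "!": 3,
             "token": 4, "##izer": 5, "##s": 6}


def split_word(word, vocab):
    """Split one word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def toy_encode(text, vocab=TOY_VOCAB):
    """Return (tokens, ids) for a text using the toy vocabulary."""
    # Separate words from punctuation symbols.
    words = re.findall(r"\w+|[^\w\s]", text)
    tokens = [piece for word in words for piece in split_word(word, vocab)]
    ids = [vocab[token] for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("Hello tokenizers!")
print(tokens)  # ['Hello', 'token', '##izer', '##s', '!']
print(ids)     # [1, 4, 5, 6, 3]
```

The word "tokenizers" is not in the toy vocabulary as a whole, so it is split into the parts "token", "##izer" and "##s", each of which has its own ID.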

Example of preprocessing pairs of sentences:

# Example 1:
# Importing libraries
from transformers import AutoTokenizer

# Loading the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Passing the pair of sentences to the tokenizer
encoded_value = tokenizer("Hello world!", "Bye world!")

# Printing the encoded values (token IDs, segment IDs and attention mask)
print(encoded_value)

# Decoding the token IDs back into text (a notebook or REPL displays the returned string)
tokenizer.decode(encoded_value["input_ids"])

Output - 
{'input_ids': [101, 8667, 1362, 106, 102, 17774, 1362, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
'[CLS] Hello world! [SEP] Bye world! [SEP]'
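
The layout of the output above can be reproduced by hand. The sketch below is not the library's implementation; it only assumes what the printed output shows: 101 is the ID of [CLS], 102 is the ID of [SEP], the first sentence plus its [SEP] gets segment 0, and the second sentence plus its [SEP] gets segment 1. The function name `build_pair_inputs` is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] as seen in the output above


def build_pair_inputs(ids_a, ids_b):
    """Assemble BERT-style inputs for a sentence pair from plain token IDs."""
    # [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every position holds a real token, so the mask is all ones.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token IDs of "Hello world!" and "Bye world!" taken from the output above
pair = build_pair_inputs([8667, 1362, 106], [17774, 1362, 106])
print(pair)
```

The printed dictionary matches the tokenizer output for Example 1, which shows that token_type_ids is simply a marker for which sentence each position belongs to.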

# Example 2:
# Importing libraries
from transformers import AutoTokenizer

# Loading the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Creating lists of sentences: list1[i] is paired with list2[i]
list1 = ["Hello I'm a first sentence", "Hello I'm second sentence"]
list2 = ["I'm a second half of first sentence", "I'm a second half of second sentence"]

# Passing both lists to the tokenizer
encoded_value = tokenizer(list1, list2)

# Printing the encoded values (token IDs, segment IDs and attention mask)
print(encoded_value)

# Decoding each encoded pair back into text
for input_id in encoded_value["input_ids"]:
    print(tokenizer.decode(input_id))

Output - 
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1148, 5650, 102, 146, 112, 182, 170, 1248, 1544, 1104, 1148, 5650, 102], [101, 8667, 146, 112, 182, 1248, 5650, 102, 146, 112, 182, 170, 1248, 1544, 1104, 1248, 5650, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
[CLS] Hello I'm a first sentence [SEP] I'm a second half of first sentence [SEP]
[CLS] Hello I'm second sentence [SEP] I'm a second half of second sentence [SEP]
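
Note that the two encoded pairs above have different lengths (19 and 18 tokens), so they cannot be stacked into one rectangular tensor as they are. A common fix is to pad the shorter sequences. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption that holds for the BERT vocabulary used here. The small demo batch reuses token IDs from Example 1.

```python
PAD_ID = 0  # assumption: ID of the [PAD] token in the BERT vocabulary


def pad_batch(encoded, pad_id=PAD_ID):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in encoded["input_ids"])
    padded = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ids, types, mask in zip(encoded["input_ids"],
                                encoded["token_type_ids"],
                                encoded["attention_mask"]):
        gap = longest - len(ids)
        padded["input_ids"].append(ids + [pad_id] * gap)
        padded["token_type_ids"].append(types + [0] * gap)
        # A mask value of 0 tells the model to ignore the padded positions.
        padded["attention_mask"].append(mask + [0] * gap)
    return padded


# Illustrative batch: "Hello world!" / "Bye world!" and "Hello" / "Bye"
batch = {
    "input_ids": [[101, 8667, 1362, 106, 102, 17774, 1362, 106, 102],
                  [101, 8667, 102, 17774, 102]],
    "token_type_ids": [[0, 0, 0, 0, 0, 1, 1, 1, 1],
                       [0, 0, 0, 1, 1]],
    "attention_mask": [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                       [1, 1, 1, 1, 1]],
}
padded = pad_batch(batch)
print(padded["input_ids"][1])       # [101, 8667, 102, 17774, 102, 0, 0, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

In practice you would usually not pad by hand: the tokenizer call accepts arguments such as padding=True, truncation=True and return_tensors="pt" to return padded, framework-ready tensors directly.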

In this way, we can preprocess pairs of sentences in transformers.
