How to Preprocess pairs of sentences using transformers?

This recipe helps you to preprocess pairs of sentences using transformers.

Recipe Objective - How to Preprocess pairs of sentences in transformers?

A tokenizer is the most important tool for preprocessing of data. You can create one by utilizing the tokenizer class related to the model you would like to utilize, or by using the AutoTokenizer class directly. The tokenizer will separate a given text into tokens (words or parts of words, punctuation symbols, etc.). It will then transform those tokens into numbers so that it can construct a tensor out of them and feed it to the model. It will also provide any other inputs that the model may require to function effectively.

For more related projects -

/projects/data-science-projects/keras-deep-learning-projects
/projects/data-science-projects/tensorflow-projects

Example of preprocessing pairs of sentences:

# Example 1:
# Importing libraries
from transformers import AutoTokenizer

# Loading model
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Passing input to model
encoded_value = tokenizer("Hello world!","Bye world!")

# Printing the tokens(encoded values)
print(encoded_value)

# Decoding the encoded values to get input back
tokenizer.decode(encoded_value["input_ids"])

Output - 
{'input_ids': [101, 8667, 1362, 106, 102, 17774, 1362, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
'[CLS] Hello world! [SEP] Bye world! [SEP]'

# Example 2:
# Importing libraries
from transformers import AutoTokenizer

# Loading model
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Creating lists of sentences
list1 = ["Hello I'm a first sentence", "Hello I'm second sentence"]
list2 = ["I'm a second half of first sentence", "I'm a second half of second sentence"]

# Passing input to model
encoded_value = tokenizer(list1,list2)

# Printing the tokens(encoded values)
print(encoded_value)

# Decoding the encoded values to get input back
for input_id in encoded_value["input_ids"]:
print(tokenizer.decode(input_id))

Output - 
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1148, 5650, 102, 146, 112, 182, 170, 1248, 1544, 1104, 1148, 5650, 102], [101, 8667, 146, 112, 182, 1248, 5650, 102, 146, 112, 182, 170, 1248, 1544, 1104, 1248, 5650, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
[CLS] Hello I'm a first sentence [SEP] I'm a second half of first sentence [SEP]
[CLS] Hello I'm second sentence [SEP] I'm a second half of second sentence [SEP]

In this way, we can preprocess pairs of sentences in transformers.

What Users are saying..

profile image

Gautam Vermani

Data Consultant at Confidential
linkedin profile url

Having worked in the field of Data Science, I wanted to explore how I can implement projects in other domains, So I thought of connecting with ProjectPro. A project that helped me absorb this topic... Read More

Relevant Projects

Recommender System Machine Learning Project for Beginners-3
Content Based Recommender System Project - Building a Content-Based Product Recommender App with Streamlit

Build a Multi Class Image Classification Model Python using CNN
This project explains How to build a Sequential Model that can perform Multi Class Image Classification in Python using CNN

Build a Graph Based Recommendation System in Python-Part 2
In this Graph Based Recommender System Project, you will build a recommender system project for eCommerce platforms and learn to use FAISS for efficient similarity search.

Deep Learning Project for Beginners with Source Code Part 1
Learn to implement deep neural networks in Python .

Build OCR from Scratch Python using YOLO and Tesseract
In this deep learning project, you will learn how to build your custom OCR (optical character recognition) from scratch by using Google Tesseract and YOLO to read the text from any images.

Build a Multi Touch Attribution Machine Learning Model in Python
Identifying the ROI on marketing campaigns is an essential KPI for any business. In this ML project, you will learn to build a Multi Touch Attribution Model in Python to identify the ROI of various marketing efforts and their impact on conversions or sales..

Build CNN for Image Colorization using Deep Transfer Learning
Image Processing Project -Train a model for colorization to make grayscale images colorful using convolutional autoencoders.

Learn Object Tracking (SOT, MOT) using OpenCV and Python
Get Started with Object Tracking using OpenCV and Python - Learn to implement Multiple Instance Learning Tracker (MIL) algorithm, Generic Object Tracking Using Regression Networks Tracker (GOTURN) algorithm, Kernelized Correlation Filters Tracker (KCF) algorithm, Tracking, Learning, Detection Tracker (TLD) algorithm for single and multiple object tracking from various video clips.

Build an AI Chatbot from Scratch using Keras Sequential Model
In this NLP Project, you will learn how to build an AI Chatbot from Scratch using Keras Sequential Model.

Learn to Build an End-to-End Machine Learning Pipeline - Part 1
In this Machine Learning Project, you will learn how to build an end-to-end machine learning pipeline for predicting truck delays, addressing a major challenge in the logistics industry.