Explain working of BERT with the help of an example?

This recipe explains the working of BERT with the help of an example.


Recipe Objective

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is a natural language processing model proposed by researchers at Google Research in 2018. When it was proposed, it achieved state-of-the-art accuracy on many NLP and NLU benchmarks, including:

General Language Understanding Evaluation

Stanford Q/A dataset SQuAD v1.1 and v2.0

Situation With Adversarial Generations

It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
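The bidirectional conditioning idea can be illustrated with a toy sketch (this is not the real model, just plain Python showing what "both left and right context" means for BERT's masked language modeling objective; the sentence is made up):

```python
# Toy illustration of BERT's masked language modeling (MLM) objective:
# hide a token and predict it from BOTH its left and right context.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
mask_index = 2  # hide "sat"

masked = sentence.copy()
masked[mask_index] = "[MASK]"

# A unidirectional (left-to-right) model would only see this context:
left_context = masked[:mask_index]  # ['the', 'cat']

# BERT conditions on both sides of the mask simultaneously:
bidirectional_context = masked[:mask_index] + masked[mask_index + 1:]

print(masked)                  # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(bidirectional_context)   # ['the', 'cat', 'on', 'the', 'mat']
```

Predicting "sat" is far easier given both "the cat" and "on the mat" than given the left side alone, which is why the bidirectional pre-training objective helps.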

Step 1 - Install BERT and necessary libraries

!pip install bert-for-tf2
!pip install sentencepiece

Step 2 - Set up tensorflow 2.0

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

As we are going to work with tensorflow 2.0, we need to make sure the runtime is set to the required version.

Step 3 - Import the necessary libraries

from tensorflow.keras import layers
import bert
import pandas as pd
import tensorflow_hub as hub
import re

Step 4 - Load the Dataset

reviews_data = pd.read_csv("/content/drive/MyDrive/Data sets/IMDB Dataset.csv")
reviews_data.isnull().values.any()
reviews_data.shape
(50000, 2)

For the data we are going to use the IMDB movie ratings dataset.

Step 5 - Remove punctuation and special characters

re_tag = re.compile(r'<[^>]+>')

def remove_tags(text2):
    return re_tag.sub('', text2)

def Mytext_preprocess(sentnc):
    text1 = remove_tags(sentnc)  # Remove html tags
    text1 = re.sub('[^a-zA-Z]', ' ', text1)  # Remove punctuation and numbers
    text1 = re.sub(r"\s+[a-zA-Z]\s+", ' ', text1)  # Single character removal
    text1 = re.sub(r'\s+', ' ', text1)  # Remove multiple spaces
    return text1

Here we remove punctuation and special characters from our dataset. The reviews also contain html tags and extra spaces, so we remove those as well for better results.
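A self-contained sketch of the same cleaning pipeline applied to a made-up review string (the helper name clean_review and the sample text are ours, for illustration only):

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_review(text):
    text = TAG_RE.sub('', text)                   # strip html tags
    text = re.sub('[^a-zA-Z]', ' ', text)         # keep letters only
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)   # drop stray single letters
    text = re.sub(r'\s+', ' ', text)              # collapse repeated spaces
    return text.strip()

sample = "<br />I rated it a 9 - it's great!"
print(clean_review(sample))  # I rated it it great
```

Note how the digit, the dash, the apostrophe, and the stray single letters left behind by the punctuation pass all disappear.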

Step 6 - Clean the text

movie_reviews = []
sentences = list(reviews_data['review'])
for data in sentences:
    movie_reviews.append(Mytext_preprocess(data))

Step 7 - Print the column names

reviews_data.columns.values
['review' 'sentiment']

The reviews_data frame here contains two columns, review and sentiment. The review column contains the text of each review, while the sentiment column contains the sentiment as text ('positive' or 'negative').

Step 8 - Unique values of sentiment column

reviews_data['sentiment'].unique()
array(['positive', 'negative'], dtype=object)

Step 9 - Convert the sentiment values with integers

import numpy as np
y_var = reviews_data['sentiment']
y_var = np.array(list(map(lambda x: 1 if x=="positive" else 0, y_var)))

As we know, algorithms work with integer values, so we need to convert the text labels into integers. For that we use numpy, and with the help of a lambda function we convert the 'positive' label to 1 and everything else to 0.
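The same mapping can be sketched without pandas or numpy on a few made-up labels:

```python
# Minimal sketch of the label mapping: 'positive' -> 1, anything else -> 0
labels = ["positive", "negative", "negative", "positive"]
y = [1 if x == "positive" else 0 for x in labels]
print(y)  # [1, 0, 0, 1]
```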

Step 10 - Print the reviews

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 

As we can see, the movie_reviews variable consists of only text data, while y_var contains the corresponding labels.


Step 11 - Create BERT tokenizer

Tokenizer_Bert = bert.bert_tokenization.FullTokenizer
layer_bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
file_vocab = layer_bert.resolved_object.vocab_file.asset_path.numpy()
lower_case = layer_bert.resolved_object.do_lower_case.numpy()
tokenized_result = Tokenizer_Bert(file_vocab, lower_case)

Above, we take the FullTokenizer class from the bert.bert_tokenization module. Then, by loading the BERT model through hub.KerasLayer, we create a BERT embedding layer. We will not be training the BERT embeddings, as the trainable parameter is set to False. After that we read the path to the BERT vocabulary file and the do_lower_case flag from the layer, and finally we pass the vocabulary file (file_vocab) and the lowercasing flag (lower_case) to Tokenizer_Bert to create the tokenizer object.

Step 12 - Check the tokenizer is working or not

tokenized_result.tokenize("don't be so judgmental")
['don', "'", 't', 'be', 'so', 'judgment', '##al']
tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize("dont be so judgmental"))
[2123, 2102, 2022, 2061, 8689, 2389]
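The ## pieces in the output come from BERT's WordPiece tokenizer, which splits unknown words into known sub-words by greedy longest-match-first lookup. Here is a toy sketch of that idea against a tiny hand-made vocabulary (not BERT's real 30k-entry vocabulary, and a simplification of the real algorithm):

```python
# Toy WordPiece-style tokenization: greedy longest-match-first against
# a tiny hand-made vocabulary. Continuation pieces carry a "##" prefix.
vocab = {"don", "##t", "judgment", "##al", "be", "so"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # find the longest vocab entry matching at position `start`
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("dont"))        # ['don', '##t']
print(wordpiece("judgmental"))  # ['judgment', '##al']
```

This is how "judgmental", which is not in the vocabulary as a whole word, still gets a meaningful representation built from "judgment" and "##al".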

Step 13 - Define a function for single text review

def reviews_tokenize(reviews_text):
    return tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize(reviews_text))

The above function accepts a single text review and returns the ids of the tokenized words in the review.

Step 14 - Tokenize all the reviews in the input dataset

reviews_tokenized = [reviews_tokenize(review) for review in movie_reviews]

Step 15 - Prepare Data for training

reviews_with_len = [[review, y_var[i], len(review)] for i, review in enumerate(reviews_tokenized)]

The script above creates a list of lists, where each sublist contains the tokenized review, its label, and the length of the review.
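On a couple of made-up token-id lists and labels, the resulting structure looks like this:

```python
# Sketch of the list-of-lists structure on fake token ids and labels
reviews_tokenized = [[2023, 2003, 2307], [6659, 3185]]
y_var = [1, 0]

reviews_with_len = [[review, y_var[i], len(review)]
                    for i, review in enumerate(reviews_tokenized)]
print(reviews_with_len)
# [[[2023, 2003, 2307], 1, 3], [[6659, 3185], 0, 2]]
```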

Step 16 - Shuffle the reviews randomly

import random
random.shuffle(reviews_with_len)

We need to shuffle the reviews randomly because in our dataset the first half of the reviews is positive while the last half is negative. In order to have both positive and negative reviews in the training batches, we therefore shuffle the reviews.
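Because each sublist carries its own label, shuffling the outer list can never separate a review from its label. A quick sketch on toy data:

```python
import random

# Toy [tokens, label, length] sublists; the label travels with its review,
# so shuffling the outer list keeps every pairing intact.
random.seed(0)
reviews_with_len = [[[10, 11], 1, 2], [[20], 0, 1], [[30, 31, 32], 1, 3]]
random.shuffle(reviews_with_len)

# Order changed, but each review still sits next to its own label/length:
for review, label, length in reviews_with_len:
    assert len(review) == length
print(reviews_with_len)
```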

Step 17 - Sort the data by the length of reviews

reviews_with_len.sort(key=lambda x: x[2])

Step 18 - Remove the length attribute from all the reviews

sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

Step 19 - Convert the Data set into tensorflow 2.0-compliant input dataset shape.

Convert_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

Step 20 - Pad our Converted Dataset for each batch

BATCH_SIZE = 32
dataset_batched = Convert_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
next(iter(dataset_batched))  # print the first batch

The padding differs from batch to batch, depending on the size of the largest sentence in each batch. For example, if the longest sentence in a batch has 21 tokens, 0s are appended at the end of all shorter sentences in that batch so that their total length is also 21.
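Per-batch padding, and the reason we sorted by length first, can be sketched in plain Python (no tensorflow; the token ids below are made up):

```python
# Pad every sequence in a batch to the length of its longest member
def pad_batch(batch):
    longest = max(len(seq) for seq in batch)
    return [seq + [0] * (longest - len(seq)) for seq in batch]

reviews = [[1, 2, 3, 4, 5], [6], [7, 8], [9, 10, 11]]

# Batching in the original order vs. after sorting by length
unsorted_batches = [reviews[:2], reviews[2:]]
sorted_reviews = sorted(reviews, key=len)
sorted_batches = [sorted_reviews[:2], sorted_reviews[2:]]

def padding_cost(batches):
    # total number of padding zeros added across all batches
    return sum(seq.count(0) for batch in batches for seq in pad_batch(batch))

print(padding_cost(unsorted_batches))  # 5
print(padding_cost(sorted_batches))    # 3
```

Sorting groups sequences of similar length into the same batch, so far fewer padding tokens are needed and less computation is wasted.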

Step 21 - Divide the Data set into train and test

import math
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
dataset_batched = dataset_batched.shuffle(TOTAL_BATCHES)
test_data = dataset_batched.take(TEST_BATCHES)
train_data = dataset_batched.skip(TEST_BATCHES)

Step 22 - Create the model

class TEXT_MODEL(tf.keras.Model):
    def __init__(self, vocabulary_size, embedding_dimensions=128, cnn_filters=50,
                 dnn_units=512, model_output_classes=2, dropout_rate=0.1,
                 training=False, name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        self.embedding = layers.Embedding(vocabulary_size, embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters, kernel_size=2, padding="valid", activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters, kernel_size=3, padding="valid", activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters, kernel_size=4, padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes, activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        return model_output

Step 23 - Define the values for hyperparameters

VOCAB_LENGTH = len(tokenized_result.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

Step 24 - Create a Text model and pass hyperparameters values

My_text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                           embedding_dimensions=EMB_DIM,
                           cnn_filters=CNN_FILTERS,
                           dnn_units=DNN_UNITS,
                           model_output_classes=OUTPUT_CLASSES,
                           dropout_rate=DROPOUT_RATE)

Step 25 - Compile the model

if OUTPUT_CLASSES == 2:
    My_text_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
else:
    My_text_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])

Step 26 - Train the model

My_text_model.fit(train_data, epochs=NB_EPOCHS)
Epoch 1/5
1407/1407 [==============================] - 443s 315ms/step - loss: 0.3064 - accuracy: 0.8642
Epoch 2/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.1325 - accuracy: 0.9514
Epoch 3/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.0679 - accuracy: 0.9756
Epoch 4/5
1407/1407 [==============================] - 445s 316ms/step - loss: 0.0397 - accuracy: 0.9858
Epoch 5/5
1407/1407 [==============================] - 446s 317ms/step - loss: 0.0216 - accuracy: 0.9927

Step 27 - Print the results

results = My_text_model.evaluate(test_data)
print(results)
156/156 [==============================] - 4s 27ms/step - loss: 0.4444 - accuracy: 0.8990
[0.4443937838077545, 0.8990384340286255]

From the above results we can see that we got a test accuracy of about 90%.
