Explain working of BERT with the help of an example?

This recipe explains the working of BERT with the help of an example.


Recipe Objective

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is a natural language processing model proposed by researchers at Google Research in 2018. When it was proposed, it achieved state-of-the-art accuracy on many NLP and NLU benchmarks, including:

General Language Understanding Evaluation

Stanford Q/A dataset SQuAD v1.1 and v2.0

Situation With Adversarial Generations

It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
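The bidirectional conditioning idea can be illustrated with a toy sketch (this is not the real model, just plain Python showing what "both left and right context" means for BERT's masked language modeling objective; the sentence is made up):

```python
# Toy illustration of BERT's masked language modeling (MLM) objective:
# hide a token and predict it from BOTH its left and right context.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
mask_index = 2  # hide "sat"

masked = sentence.copy()
masked[mask_index] = "[MASK]"

# A unidirectional (left-to-right) model would only see this context:
left_context = masked[:mask_index]  # ['the', 'cat']

# BERT conditions on both sides of the mask simultaneously:
bidirectional_context = masked[:mask_index] + masked[mask_index + 1:]

print(masked)                  # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(bidirectional_context)   # ['the', 'cat', 'on', 'the', 'mat']
```

Predicting "sat" is far easier given both "the cat" and "on the mat" than given the left side alone, which is why the bidirectional pre-training objective helps.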

Step 1 - Install BERT and necessary libraries

!pip install bert-for-tf2
!pip install sentencepiece

Step 2 - Set up tensorflow 2.0

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

As we are going to work with tensorflow 2.0, we need to make sure the runtime is set to the required version.

Step 3 - Import the necessary libraries

from tensorflow.keras import layers
import bert
import pandas as pd
import tensorflow_hub as hub
import re

Step 4 - Load the Dataset

reviews_data = pd.read_csv("/content/drive/MyDrive/Data sets/IMDB Dataset.csv")
reviews_data.isnull().values.any()
reviews_data.shape
(50000, 2)

For the data we are going to use the IMDB movie ratings dataset.

Step 5 - Remove punctuation and special characters

re_tag = re.compile(r'<[^>]+>')

def remove_tags(text2):
    return re_tag.sub('', text2)

def Mytext_preprocess(sentnc):
    text1 = remove_tags(sentnc)  # Remove html tags
    text1 = re.sub('[^a-zA-Z]', ' ', text1)  # Remove punctuation and numbers
    text1 = re.sub(r"\s+[a-zA-Z]\s+", ' ', text1)  # Single character removal
    text1 = re.sub(r'\s+', ' ', text1)  # Remove multiple spaces
    return text1

Here we remove punctuation and special characters from our dataset. The reviews also contain html tags and extra spaces, so we remove those as well for better results.
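A self-contained sketch of the same cleaning pipeline applied to a made-up review string (the helper name clean_review and the sample text are ours, for illustration only):

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_review(text):
    text = TAG_RE.sub('', text)                   # strip html tags
    text = re.sub('[^a-zA-Z]', ' ', text)         # keep letters only
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)   # drop stray single letters
    text = re.sub(r'\s+', ' ', text)              # collapse repeated spaces
    return text.strip()

sample = "<br />I rated it a 9 - it's great!"
print(clean_review(sample))  # I rated it it great
```

Note how the digit, the dash, the apostrophe, and the stray single letters left behind by the punctuation pass all disappear.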

Step 6 - Clean the text

movie_reviews = []
sentences = list(reviews_data['review'])
for data in sentences:
    movie_reviews.append(Mytext_preprocess(data))

Step 7 - Print the column names

reviews_data.columns.values
['review' 'sentiment']

The reviews_data frame here contains two columns, review and sentiment. The review column contains the text of each review, while the sentiment column contains the sentiment as text ('positive' or 'negative').

Step 8 - Unique values of sentiment column

reviews_data['sentiment'].unique()
array(['positive', 'negative'], dtype=object)

Step 9 - Convert the sentiment values with integers

import numpy as np
y_var = reviews_data['sentiment']
y_var = np.array(list(map(lambda x: 1 if x=="positive" else 0, y_var)))

As we know, algorithms work with integer values, so we need to convert the text labels into integers. For that we use numpy, and with the help of a lambda function we convert the 'positive' label to 1 and everything else to 0.
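The same mapping can be sketched without pandas or numpy on a few made-up labels:

```python
# Minimal sketch of the label mapping: 'positive' -> 1, anything else -> 0
labels = ["positive", "negative", "negative", "positive"]
y = [1 if x == "positive" else 0 for x in labels]
print(y)  # [1, 0, 0, 1]
```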

Step 10 - Print the reviews

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 

As we can see, the movie_reviews variable consists of only text data, while y_var contains the corresponding labels.


Step 11 - Create BERT tokenizer

Tokenizer_Bert = bert.bert_tokenization.FullTokenizer
layer_bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
file_vocab = layer_bert.resolved_object.vocab_file.asset_path.numpy()
lower_case = layer_bert.resolved_object.do_lower_case.numpy()
tokenized_result = Tokenizer_Bert(file_vocab, lower_case)

Above, we take the FullTokenizer class from the bert.bert_tokenization module. Then, by loading the BERT model through hub.KerasLayer, we create a BERT embedding layer. We will not be training the BERT embeddings, as the trainable parameter is set to False. After that we read the path to the BERT vocabulary file and the do_lower_case flag from the layer, and finally we pass the vocabulary file (file_vocab) and the lowercasing flag (lower_case) to Tokenizer_Bert to create the tokenizer object.

Step 12 - Check the tokenizer is working or not

tokenized_result.tokenize("don't be so judgmental")
['don', "'", 't', 'be', 'so', 'judgment', '##al']
tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize("dont be so judgmental"))
[2123, 2102, 2022, 2061, 8689, 2389]
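The ## pieces in the output come from BERT's WordPiece tokenizer, which splits unknown words into known sub-words by greedy longest-match-first lookup. Here is a toy sketch of that idea against a tiny hand-made vocabulary (not BERT's real 30k-entry vocabulary, and a simplification of the real algorithm):

```python
# Toy WordPiece-style tokenization: greedy longest-match-first against
# a tiny hand-made vocabulary. Continuation pieces carry a "##" prefix.
vocab = {"don", "##t", "judgment", "##al", "be", "so"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # find the longest vocab entry matching at position `start`
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece("dont"))        # ['don', '##t']
print(wordpiece("judgmental"))  # ['judgment', '##al']
```

This is how "judgmental", which is not in the vocabulary as a whole word, still gets a meaningful representation built from "judgment" and "##al".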

Step 13 - Define a function for single text review

def reviews_tokenize(reviews_text):
    return tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize(reviews_text))

The above function accepts a single text review and returns the ids of the tokenized words in the review.

Step 14 - Tokenize all the reviews in the input dataset

reviews_tokenized = [reviews_tokenize(review) for review in movie_reviews]

Step 15 - Prepare Data for training

reviews_with_len = [[review, y_var[i], len(review)] for i, review in enumerate(reviews_tokenized)]

The script above creates a list of lists, where each sublist contains the tokenized review, its label, and the length of the review.
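On a couple of made-up token-id lists and labels, the resulting structure looks like this:

```python
# Sketch of the list-of-lists structure on fake token ids and labels
reviews_tokenized = [[2023, 2003, 2307], [6659, 3185]]
y_var = [1, 0]

reviews_with_len = [[review, y_var[i], len(review)]
                    for i, review in enumerate(reviews_tokenized)]
print(reviews_with_len)
# [[[2023, 2003, 2307], 1, 3], [[6659, 3185], 0, 2]]
```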

Step 16 - Shuffle the reviews randomly

import random
random.shuffle(reviews_with_len)

We need to shuffle the reviews randomly because in our dataset the first half of the reviews is positive while the last half is negative. In order to have both positive and negative reviews in the training batches, we therefore shuffle the reviews.
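Because each sublist carries its own label, shuffling the outer list can never separate a review from its label. A quick sketch on toy data:

```python
import random

# Toy [tokens, label, length] sublists; the label travels with its review,
# so shuffling the outer list keeps every pairing intact.
random.seed(0)
reviews_with_len = [[[10, 11], 1, 2], [[20], 0, 1], [[30, 31, 32], 1, 3]]
random.shuffle(reviews_with_len)

# Order changed, but each review still sits next to its own label/length:
for review, label, length in reviews_with_len:
    assert len(review) == length
print(reviews_with_len)
```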

Step 17 - Sort the data by the length of reviews

reviews_with_len.sort(key=lambda x: x[2])

Step 18 - Remove the length attribute from all the reviews

sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

Step 19 - Convert the Data set into tensorflow 2.0-compliant input dataset shape.

Convert_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

Step 20 - Pad our Converted Dataset for each batch

BATCH_SIZE = 32
dataset_batched = Convert_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
next(iter(dataset_batched))  # print the first batch

The padding differs from batch to batch, depending on the size of the largest sentence in each batch. For example, if the longest sentence in a batch has 21 tokens, 0s are appended at the end of all shorter sentences in that batch so that their total length is also 21.
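Per-batch padding, and the reason we sorted by length first, can be sketched in plain Python (no tensorflow; the token ids below are made up):

```python
# Pad every sequence in a batch to the length of its longest member
def pad_batch(batch):
    longest = max(len(seq) for seq in batch)
    return [seq + [0] * (longest - len(seq)) for seq in batch]

reviews = [[1, 2, 3, 4, 5], [6], [7, 8], [9, 10, 11]]

# Batching in the original order vs. after sorting by length
unsorted_batches = [reviews[:2], reviews[2:]]
sorted_reviews = sorted(reviews, key=len)
sorted_batches = [sorted_reviews[:2], sorted_reviews[2:]]

def padding_cost(batches):
    # total number of padding zeros added across all batches
    return sum(seq.count(0) for batch in batches for seq in pad_batch(batch))

print(padding_cost(unsorted_batches))  # 5
print(padding_cost(sorted_batches))    # 3
```

Sorting groups sequences of similar length into the same batch, so far fewer padding tokens are needed and less computation is wasted.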

Step 21 - Divide the Data set into train and test

import math
TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
dataset_batched = dataset_batched.shuffle(TOTAL_BATCHES)
test_data = dataset_batched.take(TEST_BATCHES)
train_data = dataset_batched.skip(TEST_BATCHES)

Step 22 - Create the model

class TEXT_MODEL(tf.keras.Model):
    def __init__(self, vocabulary_size, embedding_dimensions=128, cnn_filters=50,
                 dnn_units=512, model_output_classes=2, dropout_rate=0.1,
                 training=False, name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        self.embedding = layers.Embedding(vocabulary_size, embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters, kernel_size=2, padding="valid", activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters, kernel_size=3, padding="valid", activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters, kernel_size=4, padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes, activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        return model_output

Step 23 - Define the values for hyperparameters

VOCAB_LENGTH = len(tokenized_result.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

Step 24 - Create a Text model and pass hyperparameters values

My_text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                           embedding_dimensions=EMB_DIM,
                           cnn_filters=CNN_FILTERS,
                           dnn_units=DNN_UNITS,
                           model_output_classes=OUTPUT_CLASSES,
                           dropout_rate=DROPOUT_RATE)

Step 25 - Compile the model

if OUTPUT_CLASSES == 2:
    My_text_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
else:
    My_text_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])

Step 26 - Train the model

My_text_model.fit(train_data, epochs=NB_EPOCHS)
Epoch 1/5
1407/1407 [==============================] - 443s 315ms/step - loss: 0.3064 - accuracy: 0.8642
Epoch 2/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.1325 - accuracy: 0.9514
Epoch 3/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.0679 - accuracy: 0.9756
Epoch 4/5
1407/1407 [==============================] - 445s 316ms/step - loss: 0.0397 - accuracy: 0.9858
Epoch 5/5
1407/1407 [==============================] - 446s 317ms/step - loss: 0.0216 - accuracy: 0.9927

Step 27 - Print the results

results = My_text_model.evaluate(test_data)
print(results)
156/156 [==============================] - 4s 27ms/step - loss: 0.4444 - accuracy: 0.8990
[0.4443937838077545, 0.8990384340286255]

From the above results we can see that we got a test accuracy of about 90%.
