Explain working of BERT with the help of an example?

This recipe explains how BERT works with the help of an example.

Recipe Objective

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. It is a natural language processing model proposed by researchers at Google Research in 2018. When it was proposed, it achieved state-of-the-art accuracy on many NLP and NLU benchmarks, including:

General Language Understanding Evaluation (GLUE)

Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0

Situations With Adversarial Generations (SWAG)

It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
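To make "jointly conditioning on both left and right context" concrete, here is a toy sketch in plain Python. It is not BERT itself, and the function names are made up for illustration. It only contrasts the tokens a left-to-right model could use when predicting a hidden word with the tokens a bidirectional encoder could use.

```python
def left_context(tokens, position):
    """Tokens visible to a left-to-right model when predicting `position`."""
    return tokens[:position]


def bidirectional_context(tokens, position):
    """Tokens visible to a bidirectional encoder: everything except `position`."""
    return tokens[:position] + tokens[position + 1:]


tokens = ["the", "bank", "of", "the", "river"]
masked = 1  # index of the hidden word "bank"

print(left_context(tokens, masked))           # ['the']
print(bidirectional_context(tokens, masked))  # ['the', 'of', 'the', 'river']
```

Only the bidirectional view includes "river", which is the word that disambiguates "bank" in this sentence.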

Step 1 - Install BERT and necessary libraries

!pip install bert-for-tf2
!pip install sentencepiece

Step 2 - Set for tensorflow 2.0

try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert

Because this recipe uses TensorFlow 2.x, we select that version first. The %tensorflow_version magic is available in Google Colab notebooks.

Step 3 - Import the necessary libraries

from tensorflow.keras import layers
import bert
import pandas as pd
import tensorflow_hub as hub
import re

Step 4 - Load the Dataset

reviews_data = pd.read_csv("/content/drive/MyDrive/Data sets/IMDB Dataset.csv")
reviews_data.isnull().values.any()
reviews_data.shape
(50000, 2)

For the data, we use the IMDB movie review data set, which contains 50,000 rows and two columns.

Step 5 - Remove punctuation and special characters

def preprocess_text(sen):
    text1 = tags_remove(sen)  # Remove HTML tags
    text1 = re.sub('[^a-zA-Z]', ' ', text1)  # Remove punctuation and numbers
    text1 = re.sub(r"\s+[a-zA-Z]\s+", ' ', text1)  # Remove single characters
    text1 = re.sub(r'\s+', ' ', text1)  # Collapse multiple spaces
    return text1

Here we remove punctuation and special characters from the data set. The reviews also contain HTML tags and extra spaces, which we remove for better results. The helper that strips the HTML tags is defined as follows:

re_tag = re.compile(r'<[^>]+>')

def tags_remove(text2):
    return re_tag.sub('', text2)
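The cleaning steps can be tried on a single string to see what each regular expression does. The sketch below is self-contained and uses illustrative names (clean_review, TAG_PATTERN) and a made-up sample sentence. It assumes the steps run in the order: strip tags, replace non-letters, drop single letters, collapse whitespace.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_review(text):
    """Apply the cleaning steps in order and return the cleaned string."""
    text = TAG_PATTERN.sub("", text)             # drop HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)       # punctuation and digits -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # drop isolated single letters
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text


sample = "Great film!<br />I didn't expect 2 sequels."
print(repr(clean_review(sample)))  # 'Great film didn expect sequels '
```

Note that the contraction "didn't" ends up as "didn" and the pronoun "I" disappears, because the apostrophe is replaced by a space and the remaining single letters are then removed. The same effect is visible in the cleaned review printed in Step 10.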

Step 6 - Clean the text

reviews = []
sentences = list(reviews_data['review'])
for data in sentences:
    reviews.append(preprocess_text(data))

Step 7 - Print the column names

print(reviews_data.columns.values)
['review' 'sentiment']

The reviews_data DataFrame contains two columns, review and sentiment. The review column holds the review text, while the sentiment column holds the sentiment as text.

Step 8 - Unique values of sentiment column

reviews_data.sentiment.unique()
array(['positive', 'negative'], dtype=object)

Step 9 - Convert the sentiment values with integers

import numpy as np

y_var = reviews_data['sentiment']
y_var = np.array(list(map(lambda x: 1 if x == "positive" else 0, y_var)))

Machine learning algorithms work with numeric values, so we need to convert the text labels into integers. Using NumPy together with a lambda function, we map 'positive' to 1 and every other value to 0.
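The mapping rule can be checked on a few values without pandas or NumPy. The sketch below uses a made-up list of sentiments. The stricter dictionary-based variant is an assumption for illustration and is not part of the recipe.

```python
sentiments = ["positive", "negative", "positive"]

# Same rule as the lambda in this step: "positive" -> 1, everything else -> 0.
labels = [1 if s == "positive" else 0 for s in sentiments]
print(labels)  # [1, 0, 1]

# A stricter variant (an assumption, not part of the recipe): an explicit
# mapping raises KeyError on an unexpected value instead of silently using 0.
LABEL_MAP = {"positive": 1, "negative": 0}
strict_labels = [LABEL_MAP[s] for s in sentiments]
print(strict_labels)  # [1, 0, 1]
```

Because the sentiment column only contains 'positive' and 'negative', both variants give the same result here.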

Step 10 - Print the reviews

print(reviews[10])
Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet 

As we can see, the reviews variable contains only the text data, while y_var contains the corresponding labels.

print(y_var[10])
0

Step 11 - Create BERT tokenizer

Tokenizer_Bert = bert.bert_tokenization.FullTokenizer
layer_bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
file_vocab = layer_bert.resolved_object.vocab_file.asset_path.numpy()
lower_case = layer_bert.resolved_object.do_lower_case.numpy()
tokenized_result = Tokenizer_Bert(file_vocab, lower_case)

Above, we take the FullTokenizer class from the bert.bert_tokenization module. We then create a BERT embedding layer by loading the BERT model through hub.KerasLayer. Because the trainable parameter is set to False, the BERT embedding is not trained. Next, we read the path of the BERT vocabulary file from the layer as a NumPy value, along with the flag that says whether the text should be lowercased. Finally, we pass both values, file_vocab and lower_case, to the Tokenizer_Bert class to create the tokenizer object tokenized_result.

Step 12 - Check whether the tokenizer is working

tokenized_result.tokenize("don't be so judgmental")
['don', "'", 't', 'be', 'so', 'judgment', '##al']
tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize("dont be so judgmental"))
[2123, 2102, 2022, 2061, 8689, 2389]
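The split of "judgmental" into 'judgment' and '##al' comes from WordPiece tokenization, which greedily matches the longest sub-word found in the vocabulary and marks continuation pieces with '##'. The following is a minimal sketch of that idea in plain Python. The function name, the toy vocabulary, and its ids are made up for illustration, and the real FullTokenizer also lowercases and splits on punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"be": 0, "so": 1, "judgment": 2, "##al": 3}  # hypothetical ids
tokens = wordpiece("judgmental", toy_vocab)
print(tokens)                          # ['judgment', '##al']
print([toy_vocab[t] for t in tokens])  # [2, 3]
```

Converting tokens to ids is then a plain dictionary lookup, which is what convert_tokens_to_ids does with the real vocabulary file.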

Step 13 - Define a function for single text review

def reviews_tokenize(reviews_text):
    return tokenized_result.convert_tokens_to_ids(tokenized_result.tokenize(reviews_text))

The above function accepts a single text review and returns the ids of the tokenized words in the review.

Step 14 - Tokenize all the reviews in the input dataset

reviews_tokenized = [reviews_tokenize(review) for review in reviews]

Step 15 - Prepare Data for training

reviews_with_len = [[review, y_var[i], len(review)] for i, review in enumerate(reviews_tokenized)]

This script creates a list of lists in which each sublist contains the tokenized review, its label, and the length of the review.

Step 16 - Shuffle the reviews randomly

import random

random.shuffle(reviews_with_len)

We need to shuffle the reviews because the data set contains both positive and negative reviews, with the first half positive and the second half negative. Shuffling ensures that the training batches contain both positive and negative reviews.

Step 17 - Sort the data by the length of reviews

reviews_with_len.sort(key=lambda x: x[2])

Step 18 - Remove the length attribute from all the reviews

sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]
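Steps 15 to 18 can be traced on toy data. The sketch below uses made-up token ids and labels and illustrative variable names. It attaches the length, shuffles, sorts by length, and then drops the length again.

```python
import random

# Toy token-id lists and labels (made up for illustration).
toy_tokenized = [[5, 6, 7, 8], [9], [3, 4], [1, 2, 3], [7, 7]]
toy_labels = [1, 0, 1, 0, 1]

# Attach the label and the length to every review.
with_len = [[ids, toy_labels[i], len(ids)] for i, ids in enumerate(toy_tokenized)]

random.seed(0)            # fixed seed only so the sketch is repeatable
random.shuffle(with_len)  # mix the labels before sorting
with_len.sort(key=lambda item: item[2])  # stable sort: ties keep the shuffled order

# Drop the length again, keeping (token_ids, label) pairs.
pairs = [(item[0], item[1]) for item in with_len]
print([len(ids) for ids, _ in pairs])  # [1, 2, 2, 3, 4]
```

Sorting by length puts reviews of similar size next to each other, so each batch needs little padding. Because Python's sort is stable, reviews of equal length stay in their shuffled order.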

Step 19 - Convert the Data set into tensorflow 2.0-compliant input dataset shape.

Convert_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

Step 20 - Pad our Converted Dataset for each batch

BATCH_SIZE = 32
dataset_batched = Convert_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))
next(iter(dataset_batched))  # print the first batch
[Output omitted: a tuple of two tensors, the padded token ids and the labels of the first batch]

The padding differs from batch to batch, depending on the length of the longest review in the batch. In the first batch, the longest review has 21 tokens, so 0s are appended to every shorter review until its length is also 21.
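The per-batch padding can be sketched without TensorFlow. The helper names and the toy sequences below are made up for illustration. The sketch pads each batch only to the length of its own longest sequence, which is the behaviour described above.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


def batches(sequences, batch_size):
    """Yield consecutive slices of `sequences` with up to `batch_size` items."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


sequences = [[7], [4, 2], [9, 9, 9], [1, 2, 3, 4, 5]]  # already sorted by length
padded = [pad_batch(b) for b in batches(sequences, batch_size=2)]
print(padded[0])  # [[7, 0], [4, 2]]
print(padded[1])  # [[9, 9, 9, 0, 0], [1, 2, 3, 4, 5]]
```

The first batch is padded to length 2 and the second to length 5, so short reviews are not padded to the length of the longest review in the whole data set.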

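Step 21 - Divide the Data set into train and test

import math

TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
# shuffle() returns a new dataset, so the result has to be assigned.
# reshuffle_each_iteration=False keeps the order fixed, so the take/skip
# split below stays disjoint across epochs.
dataset_batched = dataset_batched.shuffle(TOTAL_BATCHES, reshuffle_each_iteration=False)
test_data = dataset_batched.take(TEST_BATCHES)
train_data = dataset_batched.skip(TEST_BATCHES)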
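The batch counts in this step can be worked out by hand. The sketch below repeats the arithmetic with the 50,000 rows reported in Step 4 and the batch size of 32. The lowercase variable names are illustrative.

```python
import math

num_reviews = 50000  # rows reported by reviews_data.shape
batch_size = 32

total_batches = math.ceil(num_reviews / batch_size)
test_batches = total_batches // 10
train_batches = total_batches - test_batches

print(total_batches, test_batches, train_batches)  # 1563 156 1407
```

These counts match the 1407 steps per epoch in the training log and the 156 steps in the evaluation log.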

Step 22 - Create the model

class TEXT_MODEL(tf.keras.Model):

    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)

        self.embedding = layers.Embedding(vocabulary_size, embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters, kernel_size=2,
                                        padding="valid", activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters, kernel_size=3,
                                        padding="valid", activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters, kernel_size=4,
                                        padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPool1D()

        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1, activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes, activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)

        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)

        return model_output
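The model runs three convolutions with kernel sizes 2, 3, and 4 and "valid" padding over the embedded review, then reduces each result with global max pooling. The following plain-Python sketch shows why this yields a fixed-size vector for any review length. It is a simplification with one input channel, one filter per kernel size, and made-up all-ones kernels, not the Keras layers themselves.

```python
def conv1d_valid(values, kernel):
    """1-D convolution (cross-correlation) without padding, followed by ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(values) - k + 1):
        total = sum(values[i + j] * kernel[j] for j in range(k))
        out.append(max(0.0, total))
    return out


sequence = [0.5, -1.0, 2.0, 0.0, 1.5]  # one embedding channel, 5 time steps
features = []
for kernel in ([1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]):
    activations = conv1d_valid(sequence, kernel)
    print(len(activations))            # 4, then 3, then 2
    features.append(max(activations))  # global max pooling -> one number

print(features)  # [2.0, 3.5, 2.5]
```

With "valid" padding, an input of length L and a kernel of size k give L - k + 1 outputs. Global max pooling keeps one value per filter, so concatenating the three branches in the model gives 3 * cnn_filters values per review, regardless of how long the padded batch is.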

Step 23 - Define the values for hyperparameters

VOCAB_LENGTH = len(tokenized_result.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5

Step 24 - Create a Text model and pass hyperparameters values

My_text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                           embedding_dimensions=EMB_DIM,
                           cnn_filters=CNN_FILTERS,
                           dnn_units=DNN_UNITS,
                           model_output_classes=OUTPUT_CLASSES,
                           dropout_rate=DROPOUT_RATE)

Step 25 - Compile the model

if OUTPUT_CLASSES == 2:
    My_text_model.compile(loss="binary_crossentropy",
                          optimizer="adam",
                          metrics=["accuracy"])
else:
    My_text_model.compile(loss="sparse_categorical_crossentropy",
                          optimizer="adam",
                          metrics=["sparse_categorical_accuracy"])
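For the two-class case, the model is compiled with the binary cross-entropy loss. As a rough illustration of what that loss measures, here is the formula in plain Python. The function is a sketch with made-up example probabilities. The Keras implementation works on tensors and may differ in details such as the clipping constant.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # predictions close to the true labels
unsure = [0.6, 0.5, 0.5]     # predictions close to 0.5

print(binary_crossentropy(labels, confident) < binary_crossentropy(labels, unsure))  # True
```

The loss is small when the sigmoid output is close to the true label and grows as the prediction moves away from it, which is why it decreases from epoch to epoch in the training log.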

Step 26 - Train the model

My_text_model.fit(train_data, epochs=NB_EPOCHS)
Epoch 1/5
1407/1407 [==============================] - 443s 315ms/step - loss: 0.3064 - accuracy: 0.8642
Epoch 2/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.1325 - accuracy: 0.9514
Epoch 3/5
1407/1407 [==============================] - 439s 312ms/step - loss: 0.0679 - accuracy: 0.9756
Epoch 4/5
1407/1407 [==============================] - 445s 316ms/step - loss: 0.0397 - accuracy: 0.9858
Epoch 5/5
1407/1407 [==============================] - 446s 317ms/step - loss: 0.0216 - accuracy: 0.9927

Step 27 - Print the results

results = My_text_model.evaluate(test_data)
print(results)
156/156 [==============================] - 4s 27ms/step - loss: 0.4444 - accuracy: 0.8990
[0.4443937838077545, 0.8990384340286255]

From the above results, we can see that the model reaches an accuracy of about 90% (0.899) on the test data.
