From visual search for improved product discoverability to face recognition on social networks, image classification is fueling a visual revolution online. Image classification, a subfield of computer vision, assigns labels to images using trained algorithms. It had its eureka moment back in 2012, when AlexNet won the ImageNet challenge, and the field has grown rapidly ever since. We humans take our ability to classify the objects around us for granted, because our brains have been trained unconsciously on such images all our lives, but the problem is not that easy for a machine. Several factors, such as viewpoint variation, size variation, occlusion (objects partially hidden by or blending into other objects in the image), and differences in the direction and source of light, make it difficult for machines to classify images correctly. Nonetheless, it is an exciting and growing field, and there is hardly a better way to learn the basics of image classification than to classify images in the MNIST dataset.
Before we go any further, let's see what the MNIST dataset is.
MNIST stands for Modified National Institute of Standards and Technology and is a database of 70,000 small, square 28x28 pixel grayscale images of handwritten digits, split into 60,000 training images and 10,000 test images. It is the most widely used dataset for learning image recognition. It is labeled, in the sense that each image of a handwritten digit has the corresponding numeral value attached to it. This helps our algorithm or neural network learn which image stands for which number (0-9) and pick up hidden patterns in human handwriting.
While the handwritten-digit MNIST is the most popular one, here are 6 extended variations of MNIST:
1) Fashion MNIST: This dataset from Zalando Research contains images of 10 classes consisting of clothing apparel and accessories like ankle boots, bags, coats, dresses, pullovers, sandals, shirts, sneakers, etc. instead of handwritten digits. The images are grayscale just like the original MNIST.
2) 3D MNIST: While the original MNIST has flat 28x28 grayscale (one channel) images, 3D MNIST represents the digits as 3D point clouds. It provides a good way to start with 3D computer vision problems.
3) EMNIST: EMNIST extends MNIST with handwritten letters, whereas MNIST only has handwritten digits. The structure is pretty much the same as MNIST, with 28x28 grayscale images.
4) Sign Language MNIST: It is like EMNIST, in the sense that it has images of sign language hand signs for the letters of the English alphabet (excluding J and Z, which require motion). It poses the slightly more challenging problem of hand gesture recognition and therefore has more direct real-world applications.
5) Colorectal Histology MNIST: This dataset serves up a much more interesting MNIST-style problem for biologists by focusing on histology tiles from patients with colorectal cancer, which affects the colon or rectum. In particular, the data has 8 different tissue classes.
6) Skin Cancer MNIST: It is a medical dataset containing images of skin lesions/cancers along with their corresponding labels. This dataset was made for the 2018 Skin Lesion Detection Challenge. It can be used as a primary dataset for anyone trying to tackle a medical classification problem using deep learning.
Let’s get our hands dirty! While MNIST is also available in CSV format, for the purpose of this notebook we'll use the original MNIST in the ubyte format.
Follow these simple steps to download and store MNIST on your local machine:
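Once the files are on disk, it helps to know what is inside them. The ubyte files use the IDX layout: a big-endian header (a magic number followed by the dimensions) and then raw unsigned bytes. Here is a minimal sketch of a reader using only the standard library. The function names are my own, and the demo bytes are synthetic, not real MNIST data.

```python
import struct

def parse_idx_images(raw: bytes):
    """Parse the bytes of an IDX image file into a list of 2D pixel grids."""
    # Header: four big-endian 32-bit integers (magic, count, rows, cols).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 0x00000803: unsigned bytes, 3 dimensions
        raise ValueError(f"unexpected magic number for images: {magic}")
    pixels = raw[16:]
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Split the flat run of bytes into rows of `cols` pixels each.
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images

def parse_idx_labels(raw: bytes):
    """Parse the bytes of an IDX label file into a list of integers."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:  # 0x00000801: unsigned bytes, 1 dimension
        raise ValueError(f"unexpected magic number for labels: {magic}")
    return list(raw[8:8 + count])

# Synthetic stand-in: two 2x2 "images" and two labels (not real MNIST data).
demo_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
demo_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

print(parse_idx_images(demo_images))  # [[[0, 255], [128, 64]], [[1, 2], [3, 4]]]
print(parse_idx_labels(demo_labels))  # [7, 3]
```

With the real files you would pass in the bytes read from disk, for example the contents of the training image file opened in binary mode. In practice we will let torchvision do this for us, but it is nice to see that there is no magic involved.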
There are a lot of deep learning frameworks out there that you can use, like Keras, MXNet, and PyTorch.
You can install torch and torchvision from pytorch.org; just choose the applicable OS, language, etc.
Okay, time to load some libraries we will be needing.
PyTorch has a very convenient way to load the MNIST data: datasets.MNIST. Instead of data structures such as NumPy arrays and lists, deep learning models use a very similar data structure called a tensor. Unlike plain arrays, tensors can run on GPUs and keep track of gradients, which makes them well suited to training neural networks. We will convert our MNIST images into tensors when loading them. There are lots of other transformations available in torchvision.transforms, such as resizing and normalizing, but we won't need them here since MNIST is a very simple dataset.
The train data has 60,000 images and the test has 10,000. Let's look at one.
Each image is made up of 28x28 pixels. The 1 in torch.Size stands for the number of channels; since it's a grayscale image, there's only one channel.
Before we go any further, the neural network we will be using is the most basic one. So, let’s have a quick introduction.
A multilayer perceptron has several Dense layers of neurons in it, hence the name multi-layer.
There are 3 basic components:
1. Input Layer- The input layer would take in the input signal to be processed. In our case, it's a tensor of image pixels.
2. Output Layer- The output layer does the required task of classification/regression. In our case, it outputs a score for each of the 10 classes (digits 0-9) for a given input image, and the class with the highest score is the prediction.
3. Hidden Layers - There is an arbitrary number of hidden layers in between the input and output layer that do all the computations in a Multilayer Perceptron. The number of hidden layers and the number of neurons in each should be chosen keeping in mind that one layer's output is the next layer's input.
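To make the three components concrete, here is a minimal sketch of a multilayer perceptron in PyTorch. The hidden layer widths (512 and 256) are assumptions picked for illustration, not values from this article, and the ReLU activation between layers is discussed a little further down. Notice how each layer's output size is the next layer's input size.

```python
import torch
from torch import nn

class MLP(nn.Module):
    """A small multilayer perceptron for 28x28 grayscale images."""

    def __init__(self, hidden_sizes=(512, 256), num_classes=10):
        super().__init__()
        # Input layer: flatten each 1x28x28 image into a vector of 784 pixels.
        layers = [nn.Flatten()]
        in_features = 28 * 28
        # Hidden layers: the output width of one is the input width of the next.
        for width in hidden_sizes:
            layers += [nn.Linear(in_features, width), nn.ReLU()]
            in_features = width
        # Output layer: one score per digit class.
        layers.append(nn.Linear(in_features, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
batch = torch.rand(100, 1, 28, 28)  # a random stand-in for 100 images
logits = model(batch)
print(logits.shape)  # torch.Size([100, 10])
```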
Now we know the basics of the architecture. To better understand how it works, let's take the example of our use case: image classification with MNIST.
I'll try to break down the process into different steps:
An activation function transforms the output of a neuron, often squashing it into a definite range such as 0 to 1 (Sigmoid) or -1 to 1 (Tanh). The activation function we have used here is ReLU. Its main advantage is that it does not activate all the neurons at the same time, since neurons with a negative input output zero, and it is cheaper to compute than Tanh or Sigmoid.
In short, ReLU sets all negative values to zero and keeps the positive values just the same.
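A quick way to see the difference is to run a few numbers through each function. This is a plain-Python sketch with hand-written helpers, not the PyTorch implementations:

```python
import math

def relu(x):
    """Zero out negative values, pass positive values through unchanged."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the open range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

values = [-3.0, -0.5, 0.0, 0.5, 3.0]
for v in values:
    # math.tanh squashes into the open range (-1, 1).
    print(f"x={v:5.1f}  relu={relu(v):.2f}  sigmoid={sigmoid(v):.3f}  tanh={math.tanh(v):.3f}")
```

ReLU needs only a comparison, while Sigmoid and Tanh both need an exponential, which is where the savings in computation come from.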
The process described above is a single forward pass through the network. Instead of sending just one image as input, we feed a whole batch of images through the network in a single pass.
But how does the network learn?
After a single pass through the network, the model's predictions for that batch of images are compared with the actual labels of those images, and a loss is calculated. Based on the value of this loss, gradients flow backward through the neural network to update the parameters (the weights W and biases b) in each layer. This process is called backpropagation.
In the next iteration, the neural network will do a slightly better job at predicting. This cycle of forward pass and backpropagation keeps repeating as we try to minimize our loss, until we reach the end of our training.
Now that we know most of the things, let's dive right into the code.
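To see this loop in its simplest form, here is a toy sketch with a single neuron (one weight and one bias) learning the line y = 2x + 1. It is a deliberate simplification: with only one neuron the gradients can be written out by hand, whereas backpropagation applies the same idea through many layers using the chain rule. The data, learning rate, and number of steps are all made up for illustration.

```python
def train_single_neuron(xs, ys, lr=0.05, steps=500):
    """Fit y = w * x + b with plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        # Forward pass: predict, then measure the loss against the labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        history.append(loss)
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Update: nudge each parameter against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, history = train_single_neuron(xs, ys)

print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.8f}")
print(f"learned w={w:.3f}, b={b:.3f}")
```

Each iteration makes the predictions a little better, which is exactly what happens in the full network, just with far more parameters.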
Loading data into batches
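A sketch of the batching step with PyTorch's DataLoader, assuming a batch size of 100. To keep the snippet runnable without downloading anything, it uses 1,000 random tensors as a stand-in for the MNIST training set; with the real data you would pass the dataset object to DataLoader in the same way.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with the same shapes as MNIST images and labels.
images = torch.rand(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

# shuffle=True reorders the samples at the start of every epoch.
loader = DataLoader(dataset, batch_size=100, shuffle=True)

batch_images, batch_labels = next(iter(loader))
print(len(loader))          # 10 batches of 100 samples each
print(batch_images.shape)   # torch.Size([100, 1, 28, 28])
print(batch_labels.shape)   # torch.Size([100])
```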
Defining loss function and the optimizer
There are a lot of loss functions out there, like Binary Cross Entropy, Mean Squared Error, Hinge loss, etc. The choice of the loss function depends on the problem at hand and the number of classes. Since we are dealing with a multi-class classification problem, PyTorch's CrossEntropyLoss is our go-to loss function.
Let us talk about the elephant in the room: the optimizer. Remember, I mentioned that during backpropagation we update the weights according to the loss throughout the iterations. We basically try to minimize the loss as we move ahead through our training. This process is called optimization. Optimizers are algorithms that try to find the optimal way to minimize the loss by navigating the surface of our loss function. We use Adam because it is one of the most widely used optimizers and tends to work well across many problems without much tuning.
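Here is a small sketch of how the two pieces fit together. The one-layer model and the learning rate of 1e-3 (Adam's default in PyTorch) are assumptions for illustration. The first part is a sanity check: when the model gives the same score to all 10 classes, the cross-entropy loss equals ln(10), roughly 2.30, which is a useful number to compare your first training losses against.

```python
import math
import torch
from torch import nn

# A one-layer stand-in model; the real network would go here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumed value

# Sanity check: equal scores for all 10 classes give a loss of ln(10).
uniform_logits = torch.zeros(4, 10)
targets = torch.tensor([0, 3, 7, 9])
uniform_loss = criterion(uniform_logits, targets).item()
print(f"{uniform_loss:.4f} vs ln(10) = {math.log(10):.4f}")

# One optimization step on a random stand-in batch of 4 images.
images = torch.rand(4, 1, 28, 28)
weights_before = model[1].weight.detach().clone()
optimizer.zero_grad()                     # clear gradients from earlier steps
loss = criterion(model(images), targets)  # forward pass and loss
loss.backward()                           # backpropagation
optimizer.step()                          # weight update
weights_changed = not torch.equal(weights_before, model[1].weight.detach())
print(weights_changed)
```

Note that CrossEntropyLoss expects raw scores (logits), so the model should not apply softmax to its output.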
Before I write a plethora of code for training, let me explain a few concepts that'll be used.
The number of training steps per epoch can be expressed as number of training records / batch size, which is 600 (60000/100) in our case. We'll train the model for 10 epochs, so the model will see the full training data exactly 10 times.
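The arithmetic is easy to check in a couple of lines. The helper name is my own, and it rounds up because a final, smaller batch still counts as a step when the data doesn't divide evenly (which is also DataLoader's default behavior).

```python
import math

def steps_per_epoch(num_records, batch_size):
    """Number of batches needed to see every record once."""
    return math.ceil(num_records / batch_size)

train_records, batch_size, epochs = 60_000, 100, 10
per_epoch = steps_per_epoch(train_records, batch_size)
total_steps = per_epoch * epochs

print(per_epoch)    # 600
print(total_steps)  # 6000
```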
The code for training is just a few lines in Keras. In PyTorch it takes quite a bit more, because there are wrappers only for the very essential stuff and the rest is left to the user to play with. In return, the user gets better control over training, and writing the loop by hand also makes the fundamentals of model training clear, which is valuable for beginners.
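A training loop along these lines could look like the sketch below. It is not the exact code of this notebook: the small model, the 3 epochs, and the random stand-in data are assumptions chosen so the snippet runs quickly without a download. The five lines inside the inner loop are the heart of every PyTorch training loop.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, criterion, optimizer, epochs):
    """Run the forward/backward/update cycle and return the mean loss per epoch."""
    epoch_losses = []
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for images, labels in loader:
            optimizer.zero_grad()              # reset gradients from the last step
            outputs = model(images)            # forward pass
            loss = criterion(outputs, labels)  # compare predictions with labels
            loss.backward()                    # backpropagation
            optimizer.step()                   # update the weights
            running_loss += loss.item()
        epoch_losses.append(running_loss / len(loader))
        print(f"epoch {epoch + 1}: mean training loss {epoch_losses[-1]:.4f}")
    return epoch_losses

# Random stand-in data (500 samples) instead of the real MNIST training set.
torch.manual_seed(0)
images = torch.rand(500, 1, 28, 28)
labels = torch.randint(0, 10, (500,))
loader = DataLoader(TensorDataset(images, labels), batch_size=100, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = train(model, loader, criterion, optimizer, epochs=3)
```

Because the stand-in labels are random, the loss here says nothing about real performance; the point is only the shape of the loop.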
The training loss keeps decreasing throughout the epochs, so we can conclude that our model is learning. But to gauge the performance of our model, we'll have to see how well it does on unseen (test) data.
Predictions are made on our test data at the end of every epoch. Since our model keeps getting better here, the test accuracy of the last epoch is the best.
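Measuring accuracy on held-out data could be sketched like this. The `evaluate` helper is my own naming, and the demo uses a trivial identity "model" on one-hot inputs so the expected accuracies are obvious; with the real setup you would pass the trained network and the test DataLoader instead.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return the fraction of samples whose highest-scoring class is the label."""
    model.eval()               # switch layers like dropout to inference mode
    correct, total = 0, 0
    with torch.no_grad():      # no gradients needed when only predicting
        for inputs, labels in loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Demo: one-hot "scores" passed through an identity model.
labels = torch.arange(10).repeat(5)   # 50 labels: 0..9 repeated five times
scores = torch.eye(10)[labels]        # each row has a 1 at its label's index

matching = DataLoader(TensorDataset(scores, labels), batch_size=10)
shifted = DataLoader(TensorDataset(scores, (labels + 1) % 10), batch_size=10)

perfect = evaluate(nn.Identity(), matching)
all_wrong = evaluate(nn.Identity(), shifted)
print(perfect, all_wrong)  # 1.0 0.0
```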
It will be intuitive and fun to see the progression of loss and accuracy through the epochs.
We are at the end and have successfully trained an image recognition model on the MNIST dataset.
There are several tricks you can try to improve the performance of the model like:
Handwriting recognition from images isn't limited to MNIST or to learning the basics of deep learning. There is a whole field built around it called OCR, or Optical Character Recognition. OCR is very useful for digitizing handwritten documents and is also used by Google Lens to extract text from images.