Explain Accuracy, Precision, Recall, and F-beta Score

In this tutorial, we will learn about the performance metrics of a classification model. We will be learning about accuracy, precision, recall and f-beta score.


A confusion matrix provides a wealth of information. It helps us understand how effectively the classification model is working through calculated metrics like accuracy, precision, recall, and f-beta score. However, one of the most popular questions among aspiring data scientists is when should these measures be used. The answer to this query can be found in this tutorial. Let's take a look at each of these metrics and see how they're used.


1) Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

Accuracy is one of the most widely used performance metrics. It is the ratio of correctly predicted observations to the total number of observations. However, deeming a model the best based solely on accuracy is a mistake. Accuracy is a relevant measure when the dataset is balanced and the number of FPs is roughly the same as the number of FNs. For imbalanced datasets, we need to resort to other performance metrics because we also care about how many positive and negative predictions are wrongly classified, and in which direction. For example, in Covid-19 classification, what if we wrongly classify a person as negative, but the person goes on to fall ill and their condition becomes severe? They might even end up spreading the virus. This is precisely why we need to break the accuracy formula down further.
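To see how accuracy can mislead on an imbalanced dataset, here is a minimal sketch in plain Python. The labels are made up for illustration (1 = Covid positive, 0 = negative); they are not from any real dataset.

```python
# Toy Covid-19 test results: 1 = positive, 0 = negative (illustrative only)
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# Count the four confusion-matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.8 -- looks decent, yet 2 of the 3 actual positives were missed
```

The model scores 80% accuracy while missing two of three Covid-positive people, which is exactly the failure mode described above.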

Let us go through Type I and Type II errors before understanding Precision, Recall, and F-beta score.

Type I error – False Positive, i.e., the case where we reject a null hypothesis that is actually true
Type II error – False Negative, i.e., the case where we fail to reject a null hypothesis that is actually false

With this in mind let us move on to Precision.

2) Precision

Formula: TP/ (TP+FP) i.e. TP/Total predicted positive

Precision is defined as the proportion of correctly detected positive cases among all predicted positive cases. It measures how precisely your model predicts the actual positives and focuses on the Type I error. We should use precision as the performance metric when false positives are the greater concern. For example, in email spam detection, if a legitimate email is incorrectly classified as spam, the user might end up missing critical emails. In this case, it is more important for the model to be precise.
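Continuing the spam example, precision can be computed directly from the counts of true and false positives. The labels below are invented for illustration (1 = spam, 0 = not spam).

```python
# Toy spam filter outputs: 1 = spam, 0 = not spam (illustrative only)
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

# True positives: predicted spam and actually spam
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
# False positives: predicted spam but actually legitimate (the costly error here)
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
print(precision)  # 2/3 -- one legitimate email was flagged as spam
```

The single false positive (a real email sent to the spam folder) is what drags precision below 1, which is the error this metric is designed to expose.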

3) Recall

Formula: TP/ (TP+FN) i.e. TP / Total Actual positive

Recall is the proportion of correctly detected positive cases among all actual positive instances. By the same reasoning, when a False Negative has the higher cost, recall is the performance metric we use to choose our best model. For example, fraud detection: a bank may face severe consequences if an actual positive (a fraudulent transaction) is predicted as negative (non-fraudulent). In the same way, predicting an actually positive (Covid-19) person as negative is very dangerous. In these cases, we must focus on achieving a higher recall.
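Recall follows the same pattern, but the denominator is the number of actual positives. The toy labels below (1 = fraudulent transaction, 0 = legitimate) are invented for illustration.

```python
# Toy fraud detector: 1 = fraudulent, 0 = legitimate (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

# True positives: fraud correctly flagged
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
# False negatives: fraud that slipped through (the costly error here)
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

recall = tp / (tp + fn)
print(recall)  # 0.75 -- one fraudulent transaction went undetected
```

Note that the false positive at the last position does not affect recall at all; recall only penalizes the fraud that was missed.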

Precision-Recall Trade-off

The values of both precision and recall lie between 0 and 1. In our scenario, we wish to avoid overlooking true positive cases when classifying passengers as COVID positive or negative. It would be particularly problematic if a person is genuinely positive but our model fails to detect it, because there is a substantial risk of the virus spreading if such individuals are allowed to board the flight. So, even if there is only a small chance that a person has COVID, we cannot risk labeling them negative. As a result, we set the decision rule so that if the output probability is larger than 0.25, we designate them COVID positive. Recall therefore increases, but precision is reduced.

Let us now consider the opposite scenario, where we must designate a person positive only when we are certain of it. We can achieve this by setting the probability threshold higher (e.g., 0.85). This means a person is labeled positive only when their probability is greater than 0.85, and negative otherwise. For most classifiers, we can observe a trade-off between recall and precision as we change the probability threshold. When comparing multiple models with varied precision-recall values, it is often more convenient to integrate precision and recall into a single statistic. To measure performance, we need a statistic that takes both recall and precision into account.
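The trade-off described above can be sketched by sweeping the threshold over a set of predicted probabilities. The probabilities and labels below are hypothetical, chosen only to make the effect visible.

```python
# Hypothetical predicted probabilities of being COVID positive (illustrative only)
probs  = [0.95, 0.80, 0.60, 0.30, 0.20, 0.10, 0.90, 0.40]
y_true = [1,    1,    1,    0,    0,    0,    1,    0]

def precision_recall(threshold):
    """Compute (precision, recall) when labeling positive at the given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(t == 1 and q == 1 for t, q in zip(y_true, preds))
    fp = sum(t == 0 and q == 1 for t, q in zip(y_true, preds))
    fn = sum(t == 1 and q == 0 for t, q in zip(y_true, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(0.25))  # low threshold: perfect recall, lower precision
print(precision_recall(0.85))  # high threshold: perfect precision, lower recall
```

At the 0.25 threshold every true positive is caught (recall = 1.0) at the cost of false alarms; at 0.85 every flagged person really is positive (precision = 1.0) but half the true positives are missed.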

4) F-beta Score

Formula: ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)

As previously stated, we require a statistic that considers both recall and precision, and the F-beta score fulfills this requirement. The F-beta score is the weighted harmonic mean of precision and recall. Its value lies between 0 and 1, where 1 is the best and 0 is the worst. The weight "beta" is chosen depending on the scenario. If precision is more important, beta is set to less than one. When beta is greater than one, recall is prioritized. If beta is set to 1, we get the F1 score, which is the plain harmonic mean of precision and recall and gives equal weight to both.

Beta = 1 is the default value. The formula becomes –
F1 score = (2 * Precision * Recall) / (Precision + Recall)

To prioritize precision, you can set a smaller beta value such as 0.5. The formula becomes –
F0.5 score = (1.25 * Precision * Recall) / (0.25 * Precision + Recall)

To prioritize recall, you can set a larger beta value such as 2. The formula becomes –
F2 score = (5 * Precision * Recall) / (4 * Precision + Recall)

