How to plot a lift chart in R logistic regression


Recipe Objective

How to plot a lift chart in R (logistic regression).

Logistic regression is a supervised learning model for classification. It is used when the independent variables (x) can be continuous or categorical, but the dependent variable (y) is categorical. The logistic regression model makes a prediction on the data and classifies it into the binary classes 1 and 0. Lift charts are used for comparing binary predictive models: they are a visual representation of a model's performance, measuring the effectiveness of a predictive classification model against a baseline model. This recipe demonstrates how to plot a lift chart in R. In the following example, a '**Healthcare case study**' is taken, in which logistic regression is applied to a dataset.
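Before working through the recipe, the idea behind lift can be sketched on synthetic data (the labels and scores below are made up purely for illustration, not taken from the healthcare dataset): rank observations by predicted score and compare the positive rate in the top-ranked slice with the overall base rate.

```r
# Hypothetical sketch of "lift": how much better a scored sample is than random targeting.
# 'y' are made-up 0/1 labels; 'score' are made-up predicted probabilities.
set.seed(1)
y     <- rbinom(200, 1, 0.25)                                    # ~25% positives overall
score <- ifelse(y == 1, runif(200, 0.3, 1), runif(200, 0, 0.7))  # informative but noisy scores

ord      <- order(score, decreasing = TRUE)  # rank observations by score, best first
top20    <- ord[1:40]                        # the top 20% of the sample
lift_top <- mean(y[top20]) / mean(y)         # positive rate in the top slice / base rate
lift_top                                     # > 1 means the model beats random selection
```

A lift chart plots this ratio for every cutoff fraction of the ranked sample, which is what `ROCR` automates in the final step of this recipe.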

Step 1 - Load the necessary libraries

install.packages("dplyr")    # for data manipulation
install.packages("caTools")  # for sample.split(), used to create train/test sets
install.packages("ROCR")     # for the ROC curve and lift chart used to evaluate the model
library(dplyr)
library(caTools)
library(ROCR)
library(MASS)

Step 2 - Read a csv dataset

data <- read.csv("https://storage.googleapis.com/dimensionless/Analytics/quality.csv") # reads the dataset

Step 3 - EDA: Exploratory Data Analysis

dim(data)          # returns the number of rows and columns in the dataset
print(head(data))  # overview of the first rows of the dataset
summary(data)      # generates a statistical summary of each column

Step 4 - Creating a baseline model

Evaluating how many patients received good care and how many received poor care.

baseline <- table(data$PoorCare)
baseline

GoodCare (0): 98, PoorCare (1): 33. Since more patients received good care, the baseline model always predicts good care. Baseline model accuracy: 98/(98+33) ≈ 75%. Hence, our model's accuracy must be higher than the baseline accuracy.
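The baseline arithmetic above can be reproduced directly; the counts are taken from the recipe's table:

```r
# Baseline model: always predict the majority class (counts from the recipe's dataset).
baseline <- c(GoodCare = 98, PoorCare = 33)
baseline_accuracy <- max(baseline) / sum(baseline)  # accuracy of always predicting the majority
round(baseline_accuracy, 3)  # 0.748, i.e. ~75%
```

Any classifier we build has to beat this 74.8% to be worth using.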

Step 5 - Create train and test datasets

split <- sample.split(data$PoorCare, SplitRatio = 0.8)  # split on the outcome column so both sets keep its class balance
split

sample.split() divides the data into train and test sets with a ratio of 0.8: 80% of the dataset goes into the training set and 20% into the testing set.

train <- subset(data, split == TRUE)
test  <- subset(data, split == FALSE)

The train dataset gets all the rows marked TRUE by the split, and the test dataset gets all the rows marked FALSE.

dim(train)  # dimension/shape of the train dataset
dim(test)   # dimension/shape of the test dataset
head(train)
head(test)

Step 6 - Create a logistic regression model using the training dataset

model <- glm(PoorCare ~ ., data = train, family = "binomial")  # glm() fits a generalized linear model; family = "binomial" makes it logistic regression
summary(model)  # shows the coefficients and statistical significance of the independent variables

Step 7 - Make predictions on the model using the test dataset

After the model is created and fitted, it is used for making predictions on unseen data values, i.e. the test dataset.

pred_test <- predict(model, test, type = "response")  # type = "response" returns predicted probabilities
pred_test

Step 8 - Model Diagnostics

**Confusion matrix**: a confusion matrix is a performance metric for summarizing the results of a classification algorithm. The numbers of correct and incorrect predictions are summarized with count values, broken down by each combination of predicted and actual class. It gives insight not only into the errors your classifier makes, but, more importantly, into the types of errors being made.

TN: actually good care, and we predict good care.
TP: actually poor care, and we predict poor care.
FP: we predict poor care, but it is actually good care.
FN: we predict good care, but it is actually poor care.

After the predictions on the test dataset are made, create a confusion matrix with a threshold value of 0.5.

table(Actualvalue = test$PoorCare, Predictedvalue = pred_test > 0.5)  # assuming the threshold to be 0.5

Our confusion matrix shows 18 true negatives and 3 true positives, along with 1 false positive and 6 false negatives, i.e. 6 patients are predicted to be getting good care when in fact they are not. The false negatives must be reduced as much as possible.

accuracy <- (18 + 3) / (18 + 3 + 1 + 6)  # Accuracy = (TP + TN) / (TP + TN + FP + FN): of all predictions, the fraction that is correct; should be as high as possible
accuracy

Our baseline model gave an accuracy of about 75%, and on this split the fitted model reaches (18 + 3)/28 = 75% as well; the exact value varies with the random split, which is why the error types and the threshold analysis below matter more than accuracy alone.

sensitivity <- 3 / (3 + 6)    # Sensitivity / true positive rate = TP / (TP + FN): proportion of actual positives correctly identified
sensitivity
specificity <- 18 / (18 + 1)  # Specificity / true negative rate = TN / (TN + FP): proportion of actual negatives correctly identified
specificity
precision <- 3 / (1 + 3)      # Precision = TP / (TP + FP): of all predicted positives, how many are actually positive
precision
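The four metrics above can be wrapped in a small helper (a hypothetical convenience function, not part of any package), fed with the counts from the recipe's confusion matrix (TN = 18, FP = 1, FN = 6, TP = 3):

```r
# Hypothetical helper: compute the diagnostics above from any 2x2 confusion-matrix counts.
classification_metrics <- function(tn, fp, fn, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),  # fraction of all predictions correct
    sensitivity = tp / (tp + fn),                   # true positive rate
    specificity = tn / (tn + fp),                   # true negative rate
    precision   = tp / (tp + fp))                   # fraction of predicted positives correct
}

round(classification_metrics(tn = 18, fp = 1, fn = 6, tp = 3), 3)
# accuracy 0.75, sensitivity 0.333, specificity 0.947, precision 0.75
```

Re-running it with the counts from any other threshold makes the trade-offs in the next step easy to compare.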

Step 9 - How to do thresholding: ROC Curve


ROC curve: the ROC (Receiver Operating Characteristic) curve helps in deciding the best threshold value. It is plotted with the FPR (false positive rate) on the x-axis and the TPR (true positive rate) on the y-axis. A high threshold value gives high specificity and low sensitivity; a low threshold value gives low specificity and high sensitivity.

ROCR_pred_test <- prediction(pred_test, test$PoorCare)
ROCR_perf_test <- performance(ROCR_pred_test, 'tpr', 'fpr')
plot(ROCR_perf_test, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

The threshold value can then be selected according to the requirement. In our case study we want to reduce the false negatives as much as possible, so we can choose a threshold value that increases our TPR while keeping the FPR low, e.g. a threshold of 0.3, and then create a new confusion matrix and check the accuracy of the model.
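The effect of lowering the threshold can be sketched on a few hypothetical probabilities (the numbers below are made up for illustration, not the recipe's actual predictions):

```r
# Made-up labels and predicted probabilities to illustrate thresholding.
actual <- c(0, 0, 0, 1, 1, 1, 0, 1)
prob   <- c(0.10, 0.35, 0.20, 0.40, 0.80, 0.32, 0.05, 0.55)

table(Actual = actual, Predicted = prob > 0.5)  # threshold 0.5: the 0.32 and 0.40 positives become false negatives
table(Actual = actual, Predicted = prob > 0.3)  # threshold 0.3: all positives recovered, at the cost of one false positive
```

Lowering the threshold trades false positives for fewer false negatives, which is exactly the trade-off the case study calls for.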

Step 10 - Plot Lift chart

perf <- performance(ROCR_pred_test, "lift", "rpp")
plot(perf, main = "Lift curve", colorize = T)  # lift of the model relative to the 0.75 baseline, plotted against the rate of positive predictions
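To help read the curve, what performance(..., "lift", "rpp") computes can be reproduced by hand on simulated scores (the data below is synthetic, not the healthcare dataset): for each fraction of the sample targeted, lift is the cumulative precision divided by the overall positive rate.

```r
# Manual lift curve on simulated data, mirroring ROCR's "lift" vs "rpp" measures.
set.seed(7)
y     <- rbinom(100, 1, 0.25)                              # synthetic 0/1 outcomes
score <- y * runif(100, 0.2, 1) + (1 - y) * runif(100, 0, 0.8)  # synthetic model scores

ord  <- order(score, decreasing = TRUE)      # rank observations by score
rpp  <- (1:100) / 100                        # rate of positive predictions (x-axis)
lift <- cumsum(y[ord]) / (1:100) / mean(y)   # cumulative precision / base rate (y-axis)
plot(rpp, lift, type = "l", main = "Lift curve (manual)")
```

The curve always ends at lift = 1 when the whole sample is targeted; the higher it sits on the left, the more the model concentrates true positives among its top-ranked predictions.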
