How to do logistic regression in R

This recipe helps you do logistic regression in R

Recipe Objective

How to do logistic regression in R.

Logistic Regression is a supervised learning model for classification. It is used when the independent variables (x) can be continuous or categorical, but the dependent variable (y) is categorical. Logistic regression uses the logit/sigmoid function given by f(x) = 1 / (1 + e^(-x)). This function converts the continuous output into a probability. The dependent variable takes a binary value 0/1, and the data has to be classified into one of the two classes. A threshold value t is used: all values above the threshold are classified as class 1, and all values below it as class 0. To check the accuracy of the model, performance metrics such as the confusion matrix, accuracy score, receiver operating characteristic (ROC) curve and F-measure are used. This recipe demonstrates an example of logistic regression in R. In the following example, a '**Healthcare case study**' is taken, and logistic regression is applied to a dataset.
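The sigmoid mapping and the thresholding step described above can be sketched in R (the scores and the threshold of 0.5 here are illustrative values, not from the case study):

```r
# Sigmoid function: maps any real-valued input to a probability in (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Example: convert a few raw scores to probabilities
scores <- c(-2, 0, 2)
probs  <- sigmoid(scores)

# Thresholding: probabilities above t are classified as class 1, else class 0
t <- 0.5
classes <- ifelse(probs > t, 1, 0)
```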

Data Set Description

The variables in the dataset quality.csv are as follows:

    • MemberID: numbers the patients from 1 to 131, and is just an identifying number.
    • InpatientDays: is the number of inpatient visits, or number of days the person spent in the hospital.
    • ERVisits: is the number of times the patient visited the emergency room.
    • OfficeVisits: is the number of times the patient visited any doctor's office.
    • Narcotics: is the number of prescriptions the patient had for narcotics.
    • DaysSinceLastERVisit: is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
    • Pain: is the number of visits for which the patient complained about pain.
    • TotalVisits: is the total number of times the patient visited any healthcare provider.
    • ProviderCount: is the number of providers that served the patient.
    • MedicalClaims: is the number of days on which the patient had a medical claim.
    • ClaimLines: is the total number of medical claims.
    • StartedOnCombination: is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
    • AcuteDrugGapSmall: is the fraction of acute drugs that were refilled quickly after the prescription ran out.
    • PoorCare: is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

The dependent variable is modeled as a binary variable:

      • 1 if low-quality care, 0 if high-quality care

Step 1 - Load the necessary libraries

install.packages("dplyr") # Install dplyr
library(dplyr) # Load dplyr
install.packages("caTools") # For logistic regression
install.packages("ROCR") # For ROC curve to evaluate the model
library(caTools)
library(ROCR)
library(MASS)

Step 2 - Read a csv dataset

data <- read.csv("https://storage.googleapis.com/dimensionless/Analytics/quality.csv") # reads the dataset

Step 3 - EDA : Exploratory Data Analysis

dim(data) # returns the number of rows and columns in the dataset
print(head(data)) # overview of the dataset
summary(data) # summary() generates the statistical summary of the data

Step 4 - Creating a baseline model

Evaluating how many patients received good care and how many received poor care.

table(data$PoorCare)

GoodCare (0): 98, PoorCare (1): 33. As more patients received good care, the baseline model predicts that every patient is getting good care. Baseline model accuracy: 98/(98+33) ≈ 75%. Hence, our model's accuracy must be higher than the baseline accuracy.
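The baseline accuracy above can be computed directly from the class counts; predicting the majority class for every patient gives the baseline (the counts 98 and 33 come from the quality.csv table shown above):

```r
# Class counts from the dataset: 98 good care (0), 33 poor care (1)
class_counts <- c(GoodCare = 98, PoorCare = 33)

# Baseline model: always predict the majority class (good care)
baseline_accuracy <- max(class_counts) / sum(class_counts)
baseline_accuracy # ≈ 0.748, i.e. roughly 75%
```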

Step 5- Create train and test dataset

split <- sample.split(data$PoorCare, SplitRatio = 0.8) # sample.split() expects the outcome vector, not the whole data frame
split

The sample.split() method splits the data into train and test datasets with a ratio of 0.8. This means 80% of our dataset goes into the training dataset and 20% into the testing dataset.

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the data points that are TRUE after the split, and the test dataset gets all the data points that are FALSE.

dim(train) # dimension/shape of train dataset
dim(test) # dimension/shape of test dataset
head(train)
head(test)

Step 6 - Create a logistic regression model using the training dataset
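The code for this step is missing from the text above, although the fitted `model` is used in Step 7; a minimal sketch using glm() with the binomial family. The predictors chosen here (OfficeVisits, Narcotics) are an illustrative assumption, and the synthetic data frame only makes the sketch self-contained — in the recipe you would fit on the `train` subset from Step 5:

```r
# Synthetic stand-in for the train dataset from Step 5, so the sketch
# runs on its own; replace with the real train subset in the recipe.
set.seed(88)
train <- data.frame(
  OfficeVisits = rpois(100, 10),
  Narcotics    = rpois(100, 2),
  PoorCare     = rbinom(100, 1, 0.25)
)

# Fit a logistic regression: family = binomial gives the logit link,
# so fitted values are probabilities of PoorCare = 1
model <- glm(PoorCare ~ OfficeVisits + Narcotics,
             data = train, family = binomial)
summary(model) # coefficients, deviance and AIC of the fitted model
```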

Step 7- Make predictions on the model using the test dataset

After the model is created and fitted, this model is used for making predictions on the unseen data values i.e the test dataset.

pred_test <- predict(model, test, type = "response")
pred_test

Step 8 - Model Diagnostics

Confusion matrix: the confusion matrix is a performance metric for summarizing the performance of a classification algorithm. The numbers of correct and incorrect predictions are summarized with count values, broken down by each class of predicted and actual values. It gives you insight not only into the errors being made by your classifier, but, more importantly, into the types of errors being made.

TN: actually good care, and we predict good care.
TP: actually poor care, and we predict poor care.
FP: we predict poor care, but it is actually good care.
FN: we predict good care, but it is actually poor care.

After the predictions on the test dataset are made, create a confusion matrix with threshold value = 0.5.

table(Actualvalue = test$PoorCare, Predictedvalue = pred_test > 0.5) # assuming the threshold to be 0.5

Our confusion matrix states that the true negatives and true positives are 20 and 2 respectively. But we have 2 false negatives, i.e. patients who are predicted as getting good care but in fact are not. This count must be reduced as much as possible.

accuracy = (20+2)/(20+2+2+3) # out of all the classes, how much we predicted correctly; should be as high as possible
accuracy

Our baseline model gave an accuracy of 75% and our predicted model gave an accuracy of 81.48%, which is actually a good value.

sensitivity = 2/(2+2) # Sensitivity / true positive rate: the proportion of actual positives that are correctly identified
sensitivity
specificity = 20/(20+3) # Specificity / true negative rate: the proportion of actual negatives that are correctly identified
specificity
precision = 2/(2+3) # Precision: of all the cases predicted as positive, how many are actually positive
precision

Step 9 - How to do thresholding : ROC Curve

pred_test <- predict(model,test,type="response")

ROC CURVE - the ROC (Receiver Operating Characteristic) curve can help in deciding the best threshold value. A ROC curve is plotted with the FPR on the X-axis and the TPR on the Y-axis. A high threshold value gives high specificity and low sensitivity; a low threshold value gives low specificity and high sensitivity.

ROCR_pred_test <- prediction(pred_test, test$PoorCare)
ROCR_perf_test <- performance(ROCR_pred_test, 'tpr', 'fpr')
plot(ROCR_perf_test, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

The threshold value can then be selected according to the requirement. In our case study we would want to reduce the false negatives as much as possible, so we can choose a threshold value that increases our TPR and reduces our FPR, i.e. we can choose a threshold of 0.3, create a confusion matrix, and check the accuracy of the model.
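The effect of lowering the threshold to 0.3 can be sketched as follows (the probabilities here are a small made-up vector standing in for pred_test, so the counts are illustrative, not the case-study values):

```r
# Illustrative stand-in for pred_test and test$PoorCare from the recipe;
# in practice use the predictions from Step 7.
pred_probs <- c(0.10, 0.25, 0.35, 0.45, 0.60, 0.80)
actual     <- c(0,    0,    1,    0,    1,    1)

# Confusion matrix at the default threshold of 0.5: one poor-care patient
# (probability 0.35) is missed, a false negative
table(Actualvalue = actual, Predictedvalue = pred_probs > 0.5)

# Lowering the threshold to 0.3 flags more patients as poor care: the false
# negative disappears, at the cost of one extra false positive (0.45)
table(Actualvalue = actual, Predictedvalue = pred_probs > 0.3)
```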


