How to get Classification LogLoss Metric in R

This recipe helps you get Classification LogLoss Metric in R

Recipe Objective

How to get Classification LogLoss Metric?

Classification: Logistic Regression is a classification type supervised learning model. Logistic Regression is used when the independent variable (x) can be continuous or categorical, but the dependent variable (y) is a categorical variable. The performance of a classification model can be evaluated from a confusion matrix using metrics such as precision, recall, and F1-score. There is another performance metric in classification called the LogLoss metric (also known as cross entropy loss) that considers the probabilities produced by the model and not just the final class labels. Log loss is a measure of uncertainty: it is bounded below by 0 but has no upper bound, and the closer the value is to 0, the better the model, so it should be as low as possible. This recipe demonstrates an example of how to get the Classification LogLoss metric in R. In the following example, a '**Healthcare case study**' is taken, in which logistic regression is applied to a data set.
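As a quick illustration of the definition (a toy example with made-up labels and probabilities, not the quality.csv data used below), log loss can be computed directly as the average negative log-likelihood of the true class:

```r
# Toy example: three observations with true labels and the model's
# predicted probabilities for class 1 (values are made up for illustration)
actual    <- c(1, 0, 1)
predicted <- c(0.9, 0.2, 0.6)

# Log loss = average negative log-likelihood of the true class
logloss <- -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
logloss # about 0.28
```

Confident correct predictions (0.9 for a true 1) contribute little to the loss, while the uncertain 0.6 contributes most of it.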

Data Set Description

The variables in the dataset quality.csv are as follows:

    • MemberID: numbers the patients from 1 to 131, and is just an identifying number.
    • InpatientDays: is the number of inpatient visits, or number of days the person spent in the hospital.
    • ERVisits: is the number of times the patient visited the emergency room.
    • OfficeVisits: is the number of times the patient visited any doctor's office.
    • Narcotics: is the number of prescriptions the patient had for narcotics.
    • DaysSinceLastERVisit: is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
    • Pain: is the number of visits for which the patient complained about pain.
    • TotalVisits: is the total number of times the patient visited any healthcare provider.
    • ProviderCount: is the number of providers that served the patient.
    • MedicalClaims: is the number of days on which the patient had a medical claim.
    • ClaimLines: is the total number of medical claims.
    • StartedOnCombination: is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
    • AcuteDrugGapSmall: is the fraction of acute drugs that were refilled quickly after the prescription ran out.
    • PoorCare: is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

The dependent variable is modeled as a binary variable:

      • 1 if low-quality care, 0 if high-quality care

Step 1 - Load the necessary libraries

install.packages("dplyr") # Install dplyr
library("dplyr") # Load dplyr
install.packages("caTools") # For sample.split(), used to create train/test sets
library(caTools)
library(MASS)

Step 2 - Read a csv dataset

data <- read.csv("https://storage.googleapis.com/dimensionless/Analytics/quality.csv") # reads the dataset

Step 3 - EDA : Exploratory Data Analysis

dim(data) # returns the number of rows and columns in the dataset
print(head(data)) # overview of the dataset
summary(data) # generates the statistical summary of the data

Step 4 - Creating a baseline model

Evaluate how many patients received good care and how many received poor care.

table(data$PoorCare)

GoodCare (0): 98, PoorCare (1): 33. As more patients received good care, the baseline model predicts good care for every patient. Baseline model accuracy: 98/(98+33) ≈ 75%. Hence, our model's accuracy must be higher than the baseline accuracy.
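The baseline arithmetic above can be checked directly from the class counts:

```r
# Class counts from table(data$PoorCare)
good <- 98 # patients with good care (0)
poor <- 33 # patients with poor care (1)

# Baseline: always predict the majority class (good care)
baseline_accuracy <- good / (good + poor)
baseline_accuracy # about 0.75
```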

Step 5- Create train and test dataset

split <- sample.split(data$PoorCare, SplitRatio = 0.8) # sample.split() expects the vector of outcome labels, which also keeps the class ratio similar in both sets
split

The sample.split method splits the data into train and test datasets with a ratio of 0.8. This means 80% of our dataset goes into the training dataset and 20% into the testing dataset.

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the data points after split which are 'TRUE' and similarly the test dataset gets all the data points which are 'FALSE'.

dim(train) # dimension/shape of train dataset
dim(test) # dimension/shape of test dataset

Step 6 - Create a logistic regression model using the training dataset

model = glm(PoorCare ~ ., data = train, family = "binomial") # glm() fits a generalized linear model; family = "binomial" gives logistic regression
summary(model) # reports the coefficient estimates and statistical values for the independent variables

Step 7- Make predictions on the model using the test dataset

After the model is created and fitted, this model is used for making predictions on the unseen data values i.e the test dataset.

pred_test <- predict(model, test, type = "response") # type = "response" returns predicted probabilities
pred_test

Step 8 - Model Diagnostics

After the predictions on the test dataset are made, create a confusion matrix with a threshold value of 0.5.

table(Actualvalue = test$PoorCare, Predictedvalue = pred_test > 0.5) # assuming threshold to be 0.5

Our confusion matrix states that the true positives and true negatives are 2 and 20 respectively. But we have 2 false negatives, i.e. patients who are predicted to be getting good care but in fact are not. This number must be reduced as much as possible.

accuracy = (20 + 2) / (20 + 2 + 2 + 3) # out of all the classes, the proportion predicted correctly, which should be as high as possible
accuracy

Our baseline model gave an accuracy of 75% and our predicted model gave an accuracy of about 81.48% (22/27), which is actually a good value. Now, another performance metric we can use is the log loss metric, which considers the probabilities of the model.

LogLoss = function(actual, predicted) {
  result = -1 / length(actual) * (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted)))
  return(result)
}
LogLoss(test$PoorCare, pred_test)

The model's log loss here is 0.68317, which is close to the value achieved by a neutral classifier that assigns equal probability (0.5) to both classes, so the model's probability estimates still carry considerable uncertainty.
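As a sanity check (a small sketch, not part of the recipe's dataset), a classifier that outputs 0.5 for every observation always scores log(2) ≈ 0.6931 regardless of the labels, which is why a log loss near 0.69 signals near-neutral predictions:

```r
# Any mix of 0/1 labels works; the labels do not affect the result
actual    <- c(1, 0, 0, 1, 0)
predicted <- rep(0.5, length(actual))

neutral_logloss <- -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
neutral_logloss # log(2), about 0.6931
```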

