How to get Classification LogLoss Metric in R

This recipe helps you get Classification LogLoss Metric in R

Recipe Objective

How to get Classification LogLoss Metric?

Classification: Logistic Regression is a classification type supervised learning model. Logistic Regression is used when the independent variable (x) can be continuous or categorical, but the dependent variable (y) is a categorical variable. The performance of a classification model can be evaluated from a confusion matrix using metrics such as precision, recall, and F1-score. There is another performance metric in classification called the LogLoss metric (also known as cross entropy loss) that considers the probabilities produced by the model and not just the final class labels. Log loss is a measure of uncertainty: it is bounded below by 0 but has no upper bound, and the closer the value is to 0, the better the model, so it should be as low as possible. This recipe demonstrates an example of how to get the Classification LogLoss metric in R. In the following example, a '**Healthcare case study**' is taken, in which logistic regression is applied to a data set.
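As a quick illustration of the definition (a toy example with made-up labels and probabilities, not the quality.csv data used below), log loss can be computed directly as the average negative log-likelihood of the true class:

```r
# Toy example: three observations with true labels and the model's
# predicted probabilities for class 1 (values are made up for illustration)
actual    <- c(1, 0, 1)
predicted <- c(0.9, 0.2, 0.6)

# Log loss = average negative log-likelihood of the true class
logloss <- -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
logloss # about 0.28
```

Confident correct predictions (0.9 for a true 1) contribute little to the loss, while the uncertain 0.6 contributes most of it.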

Data Set Description

The variables in the dataset quality.csv are as follows:

    • MemberID: numbers the patients from 1 to 131, and is just an identifying number.
    • InpatientDays: is the number of inpatient visits, or number of days the person spent in the hospital.
    • ERVisits: is the number of times the patient visited the emergency room.
    • OfficeVisits: is the number of times the patient visited any doctor's office.
    • Narcotics: is the number of prescriptions the patient had for narcotics.
    • DaysSinceLastERVisit: is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
    • Pain: is the number of visits for which the patient complained about pain.
    • TotalVisits: is the total number of times the patient visited any healthcare provider.
    • ProviderCount: is the number of providers that served the patient.
    • MedicalClaims: is the number of days on which the patient had a medical claim.
    • ClaimLines: is the total number of medical claims.
    • StartedOnCombination: is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
    • AcuteDrugGapSmall: is the fraction of acute drugs that were refilled quickly after the prescription ran out.
    • PoorCare: is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

The dependent variable is modeled as a binary variable:

      • 1 if low-quality care, 0 if high-quality care

Step 1 - Load the necessary libraries

install.packages("dplyr") # Install dplyr
library("dplyr") # Load dplyr
install.packages("caTools") # For sample.split(), used to create train/test sets
library(caTools)
library(MASS)

Step 2 - Read a csv dataset

data <- read.csv("https://storage.googleapis.com/dimensionless/Analytics/quality.csv") # reads the dataset

Step 3 - EDA : Exploratory Data Analysis

dim(data) # returns the number of rows and columns in the dataset
print(head(data)) # overview of the dataset
summary(data) # generates the statistical summary of the data

Step 4 - Creating a baseline model

Evaluate how many patients received good care and how many received poor care.

table(data$PoorCare)

GoodCare (0): 98, PoorCare (1): 33. As more patients received good care, the baseline model predicts good care for every patient. Baseline model accuracy: 98/(98+33) ≈ 75%. Hence, our model's accuracy must be higher than the baseline accuracy.
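The baseline arithmetic above can be checked directly from the class counts:

```r
# Class counts from table(data$PoorCare)
good <- 98 # patients with good care (0)
poor <- 33 # patients with poor care (1)

# Baseline: always predict the majority class (good care)
baseline_accuracy <- good / (good + poor)
baseline_accuracy # about 0.75
```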

Step 5- Create train and test dataset

split <- sample.split(data$PoorCare, SplitRatio = 0.8) # sample.split() expects the vector of outcome labels, which also keeps the class ratio similar in both sets
split

The sample.split method splits the data into train and test datasets with a ratio of 0.8. This means 80% of our dataset goes into the training dataset and 20% into the testing dataset.

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the data points after split which are 'TRUE' and similarly the test dataset gets all the data points which are 'FALSE'.

dim(train) # dimension/shape of train dataset
dim(test) # dimension/shape of test dataset

Step 6 - Create a logistic regression model using the training dataset

model = glm(PoorCare ~ ., data = train, family = "binomial") # glm() fits a generalized linear model; family = "binomial" gives logistic regression
summary(model) # reports the coefficient estimates and statistical values for the independent variables

Step 7- Make predictions on the model using the test dataset

After the model is created and fitted, this model is used for making predictions on the unseen data values i.e the test dataset.

pred_test <- predict(model, test, type = "response") # type = "response" returns predicted probabilities
pred_test

Step 8 - Model Diagnostics

After the predictions on the test dataset are made, create a confusion matrix with a threshold value of 0.5.

table(Actualvalue = test$PoorCare, Predictedvalue = pred_test > 0.5) # assuming threshold to be 0.5

Our confusion matrix states that the true positives and true negatives are 2 and 20 respectively. But we have 2 false negatives, i.e. patients who are predicted to be getting good care but in fact are not. This number must be reduced as much as possible.

accuracy = (20 + 2) / (20 + 2 + 2 + 3) # out of all the classes, the proportion predicted correctly, which should be as high as possible
accuracy

Our baseline model gave an accuracy of 75% and our predicted model gave an accuracy of about 81.48% (22/27), which is actually a good value. Now, another performance metric we can use is the log loss metric, which considers the probabilities of the model.

LogLoss = function(actual, predicted) {
  result = -1 / length(actual) * (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted)))
  return(result)
}
LogLoss(test$PoorCare, pred_test)

The model's log loss here is 0.68317, which is close to the value achieved by a neutral classifier that assigns equal probability (0.5) to both classes, so the model's probability estimates still carry considerable uncertainty.
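As a sanity check (a small sketch, not part of the recipe's dataset), a classifier that outputs 0.5 for every observation always scores log(2) ≈ 0.6931 regardless of the labels, which is why a log loss near 0.69 signals near-neutral predictions:

```r
# Any mix of 0/1 labels works; the labels do not affect the result
actual    <- c(1, 0, 0, 1, 0)
predicted <- rep(0.5, length(actual))

neutral_logloss <- -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
neutral_logloss # log(2), about 0.6931
```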

