How to compare different classification models using logloss and how to pick the best one in R

This recipe helps you compare different classification models using log loss and pick the best one in R.

Recipe Objective

Logistic regression is a supervised learning model used for classification. It applies when the independent variables (x) are continuous or categorical and the dependent variable (y) is categorical. The performance of a classification model can be evaluated with metrics derived from a confusion matrix, such as precision, recall, and F1 score. Another performance metric for classification is log loss (also known as cross-entropy loss), which considers the probabilities a model predicts and not just its final class output. Log loss is 0 for a model that always assigns probability 1 to the correct class, and it has no upper bound: a confident wrong prediction is penalized heavily, so values above 1 are possible. The closer the log loss is to 0, the better the model, because its predicted probabilities match the actual outcomes more closely. This recipe demonstrates how to compare different classification models using log loss and how to pick the best one. The example uses a healthcare case study in which logistic regression and a random forest are fitted to the same data set.
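
To make the scale of log loss concrete, here is a minimal sketch of the penalty for a single observation, which is the negative log of the probability the model gave to the class that actually occurred. The three probabilities are made-up values for illustration, not output from the models in this recipe.

```r
# Probability the model assigned to the class that actually occurred
# (made-up values: confident and right, unsure, confident and wrong)
p_true_class <- c(0.9, 0.5, 0.01)

# Per-observation log loss is the negative log of that probability
loss <- -log(p_true_class)

print(round(loss, 3))  # 0.105 0.693 4.605
```

The last value shows that a single confident mistake costs far more than 1, which is why the average log loss of a model is not limited to the range 0 to 1.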


Data Set Description

The variables in the dataset quality.csv are as follows:

    • MemberID: numbers the patients from 1 to 131, and is just an identifying number.
    • InpatientDays: is the number of inpatient visits, or number of days the person spent in the hospital.
    • ERVisits: is the number of times the patient visited the emergency room.
    • OfficeVisits: is the number of times the patient visited any doctor's office.
    • Narcotics: is the number of prescriptions the patient had for narcotics.
    • DaysSinceLastERVisit: is the number of days between the patient's last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER).
    • Pain: is the number of visits for which the patient complained about pain.
    • TotalVisits: is the total number of times the patient visited any healthcare provider.
    • ProviderCount: is the number of providers that served the patient.
    • MedicalClaims: is the number of days on which the patient had a medical claim.
    • ClaimLines: is the total number of medical claims.
    • StartedOnCombination: is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
    • AcuteDrugGapSmall: is the fraction of acute drugs that were refilled quickly after the prescription ran out.
    • PoorCare: is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

The dependent variable is modeled as a binary variable:

      • 1 if low-quality care, 0 if high-quality care

Step 1 - Load the necessary libraries

install.packages("dplyr") # Install dplyr
library(dplyr) # Load dplyr
install.packages("caTools") # Provides sample.split() for the train/test split
library(caTools)
library(MASS)
install.packages("randomForest") # For fitting the random forest model
library(randomForest)
install.packages("caret") # Classification and regression training
library(caret)
install.packages("e1071", dependencies = TRUE) # Required by some caret functions
library(e1071)

Step 2 - Read a csv dataset

data <- read.csv("https://storage.googleapis.com/dimensionless/Analytics/quality.csv") # reads the dataset

Step 3 - EDA : Exploratory Data Analysis

dim(data) # returns the number of rows and columns in the dataset
print(head(data)) # overview of the first rows of the dataset
summary(data) # summary() function generates the statistical summary of the data
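
Before modelling a binary outcome it can also help to look at how the two classes are balanced, since the share of positive cases sets the log loss that a trivial model would reach. The sketch below uses a small made-up data frame named toy as a stand-in; with the real data the same calls would be applied to data$PoorCare.

```r
# Made-up stand-in for the PoorCare column (not the real data)
toy <- data.frame(PoorCare = c(0, 0, 0, 1, 0, 1, 0, 0))

counts <- table(toy$PoorCare)   # number of rows per class
shares <- prop.table(counts)    # share of rows per class

print(counts)
print(round(shares, 2))
```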

Step 4 - Create train and test dataset

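set.seed(123) # arbitrary seed so that the split is reproducible
split <- sample.split(data$PoorCare, SplitRatio = 0.8) # pass the outcome column so that rows (not columns) are split
split

The sample.split() function returns a logical vector with one entry per row: TRUE marks a training row and FALSE marks a test row. With SplitRatio = 0.8, about 80% of the rows go to the training dataset and about 20% to the testing dataset, and the proportion of each PoorCare class is kept roughly the same in both.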

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the rows for which split is TRUE, and the test dataset gets all the rows for which split is FALSE.
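
The idea behind a class-preserving split can be sketched in base R without caTools. This is an illustration of the principle on made-up labels, not the internals of sample.split(): 80% of the rows are drawn within each class separately, so the train and test sets keep the same class ratio. The seed and the label vector y are assumptions for the example.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
y <- rep(c(0, 1), times = c(80, 20))  # made-up labels: 80 good care, 20 poor care

# Draw 80% of the row indices within each class separately
train_idx <- c()
for (cls in unique(y)) {
  idx <- which(y == cls)
  n_train <- round(0.8 * length(idx))
  train_idx <- c(train_idx, idx[sample.int(length(idx), n_train)])
}
is_train <- seq_along(y) %in% train_idx

print(table(y[is_train]))   # class counts in the training rows
print(table(y[!is_train]))  # class counts in the test rows
```

Both subsets end up with one poor-care row for every four good-care rows, which keeps the log loss measured on the test set comparable to the class balance the models were trained on.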

dim(train) # dimension/shape of train dataset
dim(test) # dimension/shape of test dataset

Step 5 - Create a glm model

model_glm <- glm(PoorCare ~ ., data = train, family = "binomial") # glm() fits a generalized linear model; family = "binomial" makes it a logistic regression
summary(model_glm) # coefficients, standard errors and p-values for the independent variables
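
Log loss needs probabilities, so it is worth seeing how a fitted logistic regression produces them. The sketch below fits glm() to a tiny made-up data set (toy, with columns x and y, both hypothetical) and checks that applying the inverse logit plogis() to the linear predictor gives the same values as predict() with type = "response".

```r
# Made-up data, chosen so the classes overlap (no perfect separation)
toy <- data.frame(x = 1:8, y = c(0, 0, 1, 0, 1, 0, 1, 1))
fit <- glm(y ~ x, data = toy, family = "binomial")

# Linear predictor (log-odds) built by hand from the coefficients
log_odds <- coef(fit)[["(Intercept)"]] + coef(fit)[["x"]] * toy$x

# The inverse logit maps log-odds to probabilities strictly between 0 and 1
p_manual  <- plogis(log_odds)
p_predict <- predict(fit, newdata = toy, type = "response")

print(all.equal(unname(p_predict), p_manual))  # TRUE
```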

Step 6 - Create a random forest model

model_rf <- randomForest(PoorCare ~ ., data = train) # PoorCare is numeric (0/1), so randomForest() fits a regression forest and warns about it
print(model_rf) # forest type, number of trees and error summary

Step 7 - Make predictions on the model using the test dataset

After the models are fitted, they are used to make predictions on unseen data, i.e., the test dataset.

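pred_test_glm <- predict(model_glm, test, type = "response") # predicted probability that PoorCare = 1
pred_test_glm
pred_test_rf <- predict(model_rf, test, type = "response") # regression forest output: average tree prediction between 0 and 1, used here as a probability estimate
pred_test_rf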
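
For contrast with log loss, the confusion-matrix view mentioned in the objective first turns each probability into a class label. A minimal base R sketch, using made-up labels and probabilities and an assumed cutoff of 0.5:

```r
actual <- c(0, 0, 0, 1, 0, 1, 0, 0)                  # made-up labels
prob   <- c(0.1, 0.2, 0.6, 0.7, 0.3, 0.4, 0.2, 0.1)  # made-up probabilities

# Turn probabilities into class labels with a 0.5 cutoff (an assumed threshold)
pred_class <- as.integer(prob > 0.5)

conf_mat <- table(Actual = actual, Predicted = pred_class)
print(conf_mat)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)  # 0.75
```

After thresholding, a probability of 0.6 and a probability of 0.99 count as the same prediction. Log loss keeps that difference, which is why it can separate models that a confusion matrix rates as equal.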

Step 8 - LogLoss function

Another performance metric we can use is log loss, which scores the probabilities a model predicts rather than only its predicted classes.

LogLoss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps) # clip so that log(0) never occurs
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted)) # average negative log-likelihood
}
LogLoss(test$PoorCare, pred_test_glm)
LogLoss(test$PoorCare, pred_test_rf)
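
One practical detail: a model can return a probability of exactly 0 or 1 (a forest does this when all trees agree), and the plain log loss formula then evaluates 0 * log(0), which is NaN in R. The sketch below shows the problem and one common workaround, clipping the probabilities by a small eps. The function names, the eps value and the input values are assumptions for illustration.

```r
# Plain log loss formula, without any clipping
log_loss_raw <- function(actual, predicted) {
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

# Hypothetical variant that clips probabilities away from exactly 0 and 1
log_loss_clipped <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

actual    <- c(1, 0)
predicted <- c(1, 0.2)  # made-up values; the first prediction is exactly 1

raw     <- log_loss_raw(actual, predicted)      # 0 * log(0) produces NaN
clipped <- log_loss_clipped(actual, predicted)  # stays finite

print(raw)
print(clipped)
```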

Here, we compare both classification models, i.e., the glm and randomForest models, using the LogLoss function. Log loss has a lower bound of 0 but no upper bound, so it is not restricted to the range 0 to 1. The closer the value is to 0, the better the model's predicted probabilities match the actual outcomes, so the model with the lower log loss is preferred. Here, randomForest returns a log loss of 0.48 while glm returns 0.50, so on this test split randomForest gives slightly better predictions. The gap is small, so the ranking may change with a different train/test split.
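
When picking the best model it helps to include a baseline that predicts the same probability (the share of positive cases) for every row, since a model is only useful if it beats that. The sketch below is one way to arrange such a comparison. All labels and probabilities are made up, the names model_a and model_b are hypothetical, and in practice the baseline share would come from the training set.

```r
# Log loss with clipping, redefined here so the sketch runs on its own
log_loss <- function(actual, predicted, eps = 1e-15) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

actual <- c(0, 0, 0, 1, 0, 1, 0, 0)  # made-up test labels
prevalence <- mean(actual)           # share of positive cases (here taken from the same labels)

# One log loss value per candidate, collected in a named vector
scores <- c(
  baseline = log_loss(actual, rep(prevalence, length(actual))),
  model_a  = log_loss(actual, c(0.1, 0.2, 0.1, 0.7, 0.3, 0.6, 0.2, 0.1)),
  model_b  = log_loss(actual, c(0.4, 0.3, 0.5, 0.5, 0.4, 0.6, 0.3, 0.4))
)
print(round(scores, 3))

# The candidate with the lowest log loss is the preferred one
best <- names(which.min(scores))
print(best)  # "model_a"
```

Keeping the scores in a named vector makes it easy to add more models later and still select the winner with which.min().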
