Machine Learning Tutorial: Logistic Regression

Logistic Regression

Suppose you are hired as a statistical consultant and asked to quantify the relationship between advertising budgets and the sales of a particular product. That is an ordinary regression problem, because the dependent variable, sales, is continuous. In many research and business settings, however, the dependent variable is categorical: whether a customer will convert or not, whether a patient is prone to cancer or not, or more generally whether an event occurred or not. In these settings the dependent variable is discrete and takes values such as “Yes” or “No”, or “High”, “Medium” and “Low”.

Data of this kind calls for a different technique than ordinary regression. The task is called classification, because we assign each outcome of the dependent variable to one of a set of predefined classes (“Yes” or “No”; “Cancer” or “Not Cancer”). Most classification methods do not output a class directly. Instead they predict the probability, between 0.0 and 1.0, that an observation belongs to a particular category.

We then apply a cutoff (commonly 0.5) to that probability to assign the observation to one of the modeled groups. Like any statistical technique, logistic regression comes with a few assumptions and guidelines:

The dependent variable must be categorical.
The independent variables can be continuous or categorical; depending on the software, categorical variables may need to be dummy coded.
As a rule of thumb, there should be at least 10 cases per independent variable.
Ratios of 20 or even 50 cases per variable are sometimes preferred, depending on the computational technique used to fit the logistic equation and get it to converge.
Unlike linear discriminant analysis, logistic regression makes no assumptions of normality, linearity or homogeneity of variance for the independent variables. Instead, it assumes that the errors (the actual Y minus the predicted Y) follow a binomial distribution. This assumption can be taken as robust as long as the sample is random.

Drawbacks of fitting a linear equation to this kind of data

Before we define the logistic equation, it helps to examine the ordinary linear equation and see where it falls short for this kind of data. We can then build an equation that fits this setting.

We assume the reader is familiar with the usual notation of statistical learning, where Y denotes the dependent variable and X1, X2, etc. denote the independent variables used to predict it. Consider a case where Y takes the values 0 and 1, with 0 meaning not converted and 1 meaning converted, and suppose we fit the usual least squares linear equation:

Y=a + BX + e

The problem with this approach shows up in the predictions. For large values of the X variables the fitted line predicts values above 1, and for small or zero values it can predict negative values. Neither makes sense, because Y has only two levels, 0 and 1, and a predicted probability must lie between them. Any straight line fitted to a dichotomous dependent variable has this problem. To avoid it we need a function whose output always lies between 0 and 1, whatever the values of the independent variables.
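The out-of-range problem can be seen on a small example. The following is a minimal sketch in plain Python, assuming a made-up data set of ten observations with a single predictor; the data and the helper name fit_least_squares are hypothetical and only serve the illustration.

```python
# Hypothetical toy data: x is a single predictor, y is 0 (not converted) or 1 (converted).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def fit_least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_least_squares(xs, ys)

# Predictions at the edges of, and beyond, the observed range of x.
for x in (0, 1, 10, 15):
    print(f"x = {x:>2}: predicted y = {a + b * x:.3f}")
```

On this toy data the fitted line already predicts a negative value at x = 1 and a value above 1 at x = 10, both inside the observed range, so the predictions cannot be read as probabilities.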

Solution in the form of the logit function

The core idea of logistic regression is that, instead of regressing on the dichotomous Y variable directly, we regress on the logit of Y, which is the natural logarithm of the odds.

Mathematical understanding of odds

Outcome	Count	Share of total
Converted	80	40%
Not Converted	120	60%
Total	200	100%

Here the probability of conversion is 40% (80/200), while the odds of conversion are 80/120 ≈ 0.67. In other words, a customer is about two-thirds as likely to convert as not to convert, or roughly 2 conversions for every 3 non-conversions. We regress on the log of the odds:

ln[p/(1-p)] = a + BX (for the table above, the log odds are ln(80/120) ≈ -0.405)
p/(1-p) = exp(a + BX)
p = 1/[1 + exp(-a - BX)]

ln is the natural logarithm, that is, the logarithm to base e, where e = 2.71828…
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the odds
ln[p/(1-p)] is the log odds, or "logit"
All other components of the model are the same.
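The quantities defined above can be checked numerically. This is a small sketch in plain Python using the counts from the conversion example (80 converted, 120 not converted); the function names logit and inverse_logit are chosen here for illustration.

```python
import math

def logit(p):
    """Log odds: ln[p / (1 - p)], defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map any real number z back to a probability: 1 / [1 + exp(-z)]."""
    return 1 / (1 + math.exp(-z))

converted, not_converted = 80, 120
p = converted / (converted + not_converted)  # probability of conversion
odds = p / (1 - p)                           # same as 80 / 120
log_odds = logit(p)

print(f"p = {p:.3f}, odds = {odds:.3f}, log odds = {log_odds:.3f}")
print(f"back to p: {inverse_logit(log_odds):.3f}")

# However extreme the input, the inverse logit stays strictly between 0 and 1.
for z in (-10, 0, 10):
    print(f"z = {z:>3}: p = {inverse_logit(z):.5f}")
```

The last loop illustrates why the transformation solves the range problem of the straight line: any value of a + BX maps to a valid probability.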

The sign of the coefficient B determines the direction of the relationship between X and the logit of Y. When B is greater than zero, larger (or smaller) X values are associated with larger (or smaller) logits of Y. Conversely, when B is less than zero, larger (or smaller) X values are associated with smaller (or larger) logits of Y. The logit is only one of several link functions used in generalized linear modelling; others include the probit link and the log link used in Poisson regression. In this article we discuss only the logit.


Interpretation of the odds ratio and log odds

Because of the transformations above, the hardest part of logistic regression is interpreting the output. A one unit increase in an independent variable changes the log odds by B, where B is the estimated coefficient; equivalently, it multiplies the odds by exp(B). So far we have covered forming the equation and interpreting it, but we also need to look at how the equation is solved. In linear regression we use ordinary least squares (OLS) to minimize the errors; here we use maximum likelihood estimation instead.

Maximum Likelihood function

The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for a and B such that the predicted probability p̂ of the event for each individual, computed with the equation for p above, corresponds as closely as possible to that individual’s observed outcome. In other words, we try to find a and B such that plugging these estimates into the model for p(X) yields a number close to one for all individuals for whom the event occurred (for example, customers who converted), and a number close to zero for all individuals for whom it did not. This intuition can be formalized using a mathematical equation called a likelihood function:

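Likelihood function

L(a, B) = Π(i: y_i = 1) p(x_i) × Π(j: y_j = 0) [1 - p(x_j)]

Here the first product runs over the observations where the event occurred, the second over those where it did not, and p(x) = 1/[1 + exp(-a - Bx)].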

The estimates of a and B are chosen to maximize this likelihood function. There are a few more types of logistic regression to understand, distinguished by the nature of the dependent variable (binary, nominal or ordinal) and the number of independent variables (one or more).
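To make the maximization concrete, here is a minimal sketch in plain Python. It assumes a made-up data set and uses Newton-Raphson updates, which is one common way to maximize the likelihood; the article does not name a specific optimizer, so that choice, the data, and the function names are assumptions for illustration.

```python
import math

# Hypothetical toy data: x is a single predictor, y is 1 if the event occurred, else 0.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def predict(a, b, x):
    """p(x) = 1 / [1 + exp(-a - b*x)]"""
    return 1 / (1 + math.exp(-a - b * x))

def log_likelihood(a, b):
    """Log of the likelihood: ln p(x) for events plus ln[1 - p(x)] for non-events."""
    return sum(
        math.log(predict(a, b, x)) if y == 1 else math.log(1 - predict(a, b, x))
        for x, y in zip(xs, ys)
    )

def fit(iterations=25):
    """Maximize the log-likelihood with Newton-Raphson steps, starting from a = b = 0."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        ps = [predict(a, b, x) for x in xs]
        # Gradient of the log-likelihood with respect to a and b.
        g_a = sum(y - p for y, p in zip(ys, ps))
        g_b = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Curvature terms: weights p(1 - p) multiplied by 1, x and x^2.
        h_aa = sum(p * (1 - p) for p in ps)
        h_ab = sum(p * (1 - p) * x for x, p in zip(xs, ps))
        h_bb = sum(p * (1 - p) * x * x for x, p in zip(xs, ps))
        det = h_aa * h_bb - h_ab * h_ab
        # Solve the 2x2 linear system for the update step by hand.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

a_hat, b_hat = fit()
print(f"a = {a_hat:.3f}, B = {b_hat:.3f}")
print(f"log-likelihood at start: {log_likelihood(0.0, 0.0):.3f}")
print(f"log-likelihood at fit:   {log_likelihood(a_hat, b_hat):.3f}")
```

Working with the logarithm of the likelihood turns the products into sums, which is numerically safer and has the same maximizer. Statistical packages typically run a closely related iterative procedure.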


Types of logistic regressions

Multi-Logistic regression

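If we predict Y using multiple independent variables X1, X2, etc., the model is called multi-logistic regression (more commonly, multiple logistic regression):

ln[p/(1-p)] = B0 + B1X1 + B2X2 + … + BpXp

where X1, X2, …, Xp are the independent variables and B0, B1, …, Bp are the coefficients.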
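With several independent variables the prediction works the same way: the coefficients and variable values are combined into one linear term, which is then passed through the inverse logit. A short sketch in plain Python, where the coefficient values and the function name predict_probability are hypothetical:

```python
import math

def predict_probability(intercept, coefficients, values):
    """Return p = 1 / [1 + exp(-(B0 + B1*X1 + ... + Bp*Xp))]."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-linear))

# Hypothetical coefficients B0 = -1.0, B1 = 0.5, B2 = 0.25 and one observation (X1, X2).
p = predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
print(f"predicted probability: {p:.3f}")
```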

Type	Dependent	Independent
Multi Logistic Regression	Categorical	More than one X variable
Binary Logistic Regression	Categorical with only two levels (Converted, Not Converted)	One or more X variables
Ordinal Logistic Regression	Ordinal with more than two levels (High, Medium, Low)	One or more X variables
Multinomial Logistic Regression	Nominal with more than two levels (Green, Blue, Red)	One or more X variables

Once the coefficients are estimated we can go ahead and make predictions. Before doing so, a few more things need attention. So far we have looked at the results descriptively, but inferential statistics also play a key role in assessing the statistical significance of the estimates. Unlike linear regression, which has a closed-form solution, the maximum likelihood method used for logistic regression is an iterative fitting process that cycles through repeated updates until it converges on an answer. Sometimes it does not converge, or it returns incorrect results with very large coefficient values (in the thousands or millions). Common causes are multicollinearity, categories of predictors that have no cases (zero cells), and complete separation, where the two groups are perfectly separated by the values of one or more independent variables.
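Complete separation can be illustrated with a tiny experiment. The sketch below, in plain Python, assumes a made-up data set in which every negative x has y = 0 and every positive x has y = 1, and fits it with plain gradient ascent (a simple stand-in chosen here, not necessarily what a given package uses). The function name slope_after is hypothetical.

```python
import math

# Hypothetical, perfectly separated data: x < 0 always has y = 0, x > 0 always has y = 1.
xs = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
ys = [0, 0, 0, 1, 1, 1]

def predict(a, b, x):
    return 1 / (1 + math.exp(-(a + b * x)))

def slope_after(steps, rate=0.1):
    """Run gradient ascent on the log-likelihood for `steps` updates and return the slope b."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        residuals = [y - predict(a, b, x) for x, y in zip(xs, ys)]
        a += rate * sum(residuals)
        b += rate * sum(r * x for r, x in zip(residuals, xs))
    return b

# The slope keeps growing the longer we iterate: there is no finite maximum to converge to.
for steps in (100, 1000, 10000):
    print(f"{steps:>6} steps: B = {slope_after(steps):.2f}")
```

Because a steeper curve always fits separated data a little better, the estimate never settles, which is why software reports non-convergence or implausibly large coefficients in this situation.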


These checks usually appear under headings such as the model summary, analysis of variance of the model, statistical tests of individual predictors, goodness-of-fit statistics, an assessment of the predicted probabilities, and robustness of the model. They are different names for the same idea: quantifying how confident we can be in our predictions and how significant the parameters are. We will understand these better with a real-life example.

Description of the data

This data set from the U.S. Fish and Wildlife Service contains information on North and South Carolinians who like to fish for bass. (Ref:- http://www.appstate.edu/~whiteheadjc/service/logit/beginner.htm)

Summary of data:-

We now run our first iteration in R on the above data to predict whether a person likes to fish, based on the given independent variables: cost, catch, income, employed, education, married and age, which are continuous, and nc and sex, which are categorical and are converted to factors in R.

Interpretation of output of the model

The output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic) and the associated p-values. Both cost and employed are statistically significant, as indicated by their p-values. Tests such as the likelihood ratio test, the Wald test and the score test are inferential statistics used to assess whether the current model improves on other iterations.

Pr(>|z|)

The null hypothesis in a logistic regression is that there is no relationship between an independent variable and the dependent variable, which, stated formally, means that its coefficient is zero. Pr(>|z|) is the probability of observing a z-statistic at least as extreme as the one estimated if the null hypothesis were true. If this probability is very small (usually below 0.05), the observed result would be rare under the null hypothesis, so we reject the null hypothesis and declare that variable statistically significant.
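As a rough illustration of where such a p-value comes from, the Wald z-statistic is the coefficient estimate divided by its standard error, and the two-sided p-value follows from the standard normal distribution. The sketch below is plain Python; the estimate and standard error are made-up numbers, not values from the fishing data, and wald_test is a hypothetical helper name.

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z-statistic and its two-sided p-value under a standard normal."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical coefficient estimate and standard error.
z, p_value = wald_test(estimate=0.9, std_error=0.3)
print(f"z = {z:.2f}, Pr(>|z|) = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```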


Estimates

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

For every one unit change in cost, the log odds of liking to fish (versus not liking) change by -0.0017.
For a one unit increase in employed, the log odds of liking to fish (versus not liking) increase by 1.24.
The indicator variable for sex has a slightly different interpretation. Being male versus female (assuming male is coded 1 and female 0) increases the log odds of liking to fish by 0.706.
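Log odds are hard to read directly, so it is common to exponentiate the coefficients and read them as multiplicative changes in the odds. The sketch below, in plain Python, applies this to the three estimates quoted above; the dictionary keys are just labels chosen for the printout.

```python
import math

# Coefficient estimates (changes in log odds) quoted in the text above.
coefficients = {"cost": -0.0017, "employed": 1.24, "sex (male)": 0.706}

# exp(B) is the factor by which the odds are multiplied per one unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    percent_change = (ratio - 1) * 100
    print(f"{name:<11} odds multiplied by {ratio:.3f} ({percent_change:+.1f}%)")
```

Read this way, each extra unit of cost lowers the odds of liking to fish by roughly 0.2%, while being employed multiplies the odds by about 3.5 and being male by about 2.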

Once we conclude that the model is statistically significant, we go ahead and make predictions. We can evaluate the performance of these binary predictions by predicting on a holdout sample (usually 70% of the data is used as the in-sample set for training and 30% is kept aside as the holdout sample to check robustness) or by k-fold cross-validation. Once the model is built, we test its accuracy on the holdout sample. The following matrix, known as the “confusion matrix”, shows which values the model predicted correctly and which it did not.

	Predicted 0	Predicted 1
Actual 0	True Negatives TN	False Positives FP
Actual 1	False Negatives FN	True Positives TP


These classifications are used to calculate accuracy, precision (also called positive predictive value), recall (also called sensitivity), specificity and negative predictive value. The following metrics, all based on the confusion matrix discussed above, are used to assess the accuracy or fit of the logistic model:

Accuracy	(TP + TN) / Total number of observations
Precision or positive predictive value	TP / (TP + FP)
Negative predictive value	TN / (TN + FN)
Sensitivity (recall)	TP / (TP + FN)
Specificity	TN / (TN + FP)
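The metrics above can be computed directly from the actual and predicted labels of a holdout sample. A minimal sketch in plain Python, assuming ten made-up observations (the lists actual and predicted are hypothetical):

```python
# Hypothetical holdout labels and model predictions (1 = event, 0 = no event).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # events predicted as events
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-events predicted as non-events
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-events predicted as events
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # events predicted as non-events

metrics = {
    "accuracy": (tp + tn) / len(pairs),
    "precision": tp / (tp + fp),
    "negative predictive value": tn / (tn + fn),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
}

print(f"TP = {tp}, TN = {tn}, FP = {fp}, FN = {fn}")
for name, value in metrics.items():
    print(f"{name:<26} {value:.3f}")
```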

As discussed before, logistic regression only predicts probability values. To assign a category we therefore need a probability cutoff: above it we say the event has occurred, and below it we assume the event is absent. This cutoff is generally taken as 0.5, but an optimal cutoff for better accuracy can be found using the “Sensitivity-Specificity Curve”.
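One way to read such a curve is to compute sensitivity and specificity over a grid of candidate cutoffs and pick the cutoff where the two are closest. The sketch below does this in plain Python; the selection rule, the predicted probabilities and the labels are all assumptions made for illustration, since the article does not spell out the exact procedure.

```python
# Hypothetical predicted probabilities and true labels for a holdout sample.
probs  = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def sensitivity_specificity(cutoff):
    """Classify as 1 when the probability is at or above the cutoff, then score the result."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    return tp / (tp + fn), tn / (tn + fp)

cutoffs = [c / 10 for c in range(1, 10)]  # 0.1, 0.2, ..., 0.9
curve = {c: sensitivity_specificity(c) for c in cutoffs}

for c in cutoffs:
    sens, spec = curve[c]
    print(f"cutoff {c:.1f}: sensitivity = {sens:.3f}, specificity = {spec:.3f}")

# Assumed rule: choose the cutoff where sensitivity and specificity are closest.
best_cutoff = min(cutoffs, key=lambda c: abs(curve[c][0] - curve[c][1]))
print(f"chosen cutoff: {best_cutoff}")
```

On this toy data the two measures are closest at a cutoff of 0.4 rather than the default 0.5. Other rules, such as maximizing accuracy or weighting false positives and false negatives differently, may be more appropriate depending on the cost of each kind of error.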
