How to verify assumptions of linear regression using R plots


Recipe Objective

How to verify assumptions of linear regression using R plots.

Linear regression is a supervised learning algorithm used for continuous variables. Simple linear regression describes the relation between two variables, an independent variable (x) and a dependent variable (y). The equation for simple linear regression is **y = mx + c**, where m is the slope and c is the intercept. Linear regression makes some assumptions about the data that should be verified before making predictions. In this recipe, a dataset is used where the relation between the cost of bags and their Width, Length, Height, Weight1, and Weight is to be determined using simple linear regression. This recipe provides the steps to validate the assumptions of linear regression using R plots.
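To make the equation **y = mx + c** concrete, here is a minimal sketch (using simulated data, since no dataset is needed for the illustration) showing that lm() recovers a known slope and intercept:

```r
# Simulate data from y = 3x + 10 plus noise, then fit a simple linear model.
set.seed(7)
x <- 1:50
y <- 3 * x + 10 + rnorm(50, sd = 2)   # true slope m = 3, intercept c = 10

fit <- lm(y ~ x)
coef(fit)   # estimated intercept and slope, close to 10 and 3
```

The fitted coefficients land near the true values, which is the behaviour the rest of this recipe relies on.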

Step 1 - Install the necessary libraries

install.packages("ggplot2")
install.packages("dplyr")
library(ggplot2)
library(dplyr)

Step 2 - Read a csv file and explore the data

data <- read.csv("/content/Data_1.csv")
head(data)                    # head() returns the top 6 rows of the dataframe
summary(data)                 # returns the statistical summary of the data columns
plot(data$Width, data$Cost)   # plot() gives a visual representation of the relation between Width and Cost
cor(data$Width, data$Cost)    # correlation between the two variables
# the output shows a strong positive correlation between the two variables

Step 3 - Train and Test data

train_data <- read.csv('/content/train_data.csv')
head(train_data)
test_data <- read.csv('/content/test_data.csv')
head(test_data)

Step 4 - Create a linear regression model

Here, a simple linear regression model is created with Cost as the dependent variable (y) and Width as the independent variable (x).

model <- lm(Cost ~ Width, data=train_data)

summary() returns the summary of the trained model; the performance metrics it reports, such as R-squared and the residual standard error, help us check how well the model fits the data.

summary(model)
data.graph <- ggplot(data, aes(x = Width, y = Cost)) + geom_point()
data.graph
data.graph <- data.graph + geom_smooth(method = "lm", col = "black")   # add the linear regression line to the plotted data
data.graph

Step 5 - Make predictions on the test dataset

y_pred <- predict(model,test_data)

The predicted values for Cost are:

y_pred
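Step 4 mentioned R-squared and RMSE as performance metrics, but the recipe never computes them on the test set. The sketch below shows one way to do so; since the train/test CSVs are not included here, simulated data stands in for them, and the column names (Width, Cost) are taken from the recipe:

```r
# Simulated stand-ins for train_data.csv and test_data.csv.
set.seed(42)
train <- data.frame(Width = runif(80, 3, 8))
train$Cost <- 50 * train$Width + rnorm(80, sd = 20)
test <- data.frame(Width = runif(20, 3, 8))
test$Cost <- 50 * test$Width + rnorm(20, sd = 20)

fit <- lm(Cost ~ Width, data = train)
pred <- predict(fit, newdata = test)

# Root-mean-square error of the test predictions.
rmse <- sqrt(mean((test$Cost - pred)^2))
# R-squared on the test set: 1 - residual sum of squares / total sum of squares.
r2 <- 1 - sum((test$Cost - pred)^2) / sum((test$Cost - mean(test$Cost))^2)
rmse
r2
```

A lower RMSE and an R-squared closer to 1 indicate a better fit on unseen data.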

Step 6 - Assumptions of linear regression

**Assumption 1 -** There must be a linear relation between the dependent variable (y) and the independent variable (x). Check this with the correlation function cor(). Here the correlation is a strong positive one, so the assumption is satisfied.

cor(data$Cost, data$Width)

**Assumption 2** - Normality: use hist() to check whether the variable follows a roughly normal distribution. Here the histogram of Width is roughly bell-shaped, so we can say it fulfills the assumption criteria and proceed with the linear regression.

hist(data$Width)
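Strictly speaking, the normality assumption applies to the model's residuals rather than the raw variables, so a common complementary check is a Q-Q plot or a Shapiro-Wilk test on model$residuals. A sketch, again with simulated data standing in for the recipe's CSV:

```r
# Fit a simple model on simulated data and inspect residual normality.
set.seed(1)
d <- data.frame(Width = runif(100, 3, 8))
d$Cost <- 50 * d$Width + rnorm(100, sd = 15)
m <- lm(Cost ~ Width, data = d)

qqnorm(m$residuals)            # points should fall near the reference line
qqline(m$residuals)
shapiro.test(m$residuals)      # large p-value => no evidence against normality
```

If the Q-Q points hug the line and the Shapiro-Wilk p-value is above 0.05, the residuals are consistent with normality.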

**Assumption 3** - Linearity: check the linearity by plotting a scatter plot with the plot() function. The scatter plot shows a reasonably good linear relation between the two variables, so we can proceed with the linear regression.

plot(data$Width, data$Cost)

**Assumption 4** - The mean of the residuals is zero. If the mean of the residuals is zero or close to zero, the assumption is satisfied. Here, as the mean of the residuals is close to zero, we can say that the assumption holds true for this model.

mean(model$residuals)

**Assumption 5** - Check for homoscedasticity. For a simple linear regression model, par(mfrow=c(2,2)) followed by plot() produces four diagnostic plots. The top-left and bottom-left plots show how the residuals vary as the fitted values increase. In the top-left plot, as the fitted values along x increase, the residuals first increase and then decrease; this pattern is traced by the red line, which should be approximately flat if the disturbances are homoscedastic. The bottom-left plot checks the same thing, with the standardized residuals on the y-axis. As the line appears fairly straight here, we say the assumption holds true.

par(mfrow = c(2, 2))                    # set a 2-row, 2-column plot layout
mod_1 <- lm(Width ~ Cost, data = data)  # linear model
plot(mod_1)

Step 7 - Checking the assumptions using 'gvlma' package

install.packages('gvlma')
library(gvlma)
mod <- lm(Height ~ Cost, data = data)
gvlma::gvlma(mod)

The output of the assumptions check using 'gvlma' is:

Call:
lm(formula = Height ~ Cost, data = data)

Coefficients:
(Intercept)         Cost  
   5.516366     0.008673  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = mod) 

                     Value   p-value                   Decision
Global Stat        72.0578 8.327e-15 Assumptions NOT satisfied!
Skewness            0.7632 3.823e-01    Assumptions acceptable.
Kurtosis            1.5753 2.094e-01    Assumptions acceptable.
Link Function      69.5354 1.110e-16 Assumptions NOT satisfied!
Heteroscedasticity  0.1840 6.680e-01    Assumptions acceptable.

**Global Stat**: checks whether or not the relationship between the dependent variable (y) and the independent variable (x) is roughly linear.

**Skewness**: tests the assumption that the distribution of the residuals is normal (no skew).

**Kurtosis**: tests the assumption that the distribution of the residuals is normal (no excess kurtosis).

**Link function**: checks whether the dependent variable is truly continuous rather than categorical.

**Heteroskedasticity**: here the error variance is equally random, and hence we have homoskedasticity.

