How to verify assumptions of linear regression using R plots


Recipe Objective

How to verify assumptions of linear regression using R plots.

Linear regression is a supervised learning algorithm used for continuous variables. Simple linear regression describes the relation between two variables, an independent variable (x) and a dependent variable (y). The equation for simple linear regression is **y = mx + c**, where m is the slope and c is the intercept. Linear regression makes some assumptions about the data that should be verified before making predictions. In this recipe, a dataset is used where the relation between the cost of bags and their Width, Length, Height, Weight1, and Weight is to be determined using simple linear regression. This recipe provides the steps to validate the assumptions of linear regression using R plots.
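To make the equation **y = mx + c** concrete, here is a minimal sketch (using simulated data, since no dataset is needed for the illustration) showing that lm() recovers a known slope and intercept:

```r
# Simulate data from y = 3x + 10 plus noise, then fit a simple linear model.
set.seed(7)
x <- 1:50
y <- 3 * x + 10 + rnorm(50, sd = 2)   # true slope m = 3, intercept c = 10

fit <- lm(y ~ x)
coef(fit)   # estimated intercept and slope, close to 10 and 3
```

The fitted coefficients land near the true values, which is the behaviour the rest of this recipe relies on.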

Step 1 - Install the necessary libraries

install.packages("ggplot2")
install.packages("dplyr")
library(ggplot2)
library(dplyr)

Step 2 - Read a csv file and explore the data

data <- read.csv("/content/Data_1.csv")
head(data)                    # head() returns the top 6 rows of the dataframe
summary(data)                 # returns the statistical summary of the data columns
plot(data$Width, data$Cost)   # plot() gives a visual representation of the relation between Width and Cost
cor(data$Width, data$Cost)    # correlation between the two variables
# the output shows a strong positive correlation between the two variables

Step 3 - Train and Test data

train_data <- read.csv('/content/train_data.csv')
head(train_data)
test_data <- read.csv('/content/test_data.csv')
head(test_data)

Step 4 - Create a linear regression model

Here, a simple linear regression model is created with Cost as the dependent variable (y) and Width as the independent variable (x).

model <- lm(Cost ~ Width, data=train_data)

summary() returns the summary of the trained model; the performance metrics it reports, such as R-squared and the residual standard error, help us check how well the model fits the data.

summary(model)
data.graph <- ggplot(data, aes(x = Width, y = Cost)) + geom_point()
data.graph
data.graph <- data.graph + geom_smooth(method = "lm", col = "black")   # add the linear regression line to the plotted data
data.graph

Step 5 - Make predictions on the test dataset

y_pred <- predict(model,test_data)

The predicted values for Cost are:

y_pred
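Step 4 mentioned R-squared and RMSE as performance metrics, but the recipe never computes them on the test set. The sketch below shows one way to do so; since the train/test CSVs are not included here, simulated data stands in for them, and the column names (Width, Cost) are taken from the recipe:

```r
# Simulated stand-ins for train_data.csv and test_data.csv.
set.seed(42)
train <- data.frame(Width = runif(80, 3, 8))
train$Cost <- 50 * train$Width + rnorm(80, sd = 20)
test <- data.frame(Width = runif(20, 3, 8))
test$Cost <- 50 * test$Width + rnorm(20, sd = 20)

fit <- lm(Cost ~ Width, data = train)
pred <- predict(fit, newdata = test)

# Root-mean-square error of the test predictions.
rmse <- sqrt(mean((test$Cost - pred)^2))
# R-squared on the test set: 1 - residual sum of squares / total sum of squares.
r2 <- 1 - sum((test$Cost - pred)^2) / sum((test$Cost - mean(test$Cost))^2)
rmse
r2
```

A lower RMSE and an R-squared closer to 1 indicate a better fit on unseen data.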

Step 6 - Assumptions of linear regression

**Assumption 1 -** There must be a linear relation between the dependent variable (y) and the independent variable (x). Check this with the correlation function cor(). Here the correlation is a strong positive one, so the assumption is satisfied.

cor(data$Cost, data$Width)

**Assumption 2** - Normality: use hist() to check whether the variable follows a roughly normal distribution. Here the histogram of Width is roughly bell-shaped, so we can say it fulfills the assumption criteria and proceed with the linear regression.

hist(data$Width)
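Strictly speaking, the normality assumption applies to the model's residuals rather than the raw variables, so a common complementary check is a Q-Q plot or a Shapiro-Wilk test on model$residuals. A sketch, again with simulated data standing in for the recipe's CSV:

```r
# Fit a simple model on simulated data and inspect residual normality.
set.seed(1)
d <- data.frame(Width = runif(100, 3, 8))
d$Cost <- 50 * d$Width + rnorm(100, sd = 15)
m <- lm(Cost ~ Width, data = d)

qqnorm(m$residuals)            # points should fall near the reference line
qqline(m$residuals)
shapiro.test(m$residuals)      # large p-value => no evidence against normality
```

If the Q-Q points hug the line and the Shapiro-Wilk p-value is above 0.05, the residuals are consistent with normality.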

**Assumption 3** - Linearity: check the linearity by plotting a scatter plot with the plot() function. The scatter plot shows a reasonably good linear relation between the two variables, so we can proceed with the linear regression.

plot(data$Width, data$Cost)

**Assumption 4** - The mean of the residuals is zero. If the mean of the residuals is zero or close to zero, the assumption is satisfied. Here, as the mean of the residuals is close to zero, we can say that the assumption holds true for this model.

mean(model$residuals)

**Assumption 5** - Check for homoscedasticity. For a simple linear regression model, par(mfrow=c(2,2)) followed by plot() produces four diagnostic plots. The top-left and bottom-left plots show how the residuals vary as the fitted values increase. In the top-left plot, as the fitted values along x increase, the residuals first increase and then decrease; this pattern is traced by the red line, which should be approximately flat if the disturbances are homoscedastic. The bottom-left plot checks the same thing, with the standardized residuals on the y-axis. As the line appears fairly straight here, we say the assumption holds true.

par(mfrow = c(2, 2))                    # set a 2-row, 2-column plot layout
mod_1 <- lm(Width ~ Cost, data = data)  # linear model
plot(mod_1)

Step 7 - Checking the assumptions using 'gvlma' package

install.packages('gvlma')
library(gvlma)
mod <- lm(Height ~ Cost, data = data)
gvlma::gvlma(mod)

The output of the assumptions check using 'gvlma' is:

Call:
lm(formula = Height ~ Cost, data = data)

Coefficients:
(Intercept)         Cost  
   5.516366     0.008673  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = mod) 

                     Value   p-value                   Decision
Global Stat        72.0578 8.327e-15 Assumptions NOT satisfied!
Skewness            0.7632 3.823e-01    Assumptions acceptable.
Kurtosis            1.5753 2.094e-01    Assumptions acceptable.
Link Function      69.5354 1.110e-16 Assumptions NOT satisfied!
Heteroscedasticity  0.1840 6.680e-01    Assumptions acceptable.

**Global Stat**: checks whether or not the relationship between the dependent variable (y) and the independent variable (x) is roughly linear.

**Skewness**: tests the assumption that the distribution of the residuals is normal (no skew).

**Kurtosis**: tests the assumption that the distribution of the residuals is normal (no excess kurtosis).

**Link function**: checks whether the dependent variable is truly continuous rather than categorical.

**Heteroskedasticity**: here the error variance is equally random, and hence we have homoskedasticity.

