How to plot residuals of a linear regression in R

Recipe Objective

How to plot residuals of a linear regression in R.

Linear regression is a supervised learning algorithm used for continuous variables. Simple linear regression describes the relation between two variables, an independent variable (x) and a dependent variable (y). The equation for simple linear regression is **y = mx + c**, where m is the slope and c is the intercept. The model is trained and predictions (y_pred) are made over the test dataset, fitting a line between x and y_pred. The differences between the actual values and the fitted values are known as the residuals or errors; the sum of their squares is the **residual sum of squares (RSS)**, which should be as low as possible.

Residual plots are used to analyze whether the residuals of a regression follow a normal distribution and whether they exhibit heteroscedasticity, i.e. unequal scatter of the residuals or errors. In this recipe, a dataset is considered where the relation between the cost of bags and their width is to be determined using simple linear regression, and the residuals are plotted.
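As a minimal illustration of these definitions (using hypothetical toy data, not the bags dataset used below), residuals and the RSS can be computed directly from a fitted lm model:

```r
# Toy data: y is an exact linear function of x (hypothetical values)
x <- 1:10
y <- 2 * x + 3

# Fit a simple linear regression and extract the residuals
fit <- lm(y ~ x)
res <- resid(fit)   # residuals: actual values minus fitted values

# Residual sum of squares (RSS)
rss <- sum(res^2)
```

Since y is an exact linear function of x here, the residuals, and hence the RSS, are essentially zero.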

Step 1 - Install the necessary libraries

install.packages("caTools") # for sample.split, used in linear regression workflows
install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)
library(caTools)

Step 2 - Read a csv file and do EDA : Exploratory Data Analysis

The dataset attached contains the data of 160 different bags associated with ABC industries. The bags have certain attributes, which are described below:

1. Height – the height of the bag
2. Width – the width of the bag
3. Length – the length of the bag
4. Weight – the weight the bag can carry
5. Weight1 – the weight the bag can carry after expansion

The company now wants to predict the cost it should set for a new variant of these kinds of bags.

data <- read.csv("/content/Data_1.csv")
dim(data)         # returns the shape of the data, i.e. the number of rows and columns
print(head(data)) # head() returns the top 6 rows of the dataframe
summary(data)     # returns the statistical summary of the data columns

Step 3 - Train and Test data

The training data is used for building the model, while the testing data is used for making predictions. After the model is fitted on the training dataset and its errors are minimized, it is used to make predictions on unseen data, the test dataset.

split <- sample.split(data$Cost, SplitRatio = 0.8) # split on the target column; sample.split expects a vector, not a data frame
split

sample.split divides the data into train and test datasets with a ratio of 0.8: 80% of the observations go to the training dataset and 20% to the testing dataset.

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the data points for which split is TRUE, and the test dataset gets all the data points for which it is FALSE.
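caTools is one convenient option; if it is unavailable, an equivalent 80/20 split can be sketched in base R (the data frame below is a hypothetical stand-in for the bags data):

```r
# Hypothetical stand-in for the bags data frame (100 rows)
data <- data.frame(Cost = rnorm(100), Width = rnorm(100))

set.seed(42) # make the split reproducible
n <- nrow(data)
train_idx <- sample(seq_len(n), size = floor(0.8 * n)) # 80% of row indices

train <- data[train_idx, ]  # 80% of the rows
test <- data[-train_idx, ]  # the remaining 20%
```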

dim(train) # dimension/shape of the train dataset
dim(test)  # dimension/shape of the test dataset

Step 4 - Create a linear regression model

Here, a simple linear regression model is created with Cost as the dependent variable (y) and Width as the independent variable (x).

model <- lm(Cost ~ Width, data = train)

summary() gives the summary of the trained model; the R-squared and residual standard error it reports help us check how well the model is performing.

summary(model)
res <- resid(model) # get the list of residuals
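These fit statistics can also be pulled out programmatically. A sketch on hypothetical toy data (RMSE is computed manually from the residuals here, since summary() itself reports the residual standard error rather than RMSE):

```r
# Toy training data for illustration only; in the recipe, use the
# model fitted on the bags data above instead.
set.seed(1)
train <- data.frame(Width = runif(50, 3, 9))
train$Cost <- 100 * train$Width + rnorm(50, sd = 20)
model <- lm(Cost ~ Width, data = train)

r2 <- summary(model)$r.squared     # coefficient of determination
rmse <- sqrt(mean(resid(model)^2)) # root mean squared error of the fit
```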

Step 5 - Plot fitted vs residual plot

# produce a residual vs. fitted plot to visualize heteroscedasticity
plot(fitted(model), res)

# add a horizontal line at 0
abline(0, 0)

Step 6 - Plot a Q-Q plot

A Q-Q plot helps determine whether the residuals follow a normal distribution. For the data to be normally distributed, the points must fall roughly along a straight 45-degree line.

# create a Q-Q plot of the residuals
qqnorm(res)

# add a straight diagonal line to the plot
qqline(res)

Residuals tend to stray away from the plotted line, indicating they are not normally distributed.
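The visual Q-Q check can be backed up with a formal normality test; one common choice (an addition to the original recipe, shown here on stand-in residuals) is the Shapiro-Wilk test:

```r
# Stand-in residuals for illustration; in the recipe, pass the
# res vector extracted from the fitted model instead.
set.seed(7)
res <- rnorm(100)

# Shapiro-Wilk normality test: a small p-value (e.g. < 0.05)
# suggests the residuals are not normally distributed.
p_value <- shapiro.test(res)$p.value
```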

Step 7 - Plot a density plot

A density plot helps visualize whether the residuals are normally distributed. The curve must be approximately bell-shaped for the residuals to follow a normal distribution.

# create a density plot of the residuals
plot(density(res))

The density plot shows a rough bell-shaped symmetry with some values skewed to the right.
