How to do linear regression in R

This recipe helps you do linear regression in R
Last Updated: 21 Jul 2022



Recipe Objective

How to do linear regression in R?

Linear regression is a supervised learning algorithm used to predict a continuous target variable. Simple linear regression describes the relationship between two variables, an independent variable (x) and a dependent variable (y). The equation for simple linear regression is **y = mx + c**, where m is the slope and c is the intercept. A scatter plot of x against y is usually drawn first to check whether a straight line is a reasonable fit. The model is then trained, predictions (y_pred) are made on the test dataset, and the fitted line is the line through x and y_pred.

The difference between an actual value and its fitted value is called a residual (or error). The sum of the squared residuals is the residual sum of squares (RSS), and the best-fit line is the one that makes RSS as small as possible. Two methods are commonly used to minimize RSS: ordinary least squares (OLS) and gradient descent.

The accuracy of the model is checked using the **performance metrics** R squared and RMSE (root mean squared error). R squared normally ranges from 0 to 1 and should be as high as possible, since it represents the proportion of the variance in y that the model explains. RMSE measures how far, on average, the predicted values are from the actual values, in the units of y, and should be as low as possible. In this recipe, simple linear regression is used to determine the relationship between the cost of bags and their width.
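As a rough illustration of how OLS arrives at m and c, the sketch below computes the closed-form estimates by hand and compares them with lm(). The five data points are invented for this example and are not taken from the recipe's bag dataset; the names `c0`, `y_fit`, and `rss` are also just illustrative.

```r
# Made-up data for illustration only; not the recipe's bag dataset.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form OLS estimates for y = m*x + c
# (the intercept is named c0 to avoid masking R's c() function)
m  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c0 <- mean(y) - m * mean(x)

# Fitted values and the residual sum of squares that OLS minimizes
y_fit <- m * x + c0
rss   <- sum((y - y_fit)^2)

# lm() should report the same intercept and slope
fit <- lm(y ~ x)
print(c(intercept = c0, slope = m, rss = rss))
print(coef(fit))
```

Any other choice of slope and intercept would give a larger RSS on these points, which is what "best fit" means here.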


Step 1 - Install the necessary libraries

    install.packages("ggplot2")
    install.packages("dplyr")
    install.packages("caTools")  # provides sample.split() for the train/test split
    library(caTools)
    library(ggplot2)
    library(dplyr)

Step 2 - Read a csv file and do EDA : Exploratory Data Analysis

The dataset attached contains the data of 160 different bags associated with ABC industries. The bags have certain attributes which are described below:

1. Height – The height of the bag
2. Width – The width of the bag
3. Length – The length of the bag
4. Weight – The weight the bag can carry
5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags.

    data <- read.csv("R_220_Data_1.csv")
    dim(data)          # returns the shape of the data, i.e. the number of rows and columns
    print(head(data))  # head() returns the first 6 rows of the data frame
    summary(data)      # returns a statistical summary of each column
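Beyond dim(), head(), and summary(), it can help to check column types and missing values before modelling, because lm() drops incomplete rows by default. The sketch below uses a small invented data frame (`bags`) as a stand-in for the CSV; only the column names Cost and Width come from the recipe, and the values are made up.

```r
# Hypothetical stand-in for the bag data; the values are invented.
bags <- data.frame(
  Cost  = c(242, 290, NA, 363, 430),
  Width = c(4.02, 4.31, 4.70, 4.46, 5.13)
)

str(bags)                            # column types and a preview of the values

na_counts <- colSums(is.na(bags))    # number of missing values per column
print(na_counts)

# Keep only complete rows before fitting a model
bags_clean <- na.omit(bags)
print(dim(bags_clean))
```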

Step 3 - Plot a scatter plot between x and y

    plot(data$Width, data$Cost)  # scatter plot showing the relation between Width and Cost
    cor(data$Width, data$Cost)   # Pearson correlation between the two variables
    # the output is positive; the closer it is to 1, the stronger the linear relationship between the two variables
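To make the correlation value more concrete, here is a minimal sketch that computes Pearson's r by hand, checks it against cor(), and shows that in simple linear regression R squared equals r squared. The vectors `x` and `y` are invented for illustration and do not come from the bag dataset.

```r
# Made-up values for illustration only.
x <- c(2.1, 3.4, 4.0, 5.2, 6.8, 7.5)
y <- c(150, 210, 260, 330, 400, 470)

# Pearson correlation: covariance scaled by both standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# cor.test() adds a confidence interval and a p-value for the correlation
ct <- cor.test(x, y)
print(ct$estimate)
print(ct$conf.int)

# With a single predictor, the model's R squared is the squared correlation
r_sq_model <- summary(lm(y ~ x))$r.squared
print(c(r = r_builtin, r_squared = r_builtin^2, model_r_squared = r_sq_model))
```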

Step 4 - Train and Test data

The training data is used to build the model, while the test data is used to evaluate its predictions. After the model is fitted on the training set, which means finding the coefficients that minimize the errors, it is used to make predictions on the test set, which it has not seen before.

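    set.seed(123)  # optional: any fixed seed makes the random split reproducible
    # sample.split() expects a vector, so pass a column rather than the whole data frame
    split <- sample.split(data$Cost, SplitRatio = 0.8)
    split

sample.split() returns a logical vector with one entry per row. With a split ratio of 0.8, about 80% of the rows are marked TRUE and go into the training dataset, and the remaining 20% are marked FALSE and go into the testing dataset.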

    train <- subset(data, split == TRUE)
    test <- subset(data, split == FALSE)

The train dataset gets every row for which split is TRUE, and the test dataset gets every row for which split is FALSE.
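The same 80/20 idea can be sketched in base R without caTools by sampling row indices. This is an alternative to sample.split(), not what the recipe itself uses. The data frame `toy` is simulated (160 rows to mirror the size of the bag dataset), and the seed and coefficients are arbitrary choices for the example.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible

# Simulated stand-in for the bag data; the values are not real.
n <- 160
toy <- data.frame(Width = runif(n, min = 1, max = 8))
toy$Cost <- 50 + 60 * toy$Width + rnorm(n, sd = 20)

# Draw 80% of the row indices at random for training
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_toy <- toy[train_idx, ]   # rows selected for training
test_toy  <- toy[-train_idx, ]  # all remaining rows

print(dim(train_toy))
print(dim(test_toy))
```

Every row lands in exactly one of the two sets, which is the property that matters for an honest evaluation on the test data.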

    dim(train)          # dimension/shape of the train dataset
    print(head(train))
    dim(test)           # dimension/shape of the test dataset
    print(head(test))

Step 5 - Create a linear regression model

Here, a simple linear regression model is created with:

y (dependent variable) - Cost
x (independent variable) - Width

    model <- lm(Cost ~ Width, data = train)

summary() gives the summary of the trained model. The R squared and the residual standard error (closely related to RMSE) that it reports help us check how well the model is performing.

    summary(model)
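The numbers printed by summary() can also be pulled out programmatically. The sketch below fits a model on simulated data (the variables `width` and `cost`, the seed, and the coefficients are assumptions made for this example) and extracts the coefficient table, R squared, and residual standard error. It also shows how the residual standard error differs from the training RMSE: the former divides RSS by the residual degrees of freedom, the latter by the number of observations.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data; not the recipe's bag dataset.
width <- runif(50, min = 1, max = 8)
cost  <- 40 + 55 * width + rnorm(50, sd = 15)

toy_model <- lm(cost ~ width)
s <- summary(toy_model)

print(coef(s))       # estimates, standard errors, t values, p-values
print(s$r.squared)   # R squared on the training data
print(s$sigma)       # residual standard error: sqrt(RSS / residual df)

# Training RMSE divides by n instead, so it is slightly smaller
rss        <- sum(residuals(toy_model)^2)
train_rmse <- sqrt(rss / length(cost))
print(train_rmse)
```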

Step 6 - Add regression line to the plot

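    data.graph <- ggplot(data, aes(x = Width, y = Cost)) + geom_point()
    data.graph
    # Add the linear regression line to the plotted data
    # (geom_smooth() fits its own lm on the plotted data, here the full dataset, not on train)
    data.graph <- data.graph + geom_smooth(method = "lm", col = "black")
    data.graph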

Step 7 - Make predictions on the test dataset

    y_pred <- predict(model, newdata = test)  # predictions are made on the testing dataset

The predicted values for Cost are:

    y_pred
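Since the stated goal is to price a new variant of the bags, it may be useful to see how predict() handles rows that were not in the data at all. The sketch below is an assumption-laden example: the data are simulated, and the widths 3.5 and 6.0 in `new_bags` are hypothetical. Passing interval = "prediction" returns a lower and upper bound alongside each point prediction.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated stand-in for the training data; the values are not real.
toy <- data.frame(Width = runif(40, min = 1, max = 8))
toy$Cost <- 40 + 55 * toy$Width + rnorm(40, sd = 15)

toy_model <- lm(Cost ~ Width, data = toy)

# Hypothetical new bags; the column name must match the predictor in the formula
new_bags <- data.frame(Width = c(3.5, 6.0))

# Point predictions plus 95% prediction intervals
pred <- predict(toy_model, newdata = new_bags,
                interval = "prediction", level = 0.95)
print(pred)   # matrix with columns fit, lwr, upr
```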

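Step 8 - Finding RMSE and R squared

    # RMSE: square root of the mean squared difference between predicted and actual Cost on the test set
    rmse_val <- sqrt(mean((y_pred - test$Cost)^2))
    rmse_val

    # R squared on the test set
    SSE <- sum((test$Cost - y_pred)^2)           # residual sum of squares
    SST <- sum((test$Cost - mean(test$Cost))^2)  # total sum of squares
    r2_test <- 1 - SSE / SST
    print(r2_test)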
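The two metrics can be wrapped in small helper functions and checked on numbers simple enough to verify by hand. The function names `rmse` and `r_squared` and the four-value vectors are illustrative, not part of the recipe.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Tiny made-up example: the errors are -2, 2, -3, 3
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

print(rmse(actual, predicted))       # sqrt(26 / 4) = sqrt(6.5)
print(r_squared(actual, predicted))  # 1 - 26 / 500 = 0.948
```

Note that both sums in R squared are taken around the actual values, with the predictions appearing only in the residual term.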
