How to do linear regression in R

This recipe helps you do linear regression in R

Recipe Objective

How to do linear regression in R?

Linear regression is a supervised learning algorithm used to predict a continuous target variable. Simple linear regression describes the relationship between two variables: an independent variable (x) and a dependent variable (y). The equation for simple linear regression is **y = mx + c**, where m is the slope and c is the intercept.

In practice, a scatter plot of x against y is drawn first to check whether a straight line is a reasonable fit. The model is then trained on part of the data, predictions (y_pred) are made on the test dataset, and the fitted line can be drawn through the points. The difference between an actual value and the corresponding fitted value is called a residual, or error. The sum of the squared residuals is the residual sum of squares (RSS), and the best-fit line is the one that makes RSS as small as possible. Two methods are commonly used to minimise RSS:

- OLS (ordinary least squares)
- Gradient descent

The accuracy of the model is checked using the **performance metrics** R squared and RMSE (root mean squared error). R squared ranges between 0 and 1 on the data used to fit the model and should be as high as possible, since it represents the proportion of the variance in y that is explained by the model. RMSE measures how far, on average, the predicted values are from the actual values, in the units of y, and should be as low as possible.

In this recipe, simple linear regression is used to determine the relationship between the cost of bags and their width.
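To make these quantities concrete, here is a minimal sketch that computes the OLS slope and intercept by hand and compares them with `lm()`. The five data points are invented for illustration and are not taken from the recipe's dataset. The intercept is stored in `c0` to avoid masking R's `c()` function.

```r
# Toy data, invented for illustration only (not the bag dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form OLS estimates for y = m*x + c
m  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c0 <- mean(y) - m * mean(x)

# Fitted values and the metrics described above
y_hat     <- m * x + c0
rss       <- sum((y - y_hat)^2)          # residual sum of squares
tss       <- sum((y - mean(y))^2)        # total sum of squares
r_squared <- 1 - rss / tss
rmse      <- sqrt(mean((y - y_hat)^2))

# lm() uses OLS, so it should report the same slope and intercept
fit <- lm(y ~ x)
print(c(intercept = c0, slope = m))
print(coef(fit))
print(c(RSS = rss, R2 = r_squared, RMSE = rmse))
```

Gradient descent would arrive at approximately the same line iteratively. For a single predictor the closed form above is simpler.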


Step 1 - Install the necessary libraries

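install.packages("ggplot2")
install.packages("dplyr")
install.packages("caTools")

# lm() itself ships with base R (stats package), so no extra package is needed for the regression
library(caTools) # provides sample.split() for the train/test split
library(ggplot2) # plotting
library(dplyr)   # data manipulation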

Step 2 - Read a csv file and do EDA : Exploratory Data Analysis

The dataset attached contains the data of 160 different bags associated with ABC industries. The bags have certain attributes which are described below:

1. Height – The height of the bag
2. Width – The width of the bag
3. Length – The length of the bag
4. Weight – The weight the bag can carry
5. Weight1 – Weight the bag can carry after expansion

The company now wants to predict the cost they should set for a new variant of these kinds of bags.

data <- read.csv("R_220_Data_1.csv")
dim(data)         # returns the shape of the data, i.e. the total number of rows and columns
print(head(data)) # head() returns the first 6 rows of the data frame
summary(data)     # returns the statistical summary of the data columns
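As part of EDA it is also common to check column types and missing values before modelling. The sketch below uses a small made-up data frame (`toy` is hypothetical, with the same `Width` and `Cost` column names as the recipe) rather than the actual CSV.

```r
# Hypothetical data frame with a couple of missing values
toy <- data.frame(
  Width = c(4.1, 5.3, NA, 6.8),
  Cost  = c(200, 260, 310, NA)
)

str(toy)                             # column types and a preview of the values
na_counts <- colSums(is.na(toy))     # number of missing values per column
print(na_counts)

# One simple option: drop rows that contain any missing value
clean <- na.omit(toy)
print(dim(clean))
```

Whether dropping rows is appropriate depends on how much data is missing; it is shown here only as the simplest option.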

Step 3 - Plot a scatter plot between x and y

plot(data$Width, data$Cost) # plot() gives a visual representation of the relation between Width and Cost
cor(data$Width, data$Cost)  # correlation between the two variables
# The output is positive, so Cost tends to rise with Width; the closer the value is to 1, the stronger the linear relationship
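The correlation is directly tied to the regression that follows: in simple linear regression with an intercept, the squared Pearson correlation between x and y equals the R squared of the fitted line. A small sketch with invented numbers (the widths and costs below are hypothetical):

```r
# Hypothetical widths and costs, for illustration only
width <- c(10, 12, 15, 18, 20, 25)
cost  <- c(105, 118, 160, 170, 215, 240)

r   <- cor(width, cost)            # Pearson correlation
fit <- lm(cost ~ width)            # simple linear regression
r2  <- summary(fit)$r.squared      # R squared on the fitted data

print(r^2)
print(r2)   # matches r^2 up to floating point error
```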

Step 4 - Train and Test data

The training data is used to build the model, while the testing data is used to evaluate its predictions. The model is first fitted on the training dataset, which is where the errors are minimised, and is then used to make predictions on unseen data, which is the test dataset.

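set.seed(123) # optional: makes the random split reproducible
split <- sample.split(data$Cost, SplitRatio = 0.8) # pass a column, not the data frame, to get one TRUE/FALSE per row
split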

sample.split() returns a logical vector that marks which observations go where. With SplitRatio = 0.8, about 80% of the dataset is marked TRUE and goes to the training dataset, and the remaining 20% is marked FALSE and goes to the testing dataset.

train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)

The train dataset gets all the rows for which split is TRUE, and the test dataset gets all the rows for which it is FALSE.
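The same idea can be sketched in base R without caTools, which makes it easier to see what the logical vector is doing. The data frame `toy` and the 16/4 split are made up for illustration. The key point is that the mask needs exactly one TRUE/FALSE entry per row.

```r
set.seed(1) # reproducible shuffle

# Hypothetical data: 20 rows
toy <- data.frame(Width = 1:20, Cost = 3 * (1:20) + 5)

# Shuffle a vector of 16 TRUE and 4 FALSE values: an 80/20 split
mask <- sample(c(rep(TRUE, 16), rep(FALSE, 4)))

train_toy <- toy[mask, ]   # rows marked TRUE
test_toy  <- toy[!mask, ]  # rows marked FALSE

print(dim(train_toy))
print(dim(test_toy))
```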

dim(train) # dimension/shape of train dataset
print(head(train))
dim(test) # dimension/shape of test dataset
print(head(test))

Step 5 - Create a linear regression model

 

Here, a simple linear regression model is created with:

- y (dependent variable): Cost
- x (independent variable): Width

model <- lm(Cost ~ Width, data = train)

summary() prints the results of the trained model, including the coefficients, R squared and the residual standard error, which help us check how well the model fits the training data.

summary(model)
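The individual numbers printed by `summary()` can also be pulled out programmatically. The following sketch fits a model on a small invented data frame (`toy` is hypothetical) and extracts the pieces most often needed. Note that the residual standard error divides RSS by n - 2 rather than n, so it is close to, but not the same as, the training RMSE.

```r
# Hypothetical training data, for illustration only
toy <- data.frame(Width = c(2, 4, 6, 8, 10),
                  Cost  = c(11, 19, 32, 41, 48))

fit <- lm(Cost ~ Width, data = toy)
s   <- summary(fit)

print(coef(fit))         # intercept (c) and slope (m)
print(s$coefficients)    # estimates, standard errors, t values, p values
print(s$r.squared)       # R squared
print(s$adj.r.squared)   # adjusted R squared
print(s$sigma)           # residual standard error: sqrt(RSS / (n - 2))

# Training RMSE uses n in the denominator instead
train_rmse <- sqrt(mean(residuals(fit)^2))
print(train_rmse)
```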

Step 6 - Add regression line to the plot

# Scatter plot of Width against Cost
data.graph <- ggplot(data, aes(x = Width, y = Cost)) +
  geom_point()
data.graph

# Add the linear regression line to the plotted data
data.graph <- data.graph + geom_smooth(method = "lm", col = "black")
data.graph

Step 7 - Make predictions on the test dataset

y_pred <- predict(model,test) # predictions are made on the testing data set

The predicted values for Cost are:

y_pred
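Under the hood, each prediction is just intercept + slope × Width. The sketch below checks this on invented data (`train_toy` and `new_bags` are hypothetical) and also shows how `predict()` can return a prediction interval alongside each point estimate.

```r
# Hypothetical training data and new bags to price
train_toy <- data.frame(Width = c(2, 4, 6, 8, 10),
                        Cost  = c(11, 19, 32, 41, 48))
new_bags  <- data.frame(Width = c(5, 7))

fit <- lm(Cost ~ Width, data = train_toy)

# Predictions from predict() ...
pred <- predict(fit, newdata = new_bags)

# ... match the line equation applied by hand
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["Width"]] * new_bags$Width

print(pred)
print(manual)

# 95% prediction intervals: columns fit, lwr, upr
pred_int <- predict(fit, newdata = new_bags, interval = "prediction", level = 0.95)
print(pred_int)
```

The column name in `newdata` must match the predictor name used in the formula (`Width` here), otherwise `predict()` cannot find the variable.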

Step 8 - Finding RMSE

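# RMSE: compare predictions with the actual Cost values of the test set
rmse_val <- sqrt(mean((y_pred - test$Cost)^2))
rmse_val

# R squared on the test set
SSE <- sum((y_pred - test$Cost)^2)           # sum of squared errors
SST <- sum((test$Cost - mean(test$Cost))^2)  # total sum of squares of the actual values
r2_test <- 1 - SSE/SST
print(r2_test)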
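If these metrics are needed more than once, they can be wrapped in small helper functions. This is a minimal sketch: the function names `rmse` and `r_squared` and the four example values are assumptions made for illustration, not part of the recipe.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R squared: 1 - SSE / SST, with SST computed from the actual values
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Made-up values to exercise the helpers
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

On a test set, R squared computed this way can be negative if the model predicts worse than simply using the mean of the actual values.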
