How to do Linear Regression in Machine Learning in Python?

By Manika

Machine learning has evolved into a transformative force in today's data-driven world, enabling us to uncover hidden insights and make data-informed decisions. At the heart of this revolutionary field lies a multitude of machine learning algorithms, each designed to tackle a particular class of challenges. Among these algorithms, Linear Regression in Machine Learning stands as one of the simplest yet most powerful tools in a data scientist's toolkit.



Linear Regression provides a solid foundation for understanding the relationship between variables, making predictions, and extracting valuable information from data. Whether you've just stepped into the world of data science, are exploring predictive analytics, or are simply curious about how machines learn from data, understanding Linear Regression is a crucial first step.

In this blog, we will guide you through the intricacies of Linear Regression in Machine Learning. You will explore its fundamental principles, uncover the mathematics behind it, and see how it can be wielded as a powerful tool for predictive modeling. The blog will also delve into practical aspects, showing you how to run Linear Regression in Python and how to interpret linear regression coefficients and results. By the end, you should have the knowledge and confidence to harness the predictive capabilities of the Linear Regression model. So, let's dive in and demystify Linear Regression, one step at a time.

What is Linear Regression in Machine Learning?

Linear Regression is a fundamental and widely used statistical method in machine learning and data science. It's a supervised learning algorithm used for modeling the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data. The primary goal of linear regression is to find the best-fitting straight line (or hyperplane in higher dimensions) that minimizes the sum of squared differences between the predicted and actual values of the target variable.

Understanding Linear Regression in Machine Learning

Linear Regression possesses two fundamental attributes that make it unique in the world of machine learning. First and foremost, its simplicity is a double-edged sword. Its straightforwardness in both understanding and implementation makes it an ideal starting point for individuals venturing into the realm of machine learning, serving as a stepping stone for more complex techniques. Secondly, the interpretability it offers is invaluable. Linear regression provides clear insights into the relationships between variables. By examining the coefficients associated with each feature, one can easily discern and quantify the impact of these features on the target variable. This transparency fosters a deeper understanding of the underlying data and aids in making informed decisions.

Applications of Linear Regression in Machine Learning

Linear Regression finds use in myriad real-world scenarios. In economics, it helps analyze relationships between income and expenditure, GDP and unemployment rates, or interest rates and housing prices. Financial analysts rely on it for predicting stock prices, assessing portfolio risk, and uncovering factors influencing investment returns. In the field of medicine, Linear Regression aids in predicting patient outcomes based on medical parameters, tracking disease progression, or evaluating treatment effectiveness. These are just a few instances of Linear Regression's widespread utility, spanning disciplines as diverse as marketing, environmental science, engineering, social sciences, and geology. Here is an interesting insight by Sabyasachi Sengupta about the use of the linear regression model in the biological sciences.

Linear Regression Expert Tip by Sabyasachi Sengupta

Linear Regression Fundamentals

Linear Regression is a popular technique in the domain of machine learning and statistical analysis that is used for modeling and understanding the relationship between two or more variables. Before diving into the mathematical details of linear regression, it is important to understand the concepts of independent and dependent variables.

Independent and Dependent Variables

An independent variable, often denoted as 'X', is a variable whose values are controlled or chosen in an experiment or study. It is the variable one manipulates to produce an effect on another variable, the dependent variable. When it comes to linear regression, independent variables are the features or predictors used to make predictions or model an outcome.

On the other hand, the dependent variable, denoted as 'Y', is the variable we are trying to predict. It is the outcome or response that may be influenced by changes in the independent variable. In linear regression, the ultimate task is to find a relationship between the independent variables and the dependent variable that allows us to make predictions based on the values of the independent variables.

Let us now dig into the details of what linear regression assumes and try to understand it better.


Assumptions of Linear Regression

Here are some key assumptions associated with Linear Regression. These serve as formal criteria when constructing a Linear Regression model, ensuring optimal outcomes from the provided dataset.

  • Linear Relationship between Features and Target: The Linear Regression algorithm presupposes a linear connection between the dependent and independent variables, implying that changes in the latter result in proportional changes in the former.

  • Minimal or Absent Multicollinearity among Features: Multicollinearity denotes a high degree of correlation between independent variables. This phenomenon can complicate the identification of genuine relationships between predictors and the target variable. Therefore, the model assumes either limited or no multicollinearity among the features, allowing for clearer relationships to be established.

  • Homoscedasticity Assumption: Homoscedasticity refers to a scenario where the error term exhibits consistent variability across all values of independent variables. In cases of homoscedasticity, scatter plots should display an absence of discernible patterns or trends in data distribution.

  • Normal Distribution of Error Terms: Linear regression expects error terms to adhere to a normal distribution pattern. Deviations from normal distribution can lead to imprecise confidence intervals, potentially affecting coefficient estimation. A q-q plot can be employed to verify normality, with a straight-line plot indicating normally distributed errors.

  • Absence of Autocorrelation: The Linear Regression model assumes the absence of autocorrelation within the error terms. Autocorrelation arises when residual errors exhibit interdependencies, as is common in time-series data, and its presence can significantly reduce model accuracy.

Now, here is an important insight into Linear regression model by Venkat Raman that you must check out before we move ahead to the next section that dives into the mathematical details of the linear regression model.

Expert tip by Venkat Raman on Linear Regression in Machine Learning

How to Find Linear Regression Equation

Finding the linear regression equation involves choosing a model form and then optimizing (and possibly regularizing) its coefficients. To find the linear regression equation, we first need to understand the dataset and identify the independent variables and the dependent variable. Next, we need to pick an equation that will define the relationship between the variables; we will discuss three in this section: Simple Linear Regression, Ridge Linear Regression, and Lasso Linear Regression. We will then explore the different methods used to estimate the coefficients of the linear regression model. Finally, we will evaluate the model fit by calculating the residuals and error terms.

Linear Regression Model Representation

Let us discuss the three models: simple linear regression, ridge linear regression, and lasso linear regression.

1) Simple Linear Regression 

We use the following equation to express the linear relationship among the variables:

Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + ε

Y: Dependent variable (the variable you want to predict).

X1, X2, ... Xn: Independent variables (features).

β0: Intercept (the point where the regression line crosses the Y-axis).

β1, β2, ... βn: Coefficients (representing the impact of each independent variable).

ε (epsilon): Error term (the difference between the actual and predicted values).

Please note that if the linear regression model has more than one independent variable, it is referred to as a multiple linear regression model. Linear Regression, while powerful, can struggle with multicollinearity (high correlation between independent variables) and overfitting. Ridge and Lasso Regression techniques act as effective remedies, mitigating these issues by introducing regularization to stabilize and improve model performance. They add penalty terms to the linear regression objective function to control the magnitude of the coefficients. Let us discuss them both.
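
But first, here is a minimal sketch of fitting the plain linear model above with scikit-learn; the synthetic data and coefficient values are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))            # two features: X1, X2
# Y = b0 + b1*X1 + b2*X2 + noise, with b0=3.0, b1=2.0, b2=-1.5
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print("Intercept (beta_0):", round(model.intercept_, 2))
print("Coefficients (beta_1, beta_2):", model.coef_.round(2))
```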

2) Ridge Regression

Ridge Regression aims to minimize the sum of squared differences between predicted and actual values (similar to ordinary linear regression), but it adds a penalty term proportional to the square of the magnitude of the coefficients (L2 regularization):

Objective = Sum of Squared Residuals + α * Σ(β²)

α (alpha) is the regularization parameter that controls the strength of the penalty term. A higher α leads to smaller coefficients. Ridge Regression helps mitigate multicollinearity by encouraging coefficients to be spread out across correlated features, rather than favoring one over the others. It doesn't drive coefficients to exactly zero but reduces their magnitudes.

3) Lasso Regression

Lasso Regression also minimizes the sum of squared differences but adds a penalty term proportional to the absolute value of the coefficients (L1 regularization):

Objective = Sum of Squared Residuals + α * Σ|β|

α (alpha) is the regularization parameter, as in Ridge Regression. Lasso Regression not only addresses multicollinearity but also performs feature selection by driving some coefficients to exactly zero. This makes it useful when you suspect that only a subset of features truly influences the target variable.

Both Ridge and Lasso can be implemented using machine learning libraries like scikit-learn in Python. You would choose the regularization strength (α) based on cross-validation or other selection methods, and then fit the model to your data.
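
As a concrete illustration, here is a sketch using scikit-learn's RidgeCV and LassoCV, which pick α by cross-validation; the alpha grid and the synthetic dataset are assumptions for demonstration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LassoCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

alphas = np.logspace(-3, 3, 13)          # candidate regularization strengths
ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Ridge alpha:", ridge.alpha_)
print("Lasso alpha:", lasso.alpha_)
# Lasso typically drives some coefficients to exactly zero:
print("Zeroed Lasso coefficients:", int(np.sum(lasso.coef_ == 0)))
```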

Evaluating Coefficients

Evaluating the coefficients in the linear regression equation involves assessing their magnitude, sign, and statistical significance to understand their impact on the dependent variable and the overall linear model. We will dive into three popular methods used to estimate the coefficients of a linear regression model: Stochastic Gradient Descent, the Elastic Net equation, and the method of least squares.

1) Stochastic Gradient Descent

Regression via Stochastic Gradient Descent (SGD) is a popular optimization technique for training linear regression models, particularly in scenarios with large datasets having multiple features. Instead of computing gradients using the entire dataset (as in batch gradient descent), SGD updates the model's parameters using a single randomly chosen data point at each iteration. Here's how regression via SGD works:

Objective Function: In linear regression, the objective is to minimize the sum of squared differences between the predicted values and the actual target values. The cost function for linear regression can be written as:

Objective = Sum of Squared Residuals = Σ(Yi - Ŷi)²

Where:

Yi is the actual target value for data point i.

Ŷi is the predicted target value for data point i.

To find the optimal parameters (coefficients) for the linear regression model, SGD iteratively updates the coefficients by computing the gradient of the objective function with respect to each coefficient. The update rule for SGD for linear regression is as follows:

βj_new = βj_old - α * (-2 * Xij * (Yi - Ŷi))

Where:

βj_new is the new value of the j-th coefficient.

βj_old is the old value of the j-th coefficient.

α (alpha) is the learning rate, which controls the step size in the parameter updates.

Xij is the value of the j-th feature for data point i.

Yi is the actual target value for data point i.

Ŷi is the predicted target value for data point i.

Iterations: SGD repeats this update process for a specified number of iterations or until a convergence criterion is met. At each iteration, it randomly selects a single data point from the dataset and updates the coefficients accordingly.
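
To make the update rule concrete, here is a toy NumPy implementation of the procedure described above; the learning rate, iteration count, and synthetic data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + 4.0 + rng.normal(0, 0.1, size=500)

Xb = np.hstack([np.ones((500, 1)), X])   # prepend a 1s column for beta_0
beta = np.zeros(Xb.shape[1])             # start all coefficients at zero
alpha = 0.01                             # learning rate

for _ in range(20_000):
    i = rng.integers(len(y))             # pick one data point at random
    y_hat = Xb[i] @ beta                 # prediction for that point
    # beta_j_new = beta_j_old - alpha * (-2 * X_ij * (Y_i - Y_hat_i))
    beta += alpha * 2 * Xb[i] * (y[i] - y_hat)

print("Estimated coefficients (beta_0 first):", beta.round(2))
```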

2) Elastic Net Equation

Elastic Net is a linear regression technique that combines both L1 (Lasso) and L2 (Ridge) regularization methods to address some of their individual limitations. It performs linear regression with a penalty on the coefficients of the features. In Elastic Net, the linear regression objective is extended to include both L1 and L2 regularization terms, resulting in the following objective (cost) function:

Objective = Sum of Squared Residuals + λ * [α * Σ|β| + (1 - α) * Σβ²]

Sum of Squared Residuals: This is the same as in ordinary linear regression and represents the sum of the squared differences between the predicted and actual values.

α: A hyperparameter that controls the balance between L1 and L2 regularization. α = 1 corresponds to Lasso regression, α = 0 corresponds to Ridge regression, and values between 0 and 1 blend the two regularization techniques.

λ: The overall regularization parameter. It controls the combined strength of the L1 penalty on the absolute values of the coefficients and the L2 penalty on their squared values.
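
In scikit-learn the same model is available as ElasticNet, though the parameter names differ from the notation above: sklearn's alpha is the overall penalty strength (λ here), and l1_ratio is the mixing parameter (α here). A minimal sketch with illustrative values:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=8, noise=5.0,
                       random_state=1)

# alpha = overall strength (lambda above); l1_ratio = mix (alpha above).
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)

print("Coefficients:", enet.coef_.round(2))
print("Intercept:", round(float(enet.intercept_), 2))
```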

3) Method of Least Squares

We can use the method of least squares to estimate the coefficients (β0, β1, β2, ... βn). This method minimizes the sum of squared differences between the actual Y values and the predicted values based on the linear equation.

β1 = Σ((Xi - X̄)(Yi - Ȳ)) / Σ((Xi - X̄)²)

β0 = Ȳ - β1*X̄

Xi: Value of the independent variable.

X̄: Mean of the independent variable.

Yi: Value of the dependent variable.

Ȳ: Mean of the dependent variable.
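
These formulas are easy to verify with NumPy; the following sketch fits a single-feature model on synthetic data, so the recovered slope and intercept should land near the true values of 2.5 and 1.0.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1.0, 50)   # true slope 2.5, intercept 1.0

x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

print(f"beta_1 = {beta_1:.3f}, beta_0 = {beta_0:.3f}")
```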

Residuals and Error Terms

Once you have picked a model and estimated the coefficients of the linear regression equation, it is time to test the model fit by calculating the residuals, which are the differences between the actual Y data points and the Y values predicted using the estimated coefficients.

Residual (εi) = Yi - Ŷi

Yi: Actual value of the dependent variable.

Ŷi: Predicted value of the dependent variable based on the linear equation.

Next, we examine the distribution of residuals to assess the model's fit. Ideally, residuals should be normally distributed and exhibit no discernible patterns (homoscedasticity).
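
A short, self-contained sketch of this residual check; the data and model here are synthetic stand-ins for whichever model you trained.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, 100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred                      # epsilon_i = Y_i - Y_hat_i

# No visible pattern in this scatter suggests homoscedastic errors.
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()
```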

Linear Regression in Machine Learning with Example

To learn Linear Regression in Machine Learning effectively, it is crucial to understand how to implement the model on a real-world dataset for predicting values of an output variable. As an illustrative example, let's consider predicting house prices in Boston using Linear Regression. This involves gathering a dataset containing relevant input features like square footage, number of bedrooms, and neighborhood characteristics, along with corresponding house prices. So, let us get started with a machine learning project in Python that uses linear regression to predict house prices, taking cues from the GitHub repository of Bhavesh Bhat.

Dataset Selection

The first step of any machine learning project is to identify the business problem and find a relevant dataset. In this case, we will use the Boston Housing dataset, a commonly used dataset in machine learning for regression tasks, typically used to predict housing prices based on various features of Boston neighborhoods. Historically, you could access this dataset directly through scikit-learn, but the load_boston helper was deprecated in scikit-learn 1.0 and removed in version 1.2, so recent versions require loading it from its original source (as shown in the loader sketch below).

The dataset contains the following features (X):

CRIM: Per capita crime rate by town

ZN: Proportion of residential land zoned for large lots

INDUS: Proportion of non-retail business acres per town

CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX: Nitrogen oxide concentration (parts per 10 million)

RM: Average number of rooms per dwelling

AGE: Proportion of owner-occupied units built before 1940

DIS: Weighted distance to employment centers

RAD: Index of accessibility to radial highways

TAX: Property tax rate (per $10,000)

PTRATIO: Pupil-teacher ratio by town

B: Proportion of residents of African American descent

LSTAT: Percentage of lower status population

The target variable (y) is the median value of owner-occupied homes in thousands of dollars.

We will use this dataset to build and evaluate regression models that predict the prices of houses in Boston based on these features. It's a classic dataset for regression tasks and a great way to practice and learn about regression algorithms in machine learning. So, let us build a function in Python that loads the feature values and target values into variables X and Y respectively, along with the feature names in a separate variable named 'names'.

Function for Loading Boston Dataset in Python
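
Since the original code appears as an image, here is a sketch of what such a loader might look like; as noted above, load_boston has been removed from recent scikit-learn releases, so this version pulls the raw data from its original source (the URL and parsing follow the pattern scikit-learn's deprecation notice recommended).

```python
import numpy as np
import pandas as pd

def load_boston_data():
    """Load the Boston Housing data from its original source."""
    names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
             "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    url = "http://lib.stat.cmu.edu/datasets/boston"
    raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    # Each record spans two physical rows in the raw file, so the halves
    # are interleaved and must be stitched back together.
    X = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
    Y = raw.values[1::2, 2]   # MEDV: median home value in $1000s
    return X, Y, names

X, Y, names = load_boston_data()
print(X.shape, Y.shape)       # (506, 13) (506,)
```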

Data Preprocessing

When it comes to implementing data science projects in the practical world, you are highly unlikely to find a clean dataset that you do not have to preprocess before feeding it as an input to a machine learning algorithm. So, in this case as well, we implement a few data preprocessing techniques to prepare the dataset. Let us first import the necessary Python libraries.

Importing necessary libraries in Python

We will now scale the features for the ML algorithm implementation with the help of the StandardScaler class in the sklearn library. Next, we will split the dataset into train and test subsets, again with the help of the sklearn library.

Code for scaling data and splitting the dataset into training and testing subsets
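
A sketch of those two steps, assuming the X and Y variables from the loader above; the 80/20 split and the random seed are arbitrary choices.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, Y, test_size=0.2, random_state=42
)
```

Strictly speaking, fitting the scaler on the training split alone (and only transforming the test split) avoids leaking test-set statistics; the order above simply mirrors the walkthrough.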

Model Training

We will now learn how to import linear regression in Python (sklearn linear regression) and how to run linear regression in Python. We will implement the various linear regression models discussed in the previous section: ridge regression, lasso regression, and linear regression optimized using Elastic Net and Stochastic Gradient Descent. You can implement these models one by one and then analyze the results, or proceed the way we have.


Code for training linear regression machine learning model
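
Since the training code is shown as an image, here is a sketch of what it might look like; the hyperparameter values are illustrative defaults, not tuned choices.

```python
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                  ElasticNet, SGDRegressor)

# One instance of each model discussed above, keyed by a display name.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "SGD": SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
}

# Fit every model on the (scaled) training split from the previous step.
for name, model in models.items():
    model.fit(X_train, y_train)
```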

Model Evaluation

To evaluate the model performance, we will define two functions: pretty_print_linear and root_mean_square_error. 

Functions for evaluating model performance in Machine Learning

The pretty_print_linear function takes a list of coefficients, an optional list of feature names, and a sorting parameter. If feature names are not provided, it generates default names. The function pairs each coefficient with its corresponding feature name and, if specified, sorts them by absolute value. It then returns a formatted string representing a linear equation, with coefficients and feature names separated by " + " signs. This function is useful for presenting linear regression equations in a readable and organized manner.

The root_mean_square_error function calculates the root mean squared error (RMSE) between two sets of values, typically the predicted values (y_pred) and the actual target values (y_test). It computes the squared differences between corresponding elements, takes the average, and then calculates the square root of the result. RMSE is a measure of how well a regression model's predictions align with the actual data, providing insight into the model's accuracy.
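
Based on those descriptions, reconstructed sketches of the two helpers might look like the following (the exact signatures in the original notebook may differ slightly).

```python
import numpy as np

def pretty_print_linear(coefs, names=None, sort=False):
    """Format coefficients as a readable linear equation string."""
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]      # default names
    pairs = zip(coefs, names)
    if sort:
        pairs = sorted(pairs, key=lambda x: -np.abs(x[0]))  # by |coefficient|
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in pairs)

def root_mean_square_error(y_pred, y_test):
    """Root mean squared error between predictions and actual values."""
    return np.sqrt(np.mean((np.array(y_pred) - np.array(y_test)) ** 2))
```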

We’ll now use these functions to evaluate the model performance of linear regression machine learning algorithms.

Implementing functions for evaluating model performance

Output:

Equation for different linear regression models and root mean square error

Interpretation of Results: How to report Linear Regression results

The final step is to focus on how to plot linear regression in Python and visualize the results using graphs. Graphs provide a visual representation of model performance, allowing data scientists and stakeholders to quickly assess trends and patterns. They offer a clear and intuitive way to make informed decisions about model selection, tuning, and optimization in machine learning. So, let us plot one for each algorithm to visualize its performance. The green line in each plot represents the model's predicted values, and the red data points are the actual values from the test dataset.

Code for plotting graphs to test the model performance
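
A sketch of such a plotting loop, assuming the models dictionary, the train/test split, and the root_mean_square_error helper from the earlier sketches.

```python
import numpy as np
import matplotlib.pyplot as plt

for name, model in models.items():
    y_pred = model.predict(X_test)
    order = np.argsort(y_test)            # sort test points for a clean line
    plt.figure(figsize=(8, 4))
    plt.scatter(range(len(y_test)), np.asarray(y_test)[order],
                color="red", s=12, label="Actual (test set)")
    plt.plot(range(len(y_test)), y_pred[order],
             color="green", label="Predicted")
    plt.title("%s | RMSE = %.2f"
              % (name, root_mean_square_error(y_pred, y_test)))
    plt.legend()
    plt.show()
```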


Simple linear regression model plot

Lasso Regression Model Plot

Ridge Regression Model Plot 

Linear Regression using Elastic Net Model Plot

Linear Regression using Stochastic Gradient Descent Plot

Linear Regression in Machine Learning Projects

Working on projects in Linear Regression is essential for hands-on learning, as it bridges theory with practical implementation. Our list of prepared projects offers a diverse range of real-world applications, enabling you to gain valuable experience, sharpen your skills, and showcase your expertise to potential employers, all while addressing tangible, impactful problems.

Sales Forecasting

Sales forecasting is a critical task for businesses to manage inventory, plan resources, and optimize operations. In this project, historical sales data for a product or store is collected and analyzed. The data undergoes preprocessing, including handling time series aspects, creating lag features to capture trends, and encoding categorical variables like product categories or store locations. After splitting the dataset into training and testing sets, a Linear Regression model is trained using the training data. The model's performance is assessed using metrics like Mean Absolute Error (MAE) or Mean Absolute Percentage Error (MAPE). Ultimately, the trained model can be employed to make accurate sales predictions, aiding businesses in decision-making and inventory management.

Medical Diagnosis

Medical diagnosis using machine learning is a valuable application that can assist healthcare professionals in early detection and decision-making. For this project, medical data, such as patient records, test results, and clinical attributes, is collected and prepared. Data preprocessing steps include handling missing values and encoding categorical features related to diagnoses or patient demographics. The dataset is then divided into training and testing subsets. A Linear Regression model is trained to predict medical outcomes or conditions, and the model's performance is evaluated using relevant medical metrics, such as sensitivity, specificity, or the F1-score. The resulting model can be used for disease diagnosis or predicting health outcomes based on patient information.

Customer Churn Prediction

Customer churn, or the loss of customers, is a critical concern for businesses in various industries. In this project, customer data, including historical interactions, behaviors, and demographics, is collected and preprocessed. Data preprocessing involves handling missing values and encoding categorical variables like customer categories or subscription types. The dataset is then split into training and testing sets. A Linear Regression model is trained using the training data to predict the likelihood of customer churn (applying a threshold to the predicted likelihood yields churn/no-churn labels for classification metrics). Model performance is evaluated using metrics like accuracy, precision, recall, and F1-score. This predictive model can assist businesses in identifying at-risk customers and implementing retention strategies.

Loan Default Prediction

Loan default prediction is crucial for financial institutions to assess credit risk and make informed lending decisions. To implement this project, lending data, including information about borrowers, loan terms, and historical loan outcomes, is collected and preprocessed. Data preprocessing may involve handling missing values and encoding categorical variables such as loan types or borrower attributes. The dataset is then divided into training and testing sets. A Linear Regression model is trained using the training data to predict the probability of loan default. Model performance is evaluated using metrics like accuracy, precision, recall, and the area under the ROC curve (AUC-ROC). The model can help financial institutions make more accurate assessments of borrowers' creditworthiness.

Weather Forecasting

Weather forecasting is a classic application of linear regression in meteorology. In this project, historical weather data, including variables like temperature, humidity, precipitation, and atmospheric pressure, is collected and processed. Data preprocessing may involve handling missing values and encoding categorical variables for weather conditions (e.g., sunny, rainy, snowy). The dataset is split into training and testing subsets, taking into account the temporal aspect of weather data. A Linear Regression model is trained using the training data to predict future weather conditions. Model performance can be evaluated using metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). Accurate weather predictions have broad applications, from agriculture to disaster preparedness.

Learn More about Linear Regression in Machine Learning with ProjectPro

Now that you have learned all you need to know about linear regression, it is time to take your machine learning skills to the next level by exploring various algorithms and their practical applications. So, why wait and look anywhere else when you have ProjectPro? Dive into the world of data science and big data with ProjectPro, where you can access a treasure trove of solved projects meticulously crafted by industry experts and professionals. Whether you want to master Linear Regression or delve into other advanced techniques, ProjectPro offers a comprehensive platform for hands-on learning and real-world problem-solving. Explore our projects to enhance your expertise and embark on a journey of continuous growth in the field of data science and machine learning.

FAQs on Linear Regression in Machine Learning

1) What does linear regression tell you?

Linear regression provides insights into the relationship between a dependent variable and one or more independent variables by fitting a linear equation. It quantifies the impact of independent variables on the dependent variable, allowing for predictions and understanding how changes in the independent variables affect the outcome.

2) Is linear regression a good machine learning model?

Linear regression is a valuable machine learning model, especially for its simplicity and interpretability. It serves as a strong baseline model and is well-suited for scenarios where relationships between variables are approximately linear. However, its performance may be limited when dealing with complex, nonlinear data, where more advanced models could be more suitable.


About the Author

Manika

Manika Nagpal is a versatile professional with a strong background in both Physics and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in data science and writing to create engaging and insightful blogs that help businesses and individuals stay up-to-date with the
