Nikhil Adithyan

Step-by-step guide to Simple and Multiple Linear Regression in Python

Updated: Dec 14, 2020

Build and evaluate SLR and MLR machine learning models in Python



Linear Regression


Linear regression is a statistical model that examines the linear relationship between two (Simple Linear Regression) or more (Multiple Linear Regression) variables: a dependent variable and independent variable(s). A linear relationship means that when one (or more) of the independent variables changes, the dependent variable changes with it in a consistent, proportional way:



As you can see, a linear relationship can be positive (the independent variable goes up, the dependent variable goes up) or negative (the independent variable goes up, the dependent variable goes down).


Math part


A relationship between variables Y and X is represented by this equation:


Y = mX + b

In this equation, Y is the dependent variable, the variable we are trying to predict or estimate; X is the independent variable, the variable we are using to make predictions; m is the slope of the regression line, which represents the effect X has on Y. In other words, if X increases by 1 unit, Y is expected to change by m units. b is a constant, also known as the Y-intercept: if X equals 0, Y would be equal to b.


This leads us to Simple Linear Regression (SLR). In an SLR model, we build a model based on data: the slope and Y-intercept are derived from the data, and we don’t need the relationship between X and Y to be exactly linear. SLR models also account for the errors in the data, also known as residuals. A residual is simply the difference between the true value of Y and the predicted/estimated value of Y. It is important to note that in linear regression we are trying to predict a continuous variable. In a regression model, we try to minimize these errors by finding the “line of best fit”: the regression line for which the errors are smallest. In other words, we want the distance between the observed points and the regression line to be as close to zero as possible, which is equivalent to minimizing the mean squared error (MSE) or the sum of squared errors (SSE), also called the “residual sum of squares”.

In most cases, we will have more than one independent variable; it can be as few as two independent variables or up to hundreds (or, theoretically, even thousands) of variables. In those cases, we will use a Multiple Linear Regression model (MLR). The regression equation is pretty much the same as the simple regression equation, just with more variables:


Y = b0 + b1X1 + b2X2 + … + bnXn

Now, we are set to build and evaluate SLR and MLR models in Python.


Python Implementation 


There are two main ways to build a linear regression model in Python: using “Statsmodels” or “Scikit-learn”. In this article, we’ll build SLR and MLR models in both Statsmodels and Scikit-learn to predict the CO2 emissions of cars. Before building our models, it is necessary to import and process the data and identify the variables for our regression model.


Step — 1: Importing and Processing the Data
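A minimal sketch of this step, assuming the data is stored in a CSV file named “co2_emissions.csv” (the file name is an assumption, so adjust it to match your copy of the dataset):

```python
import pandas as pd  # data handling

# Load the dataset; the file name is an assumption, change it to match your copy
df = pd.read_csv('co2_emissions.csv')

# Take a first look at the data
print(df.head())
```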



Output:


Now, we have an idea of what our dataset is about. Next, it is necessary to have a look at a statistical summary of our dataset.
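A one-line sketch of this step, using pandas’ describe method on the same DataFrame:

```python
# Statistical summary: count, mean, std, min, quartiles, and max for each numeric column
print(df.describe())
```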


Output:


Now, we have a clear idea of both the structure and the statistical summary of our dataset. Next, we have to remove the character (non-numeric) columns, which may disrupt our regression model.
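A minimal sketch of this cleaning step; since the exact character columns vary between copies of the dataset, it simply keeps the numeric columns and drops everything else:

```python
# Drop the character (non-numeric) columns, e.g. make, model, and fuel type
df = df.select_dtypes(include='number')
print(df.head())
```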


So, we have cleaned and processed our data and we are now ready for some visualizations in order to find some linear relationships between variables.


Step — 2: Finding Linear Relationships 


With CO2 emissions as the dependent variable, we have to find positive or negative linear relationships by making scatter plots. These variables will then be used to build our SLR and MLR models. For statistical visualizations, it is best to make use of the seaborn library, so let’s import it.
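A small sketch of the imports used for the plots that follow (matplotlib is pulled in alongside seaborn to display the figures):

```python
import seaborn as sns            # statistical visualizations
import matplotlib.pyplot as plt  # to display the plots

sns.set_style('whitegrid')       # optional: a cleaner background for the scatter plots
```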


Output:


Now, we are going to plot each candidate independent variable against our dependent variable, which is CO2 emissions, to find linear relationships between them. Let’s do it in Python!


(i) Engine size / CO2 emissions:
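A sketch of the scatter plot for this pair of variables; the column names “Engine Size(L)” and “CO2 Emissions(g/km)” are assumptions, so swap in the names used by your copy of the dataset:

```python
# Scatter plot of engine size against CO2 emissions
sns.scatterplot(x='Engine Size(L)', y='CO2 Emissions(g/km)', data=df)
plt.show()
```

The plots in the next three subsections can be produced the same way, swapping the x column for the respective fuel-consumption column.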


Output:


By plotting the Engine size variable against our dependent variable, we can observe a positive linear relationship. Hence, we can take engine size as an independent variable for our model.


(ii) Fuel Consumption Comb (L/100 km) / CO2 emissions:


Output:


Similar to Engine size, Fuel Consumption Comb (L/100 km) also shows a positive linear relationship. Hence, it can be taken as an independent variable for our model.


(iii) Fuel Consumption Hwy (L/100 km) / CO2 emissions:


Output:


As Fuel Consumption Hwy (L/100 km) plotted against CO2 emissions reveals a positive relationship, it can also be used as an independent variable for building our model.


(iv) Fuel Consumption City (L/100 km) / CO2 Emissions:


Output:


Like the variables above, Fuel Consumption City (L/100 km), when plotted against CO2 emissions, shows a positive linear relationship. So, this can also be considered as an independent variable for our model.


Now, we have four independent variables that can be used to train and build our regression model. Without wasting a moment, let’s build our machine learning model in Python!


Step — 3: SLR Model


To build a Simple Linear Regression (SLR) model, we must have an independent variable and a dependent variable. For our SLR model, we are going to take Engine size as the independent variable and undoubtedly CO2 emissions as the dependent variable. Let’s define our variables in Python:
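A minimal sketch of the variable definitions; the column names are assumptions based on a typical CO2 emissions dataset:

```python
# Independent variable (engine size) and dependent variable (CO2 emissions)
X = df[['Engine Size(L)']]
y = df['CO2 Emissions(g/km)']
```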


As mentioned before, we will be building the model using statsmodels first, followed by scikit-learn.


(i) Statsmodels:
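A sketch of this step, assuming the X and y variables defined above; the variable names follow the explanation below:

```python
import statsmodels.api as sm

slr_model = sm.OLS(y, X)   # Ordinary Least Squares model (no constant term added here)
slr_reg = slr_model.fit()  # fit the model to the data
```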


Code Explanation: First, we import our primary package, “statsmodels.api”. Next, we define a variable “slr_model” to store our Ordinary Least Squares (OLS) model, and finally, we store the fitted model in a variable “slr_reg”.


Now let’s see the results of our model’s performance.
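The results can be inspected with the summary method of the fitted model:

```python
print(slr_reg.summary())  # full regression results table
```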


Output:


When analyzing the results summary, we can see that the R-squared of the model is 0.943 (94.3%), which suggests that our model fits the data well and could be used for real-world prediction problems.


(ii) Scikit-learn:


Just as we used the OLS model in statsmodels, with scikit-learn we are going to use the LinearRegression model along with the “train_test_split” function to build and evaluate our model. Let’s do it in Python!
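A sketch of this step; the variable names, the 30% test size, and random_state = 0 follow the explanation below:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data: 70% for training, 30% for testing; random_state fixes the seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LinearRegression()    # the linear regression model
lr.fit(X_train, y_train)   # fit on the training data
yhat = lr.predict(X_test)  # predictions on the test data
```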


Code Explanation: Firstly, we import our primary packages, “LinearRegression” and “train_test_split”. Using the train_test_split function, we split the data into a training set and a testing set, with the test set being 30% of the original dataset. Inside train_test_split, I’ve passed “random_state = 0”, which fixes the random seed so that the train/test split is reproducible every time the code is run. Next, we store our linear model in the variable “lr” and fit it to the training data. Finally, we store the predicted values in the variable “yhat”.


Now, to check the accuracy of our scikit-learn model, we are going to extract the fitted slope and intercept, and then calculate the R-squared value of the model. Let’s do it in Python!
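A short sketch of extracting the fitted slope and intercept from the scikit-learn model:

```python
print('Slope     :', lr.coef_)       # estimated slope m
print('Intercept :', lr.intercept_)  # estimated intercept b
```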


Output:


Now let’s calculate the R-squared value of our scikit-learn model. Follow the code below to calculate the R-squared value:
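A sketch using scikit-learn’s r2_score helper on the test-set predictions:

```python
from sklearn.metrics import r2_score

print('R-squared :', r2_score(y_test, yhat))
```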


Output:

R-squared : 0.7162770226132333 

We can notice that the R-squared value of the scikit-learn model is different from that of the statsmodels model. This is because we didn't add a constant (intercept) term to the independent variable in the statsmodels model, so statsmodels reports an uncentered R-squared, which tends to be higher. In the upcoming MLR model, we will add a constant term to the independent variables in statsmodels.

We have successfully created our SLR model using both the statsmodels package and the scikit-learn package. Now let’s dive into building the Multiple Linear Regression (MLR) model.


MLR Model


To build a Multiple Linear Regression (MLR) model, we must have more than one independent variable and a dependent variable. For our MLR model, we are going to take four independent variables and undoubtedly CO2 emissions as the dependent variable. Let’s define our variables in Python:
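A minimal sketch of the MLR variable definitions; the four column names are assumptions, so adjust them to your dataset:

```python
# Four independent variables and the dependent variable
X = df[['Engine Size(L)',
        'Fuel Consumption Comb (L/100 km)',
        'Fuel Consumption Hwy (L/100 km)',
        'Fuel Consumption City (L/100 km)']]
y = df['CO2 Emissions(g/km)']
```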



Remember that adding more and more independent variables to the model might result in “overfitting”. In our CO2 data, we have only a small number of attributes, but in the case of huge datasets, we must be more cautious about picking independent variables. So, it is highly recommended to choose only independent variables that are relevant to the dependent variable.


(i) Statsmodels:
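A sketch of the statsmodels MLR fit; the variable names “mlr_model” and “mlr_reg” are assumptions chosen to mirror the SLR model:

```python
import statsmodels.api as sm

X_const = sm.add_constant(X)    # add the constant (intercept) term to the independent variables
mlr_model = sm.OLS(y, X_const)  # Ordinary Least Squares model
mlr_reg = mlr_model.fit()       # fit the model to the data
```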



You can notice that we have added a constant term to our independent variables. Now that we have fitted our model, let’s view the results summary.
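As with the SLR model, a minimal way to print it:

```python
print(mlr_reg.summary())  # full regression results table
```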


Output:


When analyzing the results summary, we can notice that the R-squared of the model is 0.874 (87.4%), and this value is computed with the constant (intercept) term included. So we can say that this model can be used to solve real-world cases.


(ii) Scikit-learn


The code implementation and the functions used are the same as for the SLR model; we just add the extra attributes to the independent variable.
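A sketch reusing the imports and split settings from the SLR section; the variable name “mlr” is an assumption:

```python
# Same 70/30 split and fixed seed as before, now with four independent variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

mlr = LinearRegression()    # the multiple linear regression model
mlr.fit(X_train, y_train)   # fit on the training data
yhat = mlr.predict(X_test)  # predictions on the test data
```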


To check the accuracy of the scikit-learn model, we can calculate the R-squared score, and we can also introduce a new method: a distribution plot. Firstly, let’s calculate the R-squared value in Python:
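The calculation mirrors the SLR version, reusing r2_score on the new test-set predictions:

```python
print('R-Squared :', r2_score(y_test, yhat))
```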


Output:

R-Squared : 0.8655946234480003  

We can observe that the R-squared value of the scikit-learn model is very close to that of the statsmodels model, whose value is 0.87. This is because we added a constant term to the independent variables while building the statsmodels MLR model.


The second method to check the accuracy of the MLR scikit-learn model is to construct a distribution plot that overlays the predicted values and the actual values. Follow the code below to produce the distribution plot:
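A sketch using seaborn’s kdeplot to overlay the two distributions (the original approach may differ, for example using the older distplot function):

```python
# Overlay the distributions of the actual and predicted CO2 emissions
ax = sns.kdeplot(y_test, color='r', label='Actual values')
sns.kdeplot(yhat, color='b', label='Predicted values', ax=ax)
plt.title('Actual vs predicted CO2 emissions')
plt.legend()
plt.show()
```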


Output:


The distribution plot reveals that our predicted values track the actual values quite closely, though some deviations can be noticed. This is expected, since we have built a fairly basic linear regression model rather than one tuned to predict the outcomes precisely.


Final Thoughts! 


We have successfully run through a whole set of steps to build and evaluate SLR and MLR models in Python, and of course, we have achieved our goal. Apart from SLR and MLR, there is much more to discover in Linear Regression, like Polynomial and Non-polynomial regression, Ridge regression, and so on. In this article, we have evaluated our models using just a few methods, but there are more to dive into. Also, the math behind Linear Regression is an ocean of formulas. Even though there are powerful packages in Python to deal with the formulas, you can’t always depend on them; learning and gaining good insight into the math will be worthwhile. I hope this article helps you, and never stop learning. If you missed any of the code sections, don’t worry, I’ve provided the full code below.


Happy Machine Learning!


Full code:


