Linear regression in Python with Scikit-learn (With examples, code, and notebook)
Scikit-learn is a handy and robust library with efficient tools for machine learning. It provides a variety of supervised and unsupervised machine learning algorithms. The library is written largely in Python and is built on NumPy and SciPy, and it works well alongside Pandas and Matplotlib. In this tutorial, we will discuss linear regression with Scikit-learn.
What is linear regression?
Linear regression is a type of predictive analysis that attempts to predict the value of a dependent variable from one or more independent variables. It estimates the coefficients of a linear equation that best predicts the dependent variable, fitting a straight line or surface that minimizes the differences between the predicted and the actual output values.
Linear regression assumptions
For successful linear regression, four assumptions must be met. They include:
- There should be a linear relationship between the independent and dependent variables.
- The residuals should be independent, with no correlations between them.
- Residuals should have a constant variance at every level of x.
- Residuals of the model should be normally distributed.
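Some of these assumptions can be checked numerically once a model has been fitted. The sketch below is only an illustration on synthetic data: the data, the seed, and the choice of the Durbin-Watson statistic (independence) and the Shapiro-Wilk test (normality) are our own assumptions, not part of the tutorial's datasets.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data (made up for illustration): y depends linearly on x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# Independence: the Durbin-Watson statistic is close to 2 when residuals are uncorrelated
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Normality: Shapiro-Wilk test; a large p-value gives no evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print('Durbin-Watson statistic:', durbin_watson)
print('Shapiro-Wilk p-value:', shapiro_p)
```

Linearity and constant variance are usually judged visually, for example by plotting the residuals against the fitted values and looking for curvature or a funnel shape.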
Installing Scikit-learn
To install Scikit-learn, ensure that you have NumPy (see our NumPy tutorial) and SciPy installed. SciPy can be installed with pip or conda like below:
Install SciPy with pip:
pip install scipy
Install SciPy with conda:
conda install scipy
Once NumPy and SciPy are installed, proceed to install Scikit-learn with the following commands:
Install via pip:
pip install -U scikit-learn
Install with conda:
conda install -c conda-forge scikit-learn
Models provided by Scikit-learn
Scikit-learn provides tools for:
- Supervised learning algorithms.
- Unsupervised learning algorithms.
- Clustering.
- Cross-validation.
- Dimensionality reduction.
- Feature extraction.
- Feature selection.
Scikit-learn modeling process
Modeling in Scikit-learn involves the following steps:
- Loading datasets.
- Splitting the dataset.
- Training the model.
We will learn more about modeling in Scikit-learn later in the article.
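As a quick preview, the three steps can be sketched end to end. This is a minimal illustration that uses synthetic data from `make_regression` as a stand-in for a real dataset; the sample size, noise level, and seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Load a dataset (here: synthetic data standing in for a real dataset)
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

# 2. Split it into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# 3. Train the model on the training subset
model = LinearRegression()
model.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
```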
Loading datasets
Scikit-learn provides example datasets, such as the iris and digits datasets used for classification, and the California housing and Ames housing datasets for regression.
A dataset has two components, namely:
- Features are the variables of the data. They are collected in the feature matrix, commonly represented with 'X', and the feature names list the name of each feature in that matrix.
- Response/target is the output variable. It is stored in the response column (response vector, commonly represented with 'y'), and the target names describe what that column holds. We usually have one response column.
Let's load the California housing dataset:
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target
feature_names = housing_data.feature_names
target_names = housing_data.target_names
print('Feature names: ', feature_names)
print('\nTarget names: ', target_names, '(Median house value for households)')
print("\nFirst 5 rows of X:\n", X[:5])
print('\nShape of dataset', X.shape)
Splitting datasets
Splitting the dataset is crucial in determining the accuracy of a model. If we were to train the model on the entire dataset and then predict the response for the same data, problems like overfitting would go unnoticed, and the measured accuracy would be overly optimistic. For this reason, we have to split the data into training and testing sets. This way, we can use the training set to train the model and the test set to test the model.
Scikit-learn provides a train_test_split function for splitting the datasets into train and test subsets.
We will split the California housing dataset in the ratio of 70:30, where 70% will be the training set and 30% for the testing set.
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
From the example, in the train_test_split function, we have provided:
- X and y, representing the feature matrix and response vector, respectively.
- test_size, which represents the proportion of the total data to use as test data (0.3 for 30%).
- random_state, a seed for the random shuffling, so that the split is the same every time the code runs.
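The effect of fixing the seed can be seen on a tiny example. The sketch below uses made-up toy arrays and calls `train_test_split` twice with the same `test_size` and `random_state` to show that both calls produce the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with one feature each
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Same random_state -> identical splits on every call
X_train_a, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
X_train_b, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
same_split = np.array_equal(X_test_a, X_test_b)

print('Test rows:', len(X_test_a), 'of', len(X))
print('Identical test sets:', same_split)
```

Leaving `random_state` unset gives a different shuffle on each run, which makes results harder to reproduce.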
Training the model
After the dataset is split, we need to train a prediction model. At this stage, we choose an appropriate estimator class from Scikit-learn. For this example, we will use the LinearRegression class.
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1) # split the data
# importing the LinearRegression class
from sklearn.linear_model import LinearRegression
regressor = LinearRegression() # instantiate the Linear Regression model
regressor.fit(X_train, y_train) # training the model
# expose the model to new values and predict the target vector
y_predictions = regressor.predict(X_test)
print('Predictions:', y_predictions)
# get the coefficients and intercept
print("Coefficients:\n", regressor.coef_)
print('Intercept:\n', regressor.intercept_)
Linear regression
As we mentioned above, linear regression is a supervised machine learning algorithm that models the relationship between a dependent variable and one or more independent variables. Linear regression establishes the relationship between these variables by fitting the best fit line, also called the regression line.
Ordinary Least Squares (OLS) regression is a common technique for estimating the coefficients of linear regression equations. It fits the linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset and the predicted targets.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Where:
- y is the dependent variable.
- x is the independent variable (predictor variable).
- a0 is the intercept of the line.
- a1 is the linear regression coefficient (scale factor to each input value).
- ε is the random error.
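For a single predictor, the OLS estimates of a0 and a1 have a well-known closed form: a1 is the covariance of x and y divided by the variance of x, and a0 is the mean of y minus a1 times the mean of x. The sketch below computes them by hand on a few made-up points and compares the result with Scikit-learn's fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up sample: five (x, y) pairs lying roughly on a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for a single predictor
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

# The same fit with scikit-learn (X must be 2-D: one column per feature)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print('Closed form :', a0, a1)
print('scikit-learn:', model.intercept_, model.coef_[0])
```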
The LinearRegression class from the linear_model module is used to implement linear regression. LinearRegression takes the following parameters:
- fit_intercept: A boolean value. When set to True (the default), the intercept is calculated; otherwise, no intercept is used and the fitted line passes through the origin.
- copy_X: Default is True; thus, X is copied rather than overwritten.
- n_jobs: Number of jobs to use for the computation.
- positive: If set to True, the coefficients are forced to be non-negative.
The LinearRegression class also has the following attributes:
- coef_: gives the estimated coefficients.
- intercept_: gives the independent term in the linear model.
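To see what `fit_intercept` and `positive` do in practice, here is a small sketch on made-up data generated from a known linear equation. The data, seed, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y = 4 + 2*x1 - 3*x2 plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

default_model = LinearRegression().fit(X, y)                    # fit_intercept=True
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)  # fit forced through the origin
non_negative = LinearRegression(positive=True).fit(X, y)        # coefficients constrained to be >= 0

print('default     :', default_model.coef_, default_model.intercept_)
print('no intercept:', no_intercept.coef_, no_intercept.intercept_)
print('positive    :', non_negative.coef_, non_negative.intercept_)
```

The default model should recover coefficients close to 2 and -3, while the constrained model cannot represent the negative coefficient and so fits the data less well.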
There are two types of linear regression:
- Simple linear regression uses only one independent variable to predict a dependent variable.
- Multiple linear regression is an extension of simple linear regression with multiple independent variables to predict a dependent variable.
Building a simple linear regression model with Scikit-learn
With the basics out of the way, let's look at how to build a simple linear regression model in Scikit-learn. We kick off by loading the dataset.
Loading a sample dataset
Load the Student study hours dataset from Kaggle. The dataset is a CSV file with two columns, Hours and Scores. We will use it to build a simple linear regression model to predict the Scores(dependent/target variable) based on the number of Hours(independent variable) a student takes to study.
import pandas as pd
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.head()
Before building the linear regression model, we must first understand the data:
stud_scores.shape
From the info() method, we can see that both Hours and Scores are numeric, which is crucial for linear regression.
The describe() method will give a statistical overview of the data. For instance, we can see that the average study time of a student is 5 hours, the minimum score is 17, and the maximum score is 95.
Check if the data has missing values:
We have no missing values.
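The inspection calls mentioned in this section can be sketched as follows. The DataFrame here is a small stand-in with made-up rows, not the actual Kaggle file, so the printed numbers will differ from those quoted above.

```python
import pandas as pd

# Stand-in for student_scores.csv; these rows are made up for illustration
stud_scores = pd.DataFrame({
    'Hours': [2.5, 5.1, 3.2, 8.5, 3.5],
    'Scores': [21, 47, 27, 75, 30],
})

print(stud_scores.shape)          # (rows, columns)
stud_scores.info()                # column names, dtypes, non-null counts
print(stud_scores.describe())     # count, mean, std, min, quartiles, max

# Count missing values per column
missing_per_column = stud_scores.isnull().sum()
print(missing_per_column)
```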
Check for correlation and plot a heatmap:
Checking for correlation helps us understand the relationship between the variables. There is a strong positive correlation between Hours and Scores. Below is a heatmap of the correlation with Seaborn:
import seaborn as sns
sns.heatmap(stud_scores.corr(), annot=True)
We can also plot a scatter plot to determine whether linear regression is the ideal method for predicting the Scores based on the Hours of study:
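import seaborn as sns
sns.set_style('dark') # set the style before plotting so that it applies to this figure
# Hours is the independent variable (x-axis); Scores is the dependent variable (y-axis)
sns.relplot(x='Hours', y='Scores', data=stud_scores,
            height=3.8, aspect=1.8, kind='scatter')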
There is a linearly increasing relationship between the dependent and independent variables; thus, linear regression is a suitable model for the prediction.
Splitting the data
Scikit-learn provides a train_test_split function for splitting the datasets into train and test subsets. We will split our dataset in the ratio of 70:30.
Create the feature matrix(X) and the response vector(y):
X = stud_scores.iloc[:,:-1].values # feature matrix
y = stud_scores.iloc[:,1].values # response vector
Split the data:
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Fitting simple linear regression
Import the LinearRegression class from the linear_model module to train the model. Instantiate an object of the class named regressor.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
The regressor object is also called an estimator. An estimator is any object that fits a model based on some training data and is capable of inferring some properties on new data. All estimators implement a fit method.
The fit method takes the training data as arguments. We use this method to estimate the parameters of a model. For instance, we have passed in X_train and y_train, the independent and dependent variables. The model learns the relationship between the predictor and target variables.
Getting the coefficient/slope and the intercept
Now that we have fitted the model, we can check the slope and the intercept of the simple linear fit.
Coefficient:
regressor.coef_
The coefficient shows that, on average, the score increased by approximately 10.41 points for every hour the student studied.
Intercept:
regressor.intercept_
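To make the meaning of the slope and the intercept concrete, the sketch below fits a model on a few made-up (hours, score) pairs and checks two things: predictions one hour apart differ by the slope, and the prediction at zero hours equals the intercept. The numbers are illustrative and not taken from the Kaggle dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up study data (hours studied, score obtained)
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = np.array([20.0, 31.0, 39.0, 52.0, 60.0])

model = LinearRegression().fit(hours, scores)

# Predictions one hour apart differ by exactly the slope
pred_4h, pred_5h = model.predict(np.array([[4.0], [5.0]]))
slope = model.coef_[0]

# With zero hours, the prediction equals the intercept
pred_0h = model.predict(np.array([[0.0]]))[0]

print('Slope:', slope, '| difference between 5h and 4h:', pred_5h - pred_4h)
print('Intercept:', model.intercept_, '| prediction at 0h:', pred_0h)
```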
Linear regression model fit line
The Seaborn regplot function enables us to visualize the linear fit of the model. It draws a scatter plot of the variables and then fits and plots a linear regression line. By default, the regression line is plotted with a 95% confidence interval; here we pass ci=None to hide it.
sns.regplot(x='Hours', y='Scores', data=stud_scores, ci=None, scatter_kws={'s':100, 'facecolor':'red'})
Predicting test set result
At this point, the model is now trained and ready to predict the output of new observations. Remember, we split our dataset into train and test sets. We will provide test sets to the model and check its performance.
y_pred = regressor.predict(X_test)
y_pred
Comparing the test values and the predicted values:
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
comparison_df
Checking the residuals:
residuals = y_test - y_pred
residuals
Comparing the test data and the predicted values with a scatter plot:
import matplotlib.pyplot as plt
sns.scatterplot(x=y_test, y=y_pred, s=140) # ci has no effect on a scatter plot, so it is omitted
plt.xlabel('y_test data')
plt.ylabel('Predictions')
The points fall roughly along a straight line, which shows that the predictions track the actual values closely and that the model is acceptable.
Evaluating linear regression models
There are various metrics that we can use to evaluate linear regression models. Since no model is 100 percent accurate, evaluating the model on different metrics can help us optimize the performance, fine-tune it, and obtain better results. The metrics we can use include:
- Mean Absolute Error (MAE) is the average absolute difference between the actual and predicted values. We sum the absolute values of all the prediction errors and divide by the total number of data points.
To get the MAE of a model, import the mean_absolute_error function from the sklearn.metrics module.
from sklearn.metrics import mean_absolute_error
print('MAE:', mean_absolute_error(y_test,y_pred))
- Mean Squared Error (MSE): This is one of the most widely used metrics. It is the average squared difference between the actual and predicted values. We sum the squares of all the prediction errors and divide by the number of data points.
To get the MSE of the model, import the mean_squared_error function from the sklearn.metrics module.
from sklearn.metrics import mean_squared_error
print("MSE",mean_squared_error(y_test,y_pred))
- Root Mean Squared Error(RMSE) is the square root of MSE.
Since MSE is calculated by the square of error, the square root brings it back to the same level of prediction error. We need the NumPy square root function to compute it.
import numpy as np
print("RMSE",np.sqrt(mean_squared_error(y_test,y_pred)))
- R Squared (R2): R2 is also called the coefficient of determination or goodness-of-fit score. It measures the proportion of the variance in the dependent variable that the model can explain. The R2 value usually lies between 0 and 1, and a bigger value shows a better fit between the predicted and actual values; it can be negative when the model fits worse than always predicting the mean.
From the sklearn.metrics module, import the r2_score function and find the goodness of fit of the model.
from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)
The model has a pretty good score, meaning it was excellent in predicting the Scores.
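The four metrics are simple enough to compute by hand, which is a useful check on what the library functions return. The sketch below uses a small made-up pair of arrays (named `y_true` and `y_hat` here to avoid clashing with the tutorial's variables) and compares the manual results with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                   # mean of absolute errors
mse = np.mean(errors ** 2)                      # mean of squared errors
rmse = np.sqrt(mse)                             # back on the scale of y
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print('MAE :', mae, mean_absolute_error(y_true, y_hat))   # 0.5 both ways
print('MSE :', mse, mean_squared_error(y_true, y_hat))    # 0.375 both ways
print('RMSE:', rmse)
print('R2  :', r2, r2_score(y_true, y_hat))
```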
Remarks on the model
We can conclude that the simple linear model we built works fine in predicting the Scores based on the Hours of study since the errors were relatively low and the R2 score was high.
Complete code for the simple linear regression
# import all the modules we will need
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# NOTE: methods from sklearn.metrics module can be imported as one liner like:
# from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
# LOADING DATASET
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.head()
stud_scores.info()
stud_scores.describe()
# CREATING FEATURE MATRIX AND RESPONSE VECTOR
X = stud_scores.iloc[:,:-1].values # feature matrix
y = stud_scores.iloc[:,1].values # response vector
# SPLITTING THE DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# FITTING LINEAR REGRESSION MODEL / TRAINING
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# GETTING THE COEFFICIENTS AND INTERCEPT
print('Coefficients: ', regressor.coef_)
print('Intercept: ',regressor.intercept_)
# PREDICTION OF TEST RESULT
y_pred = regressor.predict(X_test)
print('Predictions:\n', y_pred)
# COMPARING TEST DATA AND PREDICTED DATA
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
print('Actual test data vs predicted: \n', comparison_df)
# EVALUATING MODEL METRICS
print('MAE:', mean_absolute_error(y_test,y_pred))
print("MSE",mean_squared_error(y_test,y_pred))
print("RMSE",np.sqrt(mean_squared_error(y_test,y_pred)))
r2 = r2_score(y_test,y_pred)
print('Model Score: ', r2)
# FITTING LINEAR REGRESSION LINE
sns.regplot(x='Hours', y='Scores', data=stud_scores, ci=None,
scatter_kws={'s':100, 'facecolor':'red'})
Aside:
Before we can explore multiple linear regression, there are certain concepts that we need to understand as they will be essential in knowing how to carry out multiple linear regression perfectly. These concepts are:
- Multicollinearity in regression analysis.
- Dummy variables and Dummy variable trap.
Multicollinearity in regression analysis
Multicollinearity in regression analysis occurs when two or more predictors or independent variables are highly correlated such that they do not give unique or independent information in the regression model. The independent variables should be independent; thus, if the degree of correlation between them is high, problems can occur when fitting and interpreting the regression model.
For instance, if we wanted to predict the maximum vertical jump of an athlete with predictor variables like shoe size and height, we would encounter high collinearity between the shoe size and the height, as tall people tend to have larger shoe sizes.
The main goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. We interpret regression coefficients as the mean change in the dependent variable for each 1 unit change in an independent variable when all other independent variables are constant.
So if we cannot change the value of a given predictor variable without changing another predictor variable, then there is a problem caused by high collinearity.
The most common way to detect multicollinearity is by using the variance inflation factor (VIF), which measures how strongly each predictor variable is correlated with the other predictor variables in a regression model.
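The VIF of a predictor is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below implements that definition with a hypothetical helper, `variance_inflation_factors`, on made-up data in which shoe size closely tracks height. A common rule of thumb (an assumption here, and thresholds vary by source) treats values above roughly 5 to 10 as a sign of problematic collinearity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """Hypothetical helper: VIF of each column of X, computed as 1 / (1 - R^2),
    where R^2 comes from regressing that column on all the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up predictors: shoe size tracks height closely, age is unrelated
rng = np.random.default_rng(0)
height = rng.normal(175, 10, size=200)
shoe_size = 0.25 * height + rng.normal(0, 0.5, size=200)
age = rng.normal(30, 5, size=200)
X = np.column_stack([height, shoe_size, age])

vifs = variance_inflation_factors(X)
print(dict(zip(['height', 'shoe_size', 'age'], vifs.round(2))))
```

Height and shoe size should show large VIF values, while age stays close to 1.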
Dummy variables and Dummy variable trap
Dummy variables
Typically, we carry out linear regression with numeric data, which is easy to handle. However, sometimes we may want to use categorical data as predictor variables, for example, Gender (male, female).
Categorical data cannot be used directly for regression and needs to be transformed into numeric data. The solution is to use dummy variables: variables created for regression analysis that take on one of two values, zero or one.
Take, for example, this dataset:
If we were to use the State category, we would need to convert it into some dummy variable:
To do this, Scikit-learn provides the LabelEncoder and OneHotEncoder utility classes.
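As a minimal sketch of what one-hot encoding produces, the example below encodes a made-up State column with `OneHotEncoder`. The DataFrame is hypothetical and only meant to show the shape of the result.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; only the State column matters here
df = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['State']]).toarray()  # sparse matrix -> dense array

# One 0/1 column per category, in the order stored in encoder.categories_
dummies = pd.DataFrame(encoded, columns=encoder.categories_[0])
print(dummies)
```

Each row has a single 1 in the column matching its state and 0 everywhere else.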
Dummy variable trap
A dummy variable trap is a scenario where the dummy variables are highly correlated (multicollinear), so that the value of one variable can be predicted from the values of the others.
To avoid it, the number of dummies we create should equal K-1, where K represents the number of different values the categorical variable can take. When we use the OneHotEncoder utility class with its default settings, we get K dummies, one of which can be predicted from the others, so we exclude it.
For instance, in the sample dataset above, the State category can take 3 values (California, Florida, and New York). So we will create K-1 = 3-1 = 2 dummy variables to get the following result:
Using all the dummy variables for regression results in the dummy variable trap!
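The trap can be shown numerically. In the sketch below (made-up data, using `pandas.get_dummies` for brevity), the full set of K dummies always sums to 1 in every row, which duplicates the intercept column, so the design matrix has more columns than its rank. Dropping one dummy removes the redundancy.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with K = 3 distinct values
states = pd.Series(['New York', 'California', 'Florida', 'California', 'Florida', 'New York'])

all_dummies = pd.get_dummies(states).astype(float)                 # K = 3 columns
k_minus_1 = pd.get_dummies(states, drop_first=True).astype(float)  # K - 1 = 2 columns

# With all K dummies, every row sums to 1, duplicating the intercept column
intercept = np.ones((len(states), 1))
rank_all = np.linalg.matrix_rank(np.hstack([intercept, all_dummies.to_numpy()]))
rank_k_minus_1 = np.linalg.matrix_rank(np.hstack([intercept, k_minus_1.to_numpy()]))

print('Columns with all dummies:', 1 + all_dummies.shape[1], '| rank:', rank_all)
print('Columns with K-1 dummies:', 1 + k_minus_1.shape[1], '| rank:', rank_k_minus_1)
```

A rank lower than the number of columns means the coefficients are not uniquely determined, which is exactly the problem the K-1 rule avoids.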
Building a multiple linear regression model with Scikit-learn
This section will focus on multiple independent variables to predict a single target.
Since we have p predictor variables, we can represent multiple linear regression with the equation below:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Where:
- Y: The response variable.
- Xj: The jth predictor variable.
- βj: The average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
- ε: The error term.
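Once the coefficients are estimated, the equation can be evaluated directly. The sketch below fits a model on made-up data with p = 3 predictors and confirms that the intercept plus the weighted sum of the predictors reproduces what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with p = 3 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# Y_hat = b0 + b1*X1 + b2*X2 + b3*X3, written as a matrix product
manual_predictions = model.intercept_ + X @ model.coef_

print('Estimated b0:', model.intercept_)
print('Estimated b1..b3:', model.coef_)
print('Matches predict():', np.allclose(manual_predictions, model.predict(X)))
```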
Loading dataset
We will load the 50 startups dataset from Kaggle. The dataset is a CSV file with data collected from around 50 business startups in New York, California, and Florida, roughly 17 in each state. The variables used in the dataset are Profit, R&D Spending, Administration Spending, Marketing Spending, and State. Our main goal is to predict the profits.
Let's first import all the modules we will need:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
Load the dataset using Pandas.
startups_df = pd.read_csv('50_Startups.csv')
startups_df.head()
Info:
We have five variables where four of them have continuous data, and one is categorical. We will convert the categorical variable later.
Overview of statistical data:
Correlation and heatmap:
Extracting independent variables(X) and dependent(y)
Since we are predicting profits, the profit variable will be the dependent variable, and the rest will be the independent variables.
X = startups_df.iloc[:, :-1] # independent variables
y = startups_df.iloc[:, -1] # dependent variable
X.head()
Encoding dummy variables
As we can observe, we have a State column with categorical data. We need to assign dummy variables to it as we cannot directly use categorical data for regression. We will use the OneHotEncoder utility class from Scikit-learn's sklearn.preprocessing module.
One hot encoding:
# creating instance of one-hot-encoder
enc = OneHotEncoder(drop='first') # drop the first dummy variable (K-1)
enc_df = pd.DataFrame(enc.fit_transform(X[['State']]).toarray())
enc_df.columns = ['Florida', 'New York']
# merge with main df on key values
X = X.join(enc_df)
X.head()
We can now drop the State column:
X = X.drop('State', axis=1)
X.head()
Splitting data into train and test sets
As before, we will split the data with the train_test_split function from Scikit-learn, this time in the ratio of 80:20.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
(40, 5)
(10, 5)
(40,)
(10,)
'''
Fitting the multiple linear regression to the training set
Create an object of linear regression and train the model with the training datasets.
regressor = LinearRegression() # Instantiate the LinearRegression object
regressor.fit(X_train, y_train) # fit the model
Getting the coefficients and the intercept
Getting the coefficients enables us to form an estimated multiple regression model. Let's have a look:
regressor.coef_
regressor.intercept_
Predicting test results
Predict the output of new observations with the trained model:
y_pred = regressor.predict(X_test)
y_pred
Compare the actual values and predicted values:
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
comparison_df
Checking the residuals:
residuals = y_test - y_pred
residuals
When we compare the first actual value, $103282, and its predicted value, $103015, the difference (residual) is approximately $267, which is small relative to the values themselves, showing that our model is working fine.
Compare the actual values and predicted values with a scatter plot:
import matplotlib.pyplot as plt
sns.scatterplot(x=y_test, y=y_pred, s=140) # ci has no effect on a scatter plot, so it is omitted
plt.xlabel('y_test data')
plt.ylabel('Predictions')
Evaluating the model
We can check the goodness of fit or the score of our model with the R2 (r2_score) metric.
score = r2_score(y_test, y_pred)
score
The score is 0.93, which is close to 1, indicating that the model explains most of the variation in profit on the test set.
Complete code for multiple linear regression
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
startups_df = pd.read_csv('50_Startups.csv')
startups_df.head()
startups_df.info()
startups_df.describe()
startups_df.shape
startups_df.corr(numeric_only=True) # skip the non-numeric State column
sns.heatmap(startups_df.corr(numeric_only=True), annot=True)
# SPLITTING THE DATA INTO INDEPENDENT AND DEPENDENT VARIABLES(X, y)
X = startups_df.iloc[:, :-1] # independent variables
y = startups_df.iloc[:, -1] # dependent variable
# USING OneHotEncoder TO CONVERT CATEGORICAL DATA TO DUMMY VARIABLES
# creating instance of one-hot-encoder
enc = OneHotEncoder(drop='first') # drop the first dummy variable (K-1)
enc_df = pd.DataFrame(enc.fit_transform(X[['State']]).toarray())
enc_df.columns = ['Florida', 'New York']
# merge with main df on key values
X = X.join(enc_df)
X = X.drop('State', axis=1) # DROP THE State COLUMN
X.head()
# SPLITTING DATA FOR train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# FITTING THE MODEL/TRAIN
regressor = LinearRegression() # Instantiate the LinearRegression object
regressor.fit(X_train, y_train) # fit the model
print('Coefficients: ', regressor.coef_)
print('Intercept: ',regressor.intercept_)
# PREDICTING
y_pred=regressor.predict(X_test)
# predictions
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
print(comparison_df)
# RESIDUALS
residuals = y_test - y_pred
print('Residuals: ', residuals)
# CHECKING THE SCORE OF THE MODEL WITH R^2 METRIC
score = r2_score(y_test, y_pred)
print('Model Score: ', score, 'Equal to: ', score * 100, '%')
Final thoughts
We can perform other types of regression, and we will discuss them in later articles. In this article, we have focused on understanding linear regression and how to perform it with the Scikit-learn library. In summary, we have covered:
- The modeling process of Scikit-learn.
- Simple and multiple linear regression.
- Loading datasets and understanding them before modeling.
- How to split, train, test, and evaluate our linear regression models.
- What multicollinearity is in regression.
- Problems that might occur when building multiple linear regression models, like the dummy variable trap.
- Assumptions of linear regression.
- Basic plots with Seaborn relevant to the tasks.