# Linear regression in Python with Scikit-learn (With examples, code, and notebook)

Scikit-learn is a handy and robust library with efficient tools for machine learning. It provides a variety of supervised and unsupervised machine learning algorithms. The library is written largely in Python and is built on NumPy and SciPy, and it works well alongside pandas and Matplotlib. In this tutorial, we will discuss linear regression with Scikit-learn.

## What is linear regression?

Linear regression is a type of predictive analysis that attempts to predict the value of a dependent variable from one or more independent variables. It estimates the coefficients of a linear equation that best predicts the dependent variable, fitting a straight line (or, with several predictors, a plane or hyperplane) that minimizes the discrepancy between the predicted and actual output values.

## Linear regression assumptions

For successful linear regression, four assumptions must be met. They include:

• There should be a linear relationship between the independent and dependent variables.
• The residuals should be independent, with no correlations between them.
• Residuals should have a constant variance at every level of x.
• Residuals of the model should be normally distributed.
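As a rough sketch, some of these assumptions could be checked numerically once a model is fitted. The data below is synthetic and generated only for illustration; the Durbin-Watson statistic is computed by hand, and the Shapiro-Wilk test comes from `scipy.stats`. None of this is part of the tutorial's datasets.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y is linear in x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Independence: a Durbin-Watson statistic near 2 suggests little autocorrelation
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Constant variance: compare the residual spread in the lower and upper halves of x
order = np.argsort(x)
low_std = residuals[order[:100]].std()
high_std = residuals[order[100:]].std()

# Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print("Durbin-Watson:", durbin_watson)
print("Residual std (low x, high x):", low_std, high_std)
print("Shapiro-Wilk p-value:", shapiro_p)
```

These numbers are only indicators; residual plots are usually inspected alongside them.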

## Installing Scikit-learn

To install Scikit-learn, ensure that you have NumPy and SciPy installed. SciPy can be installed with `pip` or `conda` as shown below:

Install Scipy with `pip`:

``pip install scipy``

Install Scipy with `conda`:

``conda install scipy``

Once NumPy and SciPy are installed, proceed to install Scikit-learn with one of the following commands:

Install via `pip`:

``pip install -U scikit-learn``

Install with `conda`:

``conda install -c conda-forge scikit-learn``

## Models provided by Scikit-learn

Scikit-learn provides models and supporting tools, including:

• Unsupervised learning algorithms.
• Clustering.
• Cross-validation.
• Dimensionality reduction.
• Feature extraction.
• Feature selection.

## Scikit-learn modeling process

Modeling in Scikit-learn involves the following steps:

• Splitting the dataset.
• Training the model.

Scikit-learn provides example datasets, such as `iris` and `digits` for classification, and the `California housing dataset` and the `Ames housing dataset` for regression.

A dataset has two components, namely:

• Features are the variables of the data. They are collected in the feature matrix, commonly represented with 'X', and the feature names list the name of each column in that matrix.
• Response/target is the output variable, stored in the response vector, commonly represented with 'y'. Target names represent the possible values of the response column. We usually have one response column.

Let's load the `California housing dataset`:

``````from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target

feature_names = housing_data.feature_names
target_names = housing_data.target_names

print('Feature names: ', feature_names)
print('\nTarget names: ', target_names, '(Median house value for households)')
print("\nFirst 5 rows of X:\n", X[:5])
print('\nShape of dataset', X.shape)``````

### Splitting datasets

Splitting the dataset is crucial in determining the accuracy of a model. If we trained the model on the whole dataset and then predicted the response for that same data, problems like overfitting would go unnoticed and the measured accuracy would be overly optimistic. For this reason, we split the data into training and testing sets. This way, we can use the training set to train the model and the test set to evaluate it on data it has not seen.

Scikit-learn provides a `train_test_split` function for splitting the datasets into train and test subsets.

We will split the California housing dataset in the ratio of 70:30, where 70% will be the training set and 30% for the testing set.

``````from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)``````

From the example, in the `train_test_split` function, we have provided:

• `X` and `y`, representing the feature matrix and response vector, respectively.
• `test_size`, the fraction of the total data reserved for the test set (0.3 for 30%).
• `random_state`, a seed that makes the split reproducible, so we get the same train and test sets every time the code runs.
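To see what fixing the seed does in practice, here is a small sketch on made-up arrays (not the housing data). Calling `train_test_split` twice with the same seed should produce identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays, made up for illustration: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.30, random_state=1)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.30, random_state=1)

print(np.array_equal(X_test_a, X_test_b))   # both calls select the same test rows
print(X_train_a.shape, X_test_a.shape)
```

Leaving the seed out means each run shuffles differently, so metrics can vary slightly from run to run.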

### Training the model

After the dataset is split, we need to train a prediction model. At this stage, we choose a model from the appropriate estimator class in Scikit-learn. For this example, we will use the `LinearRegression` class.

``````from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
X = housing_data.data # represent the feature matrix
y = housing_data.target # represent the response vector/target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1) # split the data

# importing the LinearRegression class
from sklearn.linear_model import LinearRegression

regressor = LinearRegression() # instantiate the Linear Regression model
regressor.fit(X_train, y_train) # training the model

# expose the model to new values and predict the target vector
y_predictions = regressor.predict(X_test)
print('Predictions:', y_predictions)
# get the coefficients and intercept
print("Coefficients:\n", regressor.coef_)
print('Intercept:\n', regressor.intercept_)``````

## Linear regression

As we mentioned above, linear regression is a supervised machine learning algorithm that models the relationship between a dependent variable and one or more independent variables. It establishes this relationship by fitting the best-fit line, also called the regression line.

Ordinary Least Squares (OLS) regression is a common technique for estimating the coefficients of a linear regression equation. It fits a linear model with coefficients w = (w1, …, wp) that minimize the residual sum of squares between the observed targets in the dataset and the predicted targets.

Mathematically, we can represent a linear regression as:

``````y = a0 + a1x + ε
``````

Where:

• y is the dependent variable.
• x is the independent variable (predictor variable).
• a0 is the intercept of the line.
• a1 is the linear regression coefficient (the slope, that is, the scale factor applied to each input value).
• ε is the random error.
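For a single predictor, the OLS estimates of a0 and a1 have a well-known closed form: a1 is the sum of products of the deviations of x and y from their means divided by the sum of squared deviations of x, and a0 is the mean of y minus a1 times the mean of x. The sketch below applies that formula to made-up numbers and compares the result with `LinearRegression`; the data is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for a single predictor
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

# The same fit through Scikit-learn (the feature matrix must be two-dimensional)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("Closed form: ", a0, a1)
print("Scikit-learn:", model.intercept_, model.coef_[0])
```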

The `LinearRegression` class in the `sklearn.linear_model` module is used to implement linear regression. `LinearRegression` takes the following parameters:

• `fit_intercept`: A boolean value (default True). When set to True, the intercept is calculated; otherwise, no intercept is calculated.
• `copy_X`: Default is True; thus, X is copied rather than overwritten.
• `n_jobs`: Number of jobs to use for computation.
• `positive`: If set to True, the coefficients are forced to be non-negative.

The `LinearRegression` class also has the following attributes:

• `coef_`: gives the estimated coefficients.
• `intercept_`:  gives the independent term in the linear model.
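The effect of `fit_intercept` and `positive` is easier to see side by side. The following is a minimal sketch on synthetic data (the true coefficients 2.0 and -0.5 are chosen arbitrarily for the demo).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 1 + 2*x1 - 0.5*x2 (values chosen for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

default_model = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)
non_negative = LinearRegression(positive=True).fit(X, y)

print(default_model.coef_, default_model.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)   # intercept_ stays at 0.0
print(non_negative.coef_)                            # no coefficient falls below zero
```

With `positive=True`, the second coefficient cannot follow the data to a negative value, so the constrained fit is worse here; the option is meant for problems where negative coefficients make no physical sense.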

There are two types of linear regression:

• Simple linear regression uses only one independent variable to predict a dependent variable.
• Multiple linear regression is an extension of simple linear regression with multiple independent variables to predict a dependent variable.

## Building a simple linear regression model with Scikit-learn

With the basics out of the way, let's look at how to build a simple linear regression model in Scikit-learn. We kick off by loading the dataset.

Load the Student study hours dataset from Kaggle. The dataset is a CSV file with two columns, Hours and Scores. We will use it to build a simple linear regression model to predict the Scores (dependent/target variable) based on the number of Hours (independent variable) a student spends studying.

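``````import pandas as pd

# The file name below is a placeholder; use the path to your downloaded copy of the CSV
stud_scores = pd.read_csv('student_scores.csv')``````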

Before building the linear regression model, we must first understand the data:

``stud_scores.shape``

From the `info()` method, we can see that both Hours and Scores are numeric, which is crucial for linear regression.

The `describe()` method will give a statistical overview of the data. For instance, we can see that the average study time of a student is 5 hours, the minimum score is 17, and the maximum score is 95.

Check if the data has missing values:

We have no missing values.
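The inspection steps described here can be sketched as follows. Because the Kaggle file is not bundled with this article, the snippet builds a small synthetic stand-in with the same two column names; `demo_scores` and its values are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the study-hours data (NOT the Kaggle file)
rng = np.random.default_rng(0)
hours = rng.uniform(1, 9, size=25).round(1)
scores = (10 * hours + rng.normal(0, 5, size=25)).round()
demo_scores = pd.DataFrame({"Hours": hours, "Scores": scores})

demo_scores.info()                    # column dtypes and non-null counts
print(demo_scores.describe())         # count, mean, std, min, quartiles, max
print(demo_scores.isnull().sum())     # number of missing values per column
```

On the real dataset, the same three calls would be made on `stud_scores`.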

Check for correlation and plot a heatmap:

Checking for correlation helps us understand the relationship between the variables. There is a strong positive correlation between Hours and Scores. Below is a heatmap of the correlation with Seaborn:

``````import seaborn as sns
sns.heatmap(stud_scores.corr(), annot=True)``````

We can also plot a scatter plot to determine whether linear regression is the ideal method for predicting the Scores based on the Hours of study:

``````import seaborn as sns

sns.set_style('dark') # set the style before plotting so that it takes effect
sns.relplot(x='Hours', y='Scores', data=stud_scores,
            height=3.8, aspect=1.8, kind='scatter')``````

There is a linearly increasing relationship between the dependent and independent variables; thus, linear regression is a suitable model for the prediction.

### Splitting the data

Scikit-learn provides a `train_test_split` function for splitting the datasets into train and test subsets. We will split our dataset in the ratio of 70:30.

Create the feature matrix(X) and the response vector(y):

``````X = stud_scores.iloc[:,:-1].values # feature matrix
y = stud_scores.iloc[:,1].values # response vector``````

Split the data:

``````# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)``````

### Fitting simple linear regression

Import the `LinearRegression` class from the `linear_model` module to train the model. Instantiate an object of the class named `regressor`.

``````from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)``````

The regressor object is also called an estimator. An estimator is any object that fits a model based on some training data and is capable of inferring some properties on new data. All estimators implement a `fit` method.

The `fit` method takes the training data as its arguments. We use this method to estimate the parameters of a model. For instance, we have passed in `X_train` and `y_train`, the independent and dependent variables. The model learns the relationship between the predictor and target variables.
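One detail that often trips people up is that `fit` expects the feature matrix to be two-dimensional, even with a single feature. The sketch below uses made-up numbers to show the reshape and what happens without it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration
hours = np.array([1.5, 3.0, 4.5, 6.0, 7.5])
scores = np.array([20.0, 35.0, 48.0, 62.0, 78.0])

X = hours.reshape(-1, 1)      # shape (5, 1): five samples, one feature
regressor = LinearRegression()
fitted = regressor.fit(X, scores)

print(X.shape)
print(fitted is regressor)    # fit returns the estimator itself

# Passing the one-dimensional array directly is rejected
try:
    LinearRegression().fit(hours, scores)
except ValueError as err:
    print("1-D X rejected:", type(err).__name__)
```

In the tutorial code, `stud_scores.iloc[:, :-1].values` already yields a two-dimensional array, so no reshape is needed there.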

### Getting the coefficient/slope and the intercept

Now that we have fitted the model, we can check the slope and the intercept of the simple linear fit.

Coefficient:

``regressor.coef_``

The coefficient shows that, on average, the score increased by approximately 10.41 points for every additional hour the student studied.

Intercept:

``regressor.intercept_``
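The slope and intercept are all that is needed to reproduce a prediction by hand. Here is a small sketch on made-up numbers; `new_hours` is a hypothetical input, not a value from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([15.0, 24.0, 36.0, 44.0, 56.0])

regressor = LinearRegression().fit(X, y)

new_hours = 6.5  # hypothetical study time
manual = regressor.intercept_ + regressor.coef_[0] * new_hours
from_predict = regressor.predict([[new_hours]])[0]

print(manual, from_predict)   # the two values agree
```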

### Linear regression model fit line

The Seaborn `regplot` function enables us to visualize the linear fit of the model. It draws a scatter plot of the variables and then fits and plots a linear regression line. By default, the line is drawn with a 95% confidence interval; passing `ci=None`, as we do here, turns the interval off.

``sns.regplot(x='Hours', y='Scores', data=stud_scores, ci=None, scatter_kws={'s':100, 'facecolor':'red'})``

### Predicting test set result

At this point, the model is now trained and ready to predict the output of new observations. Remember, we split our dataset into train and test sets. We will provide test sets to the model and check its performance.

``````y_pred = regressor.predict(X_test)
y_pred``````

Comparing the test values and the predicted values:

``````comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
comparison_df``````

Checking the residuals:

``````residuals = y_test - y_pred
residuals``````

Comparing the test data and the predicted values with a scatter plot:

``````import matplotlib.pyplot as plt
sns.scatterplot(x=y_test, y=y_pred, s=140)
plt.xlabel('y_test data')
plt.ylabel('Predictions')``````

The points fall roughly along a straight line, which suggests that the predictions track the actual values reasonably well.

### Evaluating linear regression models

There are various metrics in place that we can use to evaluate linear regression models. Since models can't be 100 percent efficient, evaluating the model on different metrics can help us optimize the performance, fine-tune it, and obtain better results. The metrics we can use include:

• Mean Absolute Error (MAE) is the average absolute difference between the actual and predicted values. We sum the absolute prediction errors and divide by the total number of data points.

To get the MAE of a model, import the `mean_absolute_error` function from the `sklearn.metrics` module.

``````from sklearn.metrics import mean_absolute_error
print('MAE:', mean_absolute_error(y_test,y_pred))``````

• Mean Squared Error (MSE): This is one of the most used metrics. It averages the squared differences between actual and predicted values. We sum the squares of all prediction errors and divide by the number of data points.

To get the MSE of the model, import the `mean_squared_error` function from the `sklearn.metrics` module.

``````from sklearn.metrics import mean_squared_error
print("MSE",mean_squared_error(y_test,y_pred))``````

• Root Mean Squared Error (RMSE) is the square root of MSE.

Since MSE is computed from squared errors, taking the square root brings the value back to the same units as the target. We need the NumPy square root function to compute it.

``````import numpy as np
print("RMSE",np.sqrt(mean_squared_error(y_test,y_pred)))``````

• R Squared (R2): R2 is also called the coefficient of determination or the goodness-of-fit score. It measures how much of the variance in the dependent variable the model can explain. The R2 value is at most 1 and usually falls between 0 and 1 (it can be negative when a model fits worse than simply predicting the mean); a bigger value shows a better fit between the predicted and actual values.

From the `sklearn.metrics` module, import the `r2_score` function, and find the goodness of fit of the model.

``````from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)``````

The model has a pretty good score, meaning it was excellent in predicting the Scores.
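To make the definitions above concrete, the four metrics can be computed directly with NumPy and compared with the Scikit-learn functions. The actual and predicted values below are made up for the demonstration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([20.0, 27.0, 69.0, 30.0, 62.0])
y_hat = np.array([17.0, 34.0, 75.0, 27.0, 60.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))      # mean of absolute errors
mse = np.mean(errors ** 2)         # mean of squared errors
rmse = np.sqrt(mse)                # back in the units of the target
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(rmse)
print(r2, r2_score(y_true, y_hat))
```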

### Remarks on the model

We can conclude that the simple linear model we built works fine in predicting the Scores based on the Hours of study since the errors were relatively low and the R2 score was high.

## Complete code for the simple linear regression

``````# import all the modules we will need
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# NOTE: methods from sklearn.metrics module can be imported as one liner like:
# from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

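# LOADING THE DATA (the file name is a placeholder; use the path to your downloaded CSV)
stud_scores = pd.read_csv('student_scores.csv')
stud_scores.info()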
stud_scores.describe()

# CREATING FEATURE MATRIX AND RESPONSE VECTOR
X = stud_scores.iloc[:,:-1].values # feature matrix
y = stud_scores.iloc[:,1].values # response vector

# SPLITTING THE DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# FITTING LINEAR REGRESSION MODEL / TRAINING
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# GETTING THE COEFFICIENTS AND INTERCEPT
print('Coefficients: ', regressor.coef_)
print('Intercept: ',regressor.intercept_)

# PREDICTION OF TEST RESULT
y_pred = regressor.predict(X_test)
print('Predictions:\n', y_pred)

# COMPARING TEST DATA AND PREDICTED DATA
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
print('Actual test data vs predicted: \n', comparison_df)

# EVALUATING MODEL METRICS
print('MAE:', mean_absolute_error(y_test,y_pred))
print("MSE",mean_squared_error(y_test,y_pred))
print("RMSE",np.sqrt(mean_squared_error(y_test,y_pred)))
r2 = r2_score(y_test,y_pred)
print('Model Score: ', r2)

# FITTING LINEAR REGRESSION LINE
sns.regplot(x='Hours', y='Scores', data=stud_scores, ci=None,
scatter_kws={'s':100, 'facecolor':'red'})``````

Aside:

Before we can explore multiple linear regression, there are certain concepts we need to understand, as they are essential to carrying out multiple linear regression correctly. These concepts are:

• Multicollinearity in regression analysis.
• Dummy variables and Dummy variable trap.

## Multicollinearity in regression analysis

Multicollinearity in regression analysis occurs when two or more predictors or independent variables are highly correlated such that they do not give unique or independent information in the regression model. The independent variables should be independent; thus, if the degree of correlation between them is high, problems can occur when fitting and interpreting the regression model.

For instance, if we wanted to predict the maximum vertical jump of an athlete with predictor variables like shoe size and height, we would encounter high collinearity between shoe size and height, as tall people tend to have larger shoe sizes.

The main goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. We interpret regression coefficients as the mean change in the dependent variable for each 1 unit change in an independent variable when all other independent variables are constant.

So if we cannot change the value of a given predictor variable without changing another predictor variable, then there is a problem caused by high collinearity.

The most common way to detect multicollinearity is by using the variance inflation factor (VIF), which measures how strongly each predictor variable is correlated with the other predictor variables in a regression model.
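The VIF of a predictor is commonly defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below implements that definition with `LinearRegression`; the helper name `variance_inflation_factors` and the synthetic athlete data are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(X):
    """Return one VIF per column of the 2-D array X (hypothetical helper)."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # R^2 of predicting column j from the remaining columns
        r_squared = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic predictors: shoe size is built to track height; training hours are independent
rng = np.random.default_rng(0)
height = rng.normal(180, 10, size=200)
shoe_size = 0.25 * height + rng.normal(0, 0.5, size=200)
training_hours = rng.normal(10, 2, size=200)

X = np.column_stack([height, shoe_size, training_hours])
vifs = variance_inflation_factors(X)
print(vifs)   # large for height and shoe size, near 1 for training hours
```

A commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the threshold is a convention rather than a hard rule.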

## Dummy variables and Dummy variable trap

### Dummy variables

Typically, we carry out linear regression with numeric data, which is easy to handle. However, sometimes we may use categorical data as predictor variables, for example, Gender (male, female).

Categorical data cannot be used directly for regression and needs to be transformed into numeric data. The solution is to use dummy variables: variables created for regression analysis that take on one of two values, zero or one.

Take, for example, this dataset:

If we were to use the State category, we would need to convert it into some dummy variable:

To do this, Scikit-learn provides the `LabelEncoder` and `OneHotEncoder` utility classes.

### Dummy variable trap

A dummy variable trap is a scenario where the dummy variables are highly correlated (multicollinear), so that one variable can be predicted from the others.

The number of dummies we have to create equals K-1, where K represents the number of different values the categorical variable can take. When we one-hot encode with the `OneHotEncoder` utility class, one of the K resulting variables can always be predicted from the remaining ones, so we can exclude it and keep K-1.

For instance, in the sample dataset above, the State category can take 3 values (California, Florida, and New York). So we will create K-1 = 3-1 = 2 dummy variables to get the following result:

Using all the dummy variables for regression results in the dummy variable trap!

## Building a multiple linear regression model with Scikit-learn

This section will focus on multiple independent variables to predict a single target.

Since we have p predictor variables, we can represent multiple linear regression with the equation below:

``Y = β0 + β1X1 + β2X2 + … + βpXp + ε``

Where:

• Y: The response variable.
• Xj: The jth predictor variable.
• βj: The average effect on Y of a one unit increase in Xj, holding all other predictors fixed.
• ε: The error term.
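The equation above can also be estimated directly with least squares on a design matrix that carries a leading column of ones for β0. The following sketch uses synthetic data with arbitrarily chosen true coefficients and checks that NumPy's `lstsq` and `LinearRegression` agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Y = 4 + 1.5*X1 - 2*X2 + 0.5*X3 + noise (values chosen for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.2, size=200)

# Prepend a column of ones so that β0 is estimated together with β1..βp
design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

model = LinearRegression().fit(X, y)
print(beta_hat)                       # [β0, β1, β2, β3]
print(model.intercept_, model.coef_)  # same estimates from Scikit-learn
```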

We will load the 50 startups dataset from Kaggle. The dataset is a CSV file with data collected from 50 business startups in New York, California, and Florida, with roughly 17 in each state. The variables used in the dataset are Profit, R&D spending, Administration spending, Marketing spending, and State. Our main goal is to predict the profits.

Let's first import all the modules we will need:

``````import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score``````

``````startups_df = pd.read_csv('50_Startups.csv')``````

Info:

We have five variables where four of them have continuous data, and one is categorical. We will convert the categorical variable later.

Overview of statistical data:

Correlation and heatmap:

### Extracting independent variables(X) and dependent(y)

Since we are predicting profits, the profit variable will be the dependent variable and the rest will be the independent variables.

``````X = startups_df.iloc[:, :-1]    # independent variables
y = startups_df.iloc[:, -1]     # dependent variable``````

### Encoding dummy variables

As we can observe, we have a State column with categorical data. We need to assign dummy variables to it, as we cannot directly use categorical data for regression. We will use the `OneHotEncoder` utility class from Scikit-learn's `sklearn.preprocessing` module.

One hot encoding:

``````# creating an instance of the one-hot encoder
enc = OneHotEncoder(drop='first') # drop the first dummy variable (K-1)

enc_df = pd.DataFrame(enc.fit_transform(X[['State']]).toarray())
enc_df.columns = ['Florida', 'New York']
# join with the main DataFrame on the index
X = X.join(enc_df)``````

We can now drop the State column:

``````X = X.drop('State', axis=1)``````

### Splitting data into train and test sets

As before, we will split the data with the `train_test_split` function from Scikit-learn, this time in the ratio of 80:20.

``````X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

'''
(40, 5)
(10, 5)
(40,)
(10,)
'''``````

### Fitting the multiple linear regression model to the training set

Create an object of linear regression and train the model with the training datasets.

``````regressor = LinearRegression() # instantiate the LinearRegression object
regressor.fit(X_train, y_train) # fit the model``````

### Getting the coefficients and the intercept

Getting the coefficients enables us to form an estimated multiple regression model. Let's have a look:

``````regressor.coef_
regressor.intercept_``````

### Predicting test results

Predict the output of new observations with the trained model.

Compare the actual values and predicted values:

``````y_pred = regressor.predict(X_test) # predict on the test set

comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
comparison_df``````

Checking the residuals:

``````residuals = y_test - y_pred
residuals``````

When we compare the first actual value, \$103282, with the predicted \$103015, the difference (residual) is approximately \$267, which is small relative to the profit, showing that our model is working fine.

Compare the actual values and predicted values with a scatter plot:

``````import matplotlib.pyplot as plt
sns.scatterplot(x=y_test, y=y_pred, s=140)
plt.xlabel('y_test data')
plt.ylabel('Predictions')``````

### Evaluating the model

We can check the goodness of fit or the score of our model with the R2 (r2_score) metric.

``````score = r2_score(y_test, y_pred)
score``````

The score is 0.93, close to 1, indicating that our model fits the test data well.

## Complete code for multiple linear regression

``````import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

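# LOADING THE DATA
startups_df = pd.read_csv('50_Startups.csv')
startups_df.info()
startups_df.describe()
startups_df.shape

# numeric_only=True skips the categorical State column
startups_df.corr(numeric_only=True)
sns.heatmap(startups_df.corr(numeric_only=True), annot=True)

# SPLITTING THE DATA INTO INDEPENDENT AND DEPENDENT VARIABLES(X, y)
X = startups_df.iloc[:, :-1]    # independent variables
y = startups_df.iloc[:, -1]     # dependent variable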

# USING OneHotEncoder TO CONVERT CATEGORICAL DATA TO DUMMY VARIABLES

# creating instance of one-hot-encoder
enc = OneHotEncoder(drop='first') # drop the first dummy variable (K-1)
enc_df = pd.DataFrame(enc.fit_transform(X[['State']]).toarray())
enc_df.columns = ['Florida', 'New York']
# merge with main df on key values
X = X.join(enc_df)
X = X.drop('State', axis=1) # DROP THE State COLUMN

# SPLITTING DATA FOR train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# FITTING THE MODEL/TRAIN
regressor = LinearRegression() # instantiate the LinearRegression object
regressor.fit(X_train, y_train) # fit the model

print('Coefficients: ', regressor.coef_)
print('Intercept: ',regressor.intercept_)

# PREDICTING
y_pred=regressor.predict(X_test)
# predictions
comparison_df = pd.DataFrame({"Actual":y_test,"Predicted":y_pred})
print(comparison_df)

# RESIDUALS
residuals = y_test - y_pred
print('Residuals: ', residuals)

# CHECKING THE SCORE OF THE MODEL WITH R^2 METRIC
score = r2_score(y_test, y_pred)
print('Model Score: ', score, 'Equal to: ', score * 100, '%')``````

## Final thoughts

We can perform other types of regression, and we will discuss them in later articles. For this article, we have centered our interest on understanding linear regression and how to perform it with the Scikit-learn library. Generally, we have covered:

• The modeling process of Scikit-learn.
• Simple and multiple linear regression.
• How to split, train, test, and evaluate our linear regression models.
• What multicollinearity is in regression.
• Problems that might occur when building multiple linear regression models, like the dummy variable trap.
• Assumptions of linear regression.
• Basic plots with Seaborn relevant to the tasks.

### Brian Mutea

Software Engineer | Data Scientist with an appreciable passion for building models that fix problems and sharing knowledge.
