Module 7 - Tutorial - Linear Regression

Back to the course outline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn #library for predictive modeling
%matplotlib inline
from sklearn.datasets import load_boston
#load_boston is a sample dataset from the scikit-learn library
boston = load_boston()

Available Datasets

load_boston([return_X_y]) Load and return the boston house-prices dataset (regression).

load_iris([return_X_y]) Load and return the iris dataset (classification).

load_diabetes([return_X_y]) Load and return the diabetes dataset (regression).

load_digits([n_class, return_X_y]) Load and return the digits dataset (classification).

load_linnerud([return_X_y]) Load and return the linnerud dataset (multivariate regression).

load_wine([return_X_y]) Load and return the wine dataset (classification).

load_breast_cancer([return_X_y]) Load and return the breast cancer wisconsin dataset (classification).

#data = dataset, target = dependent variable, feature_names = column headers, DESCR = data dictionary
boston.keys()
#506 rows, 13 columns
boston.data.shape
#list of column names from sample Boston dataset
print(boston.feature_names)
#description of the sample Boston housing dataset
print(boston.DESCR)
#make a dataframe from the sample Boston housing dataset
bos = pd.DataFrame(boston.data)
bos.head()
#add the list of column names from the sample Boston housing dataset to the dataframe
bos.columns = boston.feature_names
bos.head()
#verifying the first 5 rows in the dependent variable datatset
boston.target[:5]
#assigning dependent variable to column named "Price"
bos['PRICE'] = boston.target
bos.head()
#compare number of rooms to home price
plt.scatter(bos['RM'], bos['PRICE']) 
plt.xlabel("Average number of rooms per dwelling (RM)") 
plt.ylabel("Housing Price") 
plt.title("Relationship between RM and Price") 
plt.show()
from sklearn.linear_model import LinearRegression
#make dataframe that only contains predictive features
X = bos.drop('PRICE', axis = 1)
#assign linear regression function to a variable
lm = LinearRegression()
lm
Important functions to keep in mind while fitting a linear regression model are:
  • lm.fit() -> fits a linear model

  • lm.predict() -> Predict Y using the linear model with estimated coefficients

  • lm.score() -> Returns the coefficient of determination (R^2). A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model.

#same as statsmodel; fit will calculate linear regression model based on values of predictive features
lm.fit(X, bos['PRICE'])
#y-intercept for the linear regression formula
print('Estimated intercept coefficient:', lm.intercept_)
#number of predictive features
print('Number of coefficients:', len(lm.coef_))
#show the slope (weight) for each predictive feature
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns = ['features', 'estimatedCoefficients'])
#show the first 5 values that the model predicted
lm.predict(X)[0:5]
#score determines how accurate the model predicted
lm.score(X, bos['PRICE'])
#visual comparison between the true price of a house and the price the model predicted
plt.scatter(bos['PRICE'], lm.predict(X))
plt.xlabel("Price")
plt.ylabel("Predicted Price")
plt.title("Price vs Predicted Price")
plt.show()

Randomize train and test data

Training data should be randomize so that the model can learn from a diverse set of predictive features. This helps it to have better predictive ability when trying on a test set (data that it has not seen before).

#test_size default = 0.25
#train_test_split function give back 4 variables
#2 variables for X (the predictive features) - training and testing
#2 variables for y (the target) - training and testing
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
    X, bos.PRICE, test_size=0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
#calculate the model using the training data
lm.fit(X_train, Y_train)
#predictions for training and test data
pred_train = lm.predict(X_train)
pred_test = lm.predict(X_test)
#evaluate the accuracy of the model of training vs test
print(lm.score(X_train, Y_train))
print(lm.score(X_test, Y_test))

Back to the course outline

Runtimes (1)