Repository for WM's Data 146 Course
First, import the relevant libraries and functions we will need:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler as SS
import numpy as np
import pandas as pd
Next, store the dataset we will be using:
data = fetch_california_housing(as_frame=True)
X = data.data
X_names = data.feature_names
y = data.target
df = data.frame
The `as_frame=True` parameter brings the data in as a pandas DataFrame.
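As a quick check (a minimal sketch using the objects defined above), we can confirm what `as_frame=True` returns:

    print(type(X))    # <class 'pandas.core.frame.DataFrame'>
    print(df.head())  # the eight features plus the MedHouseVal target column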
Creating a function for K-fold cross-validation with various parameters allows us to easily train different models without typing the same code over and over.
    def DoKFold(model, X, y, k, standardize=False, random_state=146):
        from sklearn.model_selection import KFold
        if standardize:
            from sklearn.preprocessing import StandardScaler as SS
            ss = SS()
        kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
        train_scores = []
        test_scores = []
        for idxTrain, idxTest in kf.split(X):
            Xtrain = X.iloc[idxTrain, :]
            Xtest = X.iloc[idxTest, :]
            ytrain = y.iloc[idxTrain]
            ytest = y.iloc[idxTest]
            if standardize:
                # Fit the scaler on the training fold only, then apply it to the
                # test fold, so no information leaks from the test data
                Xtrain = ss.fit_transform(Xtrain)
                Xtest = ss.transform(Xtest)
            model.fit(Xtrain, ytrain)
            train_scores.append(r2_score(ytrain, model.predict(Xtrain)))
            test_scores.append(r2_score(ytest, model.predict(Xtest)))
        return train_scores, test_scores
Which of the below features is most strongly correlated with the target?
Running df.corr() lets us see the Pearson correlation coefficient for each pair of variables in our dataset.
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| MedInc | 1 | -0.119034 | 0.326895 | -0.0620401 | 0.00483435 | 0.0187662 | -0.0798091 | -0.0151759 | 0.688075 |
| HouseAge | -0.119034 | 1 | -0.153277 | -0.0777473 | -0.296244 | 0.0131914 | 0.0111727 | -0.108197 | 0.105623 |
| AveRooms | 0.326895 | -0.153277 | 1 | 0.847621 | -0.0722128 | -0.00485229 | 0.106389 | -0.0275401 | 0.151948 |
| AveBedrms | -0.0620401 | -0.0777473 | 0.847621 | 1 | -0.0661974 | -0.0061812 | 0.0697211 | 0.0133444 | -0.0467005 |
| Population | 0.00483435 | -0.296244 | -0.0722128 | -0.0661974 | 1 | 0.0698627 | -0.108785 | 0.0997732 | -0.0246497 |
| AveOccup | 0.0187662 | 0.0131914 | -0.00485229 | -0.0061812 | 0.0698627 | 1 | 0.00236618 | 0.00247582 | -0.0237374 |
| Latitude | -0.0798091 | 0.0111727 | 0.106389 | 0.0697211 | -0.108785 | 0.00236618 | 1 | -0.924664 | -0.14416 |
| Longitude | -0.0151759 | -0.108197 | -0.0275401 | 0.0133444 | 0.0997732 | 0.00247582 | -0.924664 | 1 | -0.0459666 |
| MedHouseVal | 0.688075 | 0.105623 | 0.151948 | -0.0467005 | -0.0246497 | -0.0237374 | -0.14416 | -0.0459666 | 1 |
After looking at this table, it's easy to see that the feature most strongly correlated with the target is median income (MedInc), with a correlation coefficient of 0.688.
If the features are standardized, the correlations from the previous question do not change: the Pearson correlation is invariant under any positive linear rescaling of the variables, and standardization is exactly such a rescaling.
Xs = SS().fit_transform(X)
sdf = pd.DataFrame(Xs, index=X.index, columns=X.columns)
sdf['MedHouseVal'] = y
sdf.corr()
The correlations are the same as before.
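As a sanity check (a minimal sketch using NumPy), we can confirm the two correlation matrices agree numerically:

    np.allclose(df.corr(), sdf.corr())  # expected: True, since correlation is scale-invariant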
If we were to perform a linear regression using only the feature identified in question 15, what would be the coefficient of determination? Enter your answer to two decimal places, for example: 0.12
Let's go ahead and actually do the regression.
x = data.data[['MedInc']]
lin_reg = LinearRegression()
np.round(lin_reg.fit(x, y).score(x, y), 2)
Out:
0.47
We have an R2 value of 0.47.
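This is no coincidence: for a single-feature linear regression, R2 is the square of the Pearson correlation, and 0.688^2 ≈ 0.47. A quick check:

    np.round(df.corr().loc['MedInc', 'MedHouseVal'] ** 2, 2)  # 0.47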
Let's take a look at how a few different regression methods perform on this data.

Start with a linear regression:

- Standardize the data
- Perform a K-fold validation using:
  - k=20
  - shuffle=True
  - random_state=146

What is the mean R2 value on the test folds? Enter your answer to 5 decimal places, for example: 0.12345
train_scores, test_scores = DoKFold(lin_reg, X, y, 20, standardize=True)
print('Training: ' + format(np.mean(train_scores), '.5f'))
print('Testing: ' + format(np.mean(test_scores), '.5f'))
Out:
Training: 0.60630
Testing: 0.60198
Next, try Ridge regression.
To save you some time, I’ve determined that you should look at 101 equally spaced values between 20 and 30 for alpha.
Use the same settings for K-fold validation as in the previous question.
For the optimal value of alpha in this range, what is the mean R2 value on the test folds? Enter your answer to 5 decimal places, for example: 0.12345
    a_range = np.linspace(20, 30, 101)
    k = 20
    avg_tr_score = []
    avg_te_score = []
    for a in a_range:
        rid_reg = Ridge(alpha=a)
        train_scores, test_scores = DoKFold(rid_reg, X, y, k, standardize=True)
        avg_tr_score.append(np.mean(train_scores))
        avg_te_score.append(np.mean(test_scores))
    # Higher R2 is better, so take the alpha with the largest mean test score
    idx = np.argmax(avg_te_score)
    print('Optimal alpha value: ' + format(a_range[idx], '.3f'))
    print('Training score for this value: ' + format(avg_tr_score[idx], '.3f'))
    print('Testing score for this value: ' + format(avg_te_score[idx], '.5f'))
Out:
Optimal alpha value: 25.800
Training score for this value: 0.606
Testing score for this value: 0.60201
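If we want to see how flat the test-score curve is around this optimum, we can plot it (an optional sketch, assuming matplotlib is installed; it reuses a_range, avg_te_score, and idx from the loop above):

    import matplotlib.pyplot as plt

    plt.plot(a_range, avg_te_score, '-')
    plt.axvline(a_range[idx], linestyle='--')  # mark the optimal alpha
    plt.xlabel('alpha')
    plt.ylabel('Mean test R2')
    plt.show()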
Next, try Lasso regression. Look at 101 equally spaced values between 0.001 and 0.003.
Use the same settings for K-fold validation as in the previous 2 questions.
For the optimal value of alpha in this range, what is the mean R2 value on the test folds? Enter your answer to 5 decimal places, for example: 0.12345
    a_range = np.linspace(0.001, 0.003, 101)
    k = 20
    avg_tr_score = []
    avg_te_score = []
    for a in a_range:
        las_reg = Lasso(alpha=a)
        train_scores, test_scores = DoKFold(las_reg, X, y, k, standardize=True)
        avg_tr_score.append(np.mean(train_scores))
        avg_te_score.append(np.mean(test_scores))
    idx = np.argmax(avg_te_score)
    print('Optimal alpha value: ' + format(a_range[idx], '.5f'))
    print('Training score for this value: ' + format(avg_tr_score[idx], '.3f'))
    print('Testing score for this value: ' + format(avg_te_score[idx], '.5f'))
The optimal alpha is 0.00186, which gives an optimal R2 value on the test folds of 0.60213.
Let’s look at some of what these models are estimating.
Refit a linear, Ridge, and Lasso regression to the entire (standardized) dataset.
No need to do any train/test splits or K-fold validation here. Use the optimal alpha values you found previously.
Which of these models estimates the smallest coefficient for the variable that is least correlated (in terms of absolute value of the correlation coefficient) with the target?
    lin_reg = LinearRegression()
    lin_reg.fit(Xs, y)  # Xs is the standardized feature matrix from earlier
    rid_reg = Ridge(alpha=25.8)
    rid_reg.fit(Xs, y)
    las_reg = Lasso(alpha=0.00186)
    las_reg.fit(Xs, y)
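To read off the answers to this question and the next, it helps to put all three coefficient vectors side by side (a small sketch building on the fits above):

    coefs = pd.DataFrame({'Linear': lin_reg.coef_,
                          'Ridge': rid_reg.coef_,
                          'Lasso': las_reg.coef_},
                         index=X_names)
    print(coefs)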
Least correlated: AveOccup (average occupancy).
Ridge regression estimates the smallest coefficient for it, at -0.03925.
Which of the above models estimates the smallest coefficient for the variable that is most correlated (in terms of the absolute value of the correlation coefficient) with the target?

Most correlated: MedInc (median income).
Lasso regression estimates the smallest coefficient for it, at 0.82.
If we had looked at MSE instead of R2 when doing our Ridge regression (question 19), would we have determined the same optimal value for alpha, or something different?
Our new training method, which scores with MSE instead of R2:
    def DoKFold(model, X, y, k, standardize=False, random_state=146):
        from sklearn.model_selection import KFold
        if standardize:
            from sklearn.preprocessing import StandardScaler as SS
            ss = SS()
        kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
        train_scores = []
        test_scores = []
        for idxTrain, idxTest in kf.split(X):
            Xtrain = X.iloc[idxTrain, :]
            Xtest = X.iloc[idxTest, :]
            ytrain = y.iloc[idxTrain]
            ytest = y.iloc[idxTest]
            if standardize:
                Xtrain = ss.fit_transform(Xtrain)
                Xtest = ss.transform(Xtest)
            model.fit(Xtrain, ytrain)
            # Score with MSE instead of R2
            train_scores.append(mean_squared_error(ytrain, model.predict(Xtrain)))
            test_scores.append(mean_squared_error(ytest, model.predict(Xtest)))
        return train_scores, test_scores
Since lower MSE is better (unlike R2, where higher is better), the optimal alpha is now the one that minimizes the mean test MSE, found with np.argmin rather than np.argmax. This returns a different optimal alpha than the one we found using R2.
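Rerunning the Ridge search from question 19 with this MSE-based DoKFold looks the same as before, except that we now pick the minimizing alpha (a sketch; variable names mirror the loop above):

    a_range = np.linspace(20, 30, 101)
    avg_te_mse = []
    for a in a_range:
        rid_reg = Ridge(alpha=a)
        train_scores, test_scores = DoKFold(rid_reg, X, y, 20, standardize=True)
        avg_te_mse.append(np.mean(test_scores))
    idx = np.argmin(avg_te_mse)  # lower MSE is better
    print('Optimal alpha value: ' + format(a_range[idx], '.3f'))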
If we had looked at MSE instead of R2 when doing our Lasso regression (question 20), what would we have determined the optimal value for alpha to be? Enter your answer to 5 decimal places, for example: 0.12345
Running the above Lasso search with the new MSE-based training method (selecting with np.argmin) gives us an optimal alpha value of 0.00300.