Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know where different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers with 9 variables: Age, Weight, HTShoes (height in shoes in cm), Ht (height bare foot in cm), Seated (seated height in cm), Arm (lower arm length in cm), Thigh (thigh length in cm), Leg (lower leg length in cm), and hipcenter (horizontal distance of the midpoint of the hips from a fixed location in the car in mm).
data("seatpos")
summary(seatpos) %>%
kbl() %>%
kable_styling(font_size = 10)
Age | Weight | HtShoes | Ht | Seated | Arm | Thigh | Leg | hipcenter | |
---|---|---|---|---|---|---|---|---|---|
Min. :19.00 | Min. :100.0 | Min. :152.8 | Min. :150.2 | Min. : 79.40 | Min. :26.00 | Min. :31.00 | Min. :30.20 | Min. :-279.15 | |
1st Qu.:22.25 | 1st Qu.:131.8 | 1st Qu.:165.7 | 1st Qu.:163.6 | 1st Qu.: 85.20 | 1st Qu.:29.50 | 1st Qu.:35.73 | 1st Qu.:33.80 | 1st Qu.:-203.09 | |
Median :30.00 | Median :153.5 | Median :171.9 | Median :169.5 | Median : 89.40 | Median :32.00 | Median :38.55 | Median :36.30 | Median :-174.84 | |
Mean :35.26 | Mean :155.6 | Mean :171.4 | Mean :169.1 | Mean : 88.95 | Mean :32.22 | Mean :38.66 | Mean :36.26 | Mean :-164.88 | |
3rd Qu.:46.75 | 3rd Qu.:174.0 | 3rd Qu.:177.6 | 3rd Qu.:175.7 | 3rd Qu.: 91.62 | 3rd Qu.:34.48 | 3rd Qu.:41.30 | 3rd Qu.:38.33 | 3rd Qu.:-119.92 | |
Max. :72.00 | Max. :293.0 | Max. :201.2 | Max. :198.4 | Max. :101.60 | Max. :39.60 | Max. :45.50 | Max. :43.10 | Max. : -30.95 |
The response hipcenter is negative, since it measures the distance, without loss of generality, we could make it as positive to get positive esimates for coefficients, and possitive correlations with predictors. The model performance will not change.
seatpos$hipcenter <- -seatpos$hipcenter
corrplot(cor(seatpos),tl.col="black",tl.srt =45,order = "AOE")
We first take a look at the correlation matrix (Figure 15). There are several large pairwise correlations both between predictors and between predictors and the response (hipcenter). Now we check the eigendecomposition:
e <- eigen(t(data.matrix(seatpos)) %*% data.matrix(seatpos))$val
sqrt(e[1]/e)
## [1] 1.000000 8.107987 16.130477 25.240545 132.800428 190.210314 242.337636
## [8] 302.014039 807.424129
There is a wide range in the eigenvalues and several condition numbers are large. This means that problems are being caused by more than just one linear combination. There is much variance inflation when we look at the variance inflation factors (VIFs) of predictors:
vif(seatpos[,-9])
## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
The highest value is height bare foot in cm, the standard error is \(\sqrt{307.43} = 17.53\) times larger than it would have been without collinearity.
One cure for collinearity is amputation. We have too many variables that are trying to do the same job of explaining the response. When several variables, which are highly correlated, are each associated with the response, we have to take care that we do not conclude that the variables we drop have nothing to do with the response. In order to keep all your variables in the model, we should consider a method of estimation which is ridge regression.
Ridge regression makes the assumption that the regression coefficients (after normalization) are not likely to be very large. The idea of shrinkage is therefore embedded in the method. It is appropriate for use when the design matrix is collinear and the usual least squares estimates of \(\beta\) appear to be unstable. Before doing Ridge regression, we need to center the predictors by their means and scaled by their standard deviations and center the response. The lm.ridge() function takes care all that so we do not need to manually center variables.
Picking the range of \(\lambda\) needs some experiements. We try some small \(\lambda\) first (Figure 16):
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,2,0.001))
matplot(ridge_fit$lambda, t(ridge_fit$coef), type='l',lty=1,
xlab=expression(lambda),ylab=expression(hat(beta)))
We can observe that the coefficients shrink as \(\lambda\) grows larger, so we should try larger \(\lambda\) with maximum \(\lambda\) is 20. The selected \(\lambda\) with smallest GCV is:
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,20,0.001))
MASS::select(ridge_fit)
## modified HKB estimator is 5.425415
## modified L-W estimator is 3.589434
## smallest value of GCV at 20
The smallest generalized crossvalidation (GCV) at \(\lambda=20\), so we might even get smaller GCV if we try larger \(\lambda\). This time we extend the range of \(\lambda\) to 50 (Figure 17). The selected \(\lambda\) with smallest GCV is:
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,50,0.001))
MASS::select(ridge_fit)
## modified HKB estimator is 5.425415
## modified L-W estimator is 3.589434
## smallest value of GCV at 22.301
Now we have the smallest GCV at \(\lambda=22.301\) (Figure 18), which is around the middle of the range of \(\lambda\)’s. So we have an approriate range of shrinkage as long as the upper bound is greater than 22.3.
ridge_summary <- tidy(ridge_fit)
ridge_g <- glance(ridge_fit)
ggplot(ridge_summary, aes(lambda, estimate, color = term)) +
geom_line() +
geom_vline(xintercept = ridge_g$lambdaGCV, lty=2)
ggplot(ridge_summary, aes(lambda, GCV)) +
geom_line() +
geom_vline(xintercept = ridge_g$lambdaGCV, col = "red", lty = 2)
ridge_fit$coef[, which.min(ridge_fit$GCV)]
## Age Weight HtShoes Ht Seated Arm Thigh Leg
## -7.329603 3.687670 8.186684 8.186005 5.816297 4.528778 4.998163 10.625503
This ridge regression both centers and scales the predictors, so we need to do the same in computing the fit. Furthermore, we need to add back in the mean of the response because of the centering (see the code in the Appendix). So with Age=64.8, Weight=263.7, HtShoes=181.08, Ht=178.560, Seated=91.44, Arm=35.64, Thigh=40.95, and Leg=38.79, the horizontal distance of the midpoint of the hips from a fixed location in the car is 194.77(mm).
(ypred <- ridge_fit$ym + #add back the mean of the response
scale(t(data.frame(x=c(64.800,263.700,181.080,178.560,91.440,35.640,40.950,38.790))),
center=ridge_fit$xm, scale=ridge_fit$scales) %*% #center and scale back
ridge_fit$coef[, which.min(ridge_fit$GCV)])
## [,1]
## x 194.7716
To evaluate how good our Ridge regression is, we will divide the data into trainning (80%) and test set (20%), then compute the RMSE. To get a better criteria, we will compare its RMSE with that of Linear regression (i.e. Least Squares). The RMSE of Ridge regression is 27 and the RMSE of the Linear regression is 25.85. It seems that linear regression performs better than Ridge regression with smaller RMSE.
set.seed(551775328)
trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
seatpos_train <- seatpos[trainid,]
seatpos_test <- seatpos[-trainid,]
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0,50,0.001))
ypred <- ridge_fit$ym +
scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
ridge_fit$coef[, which.min(ridge_fit$GCV)]
sqrt(mean((ypred - seatpos_test$hipcenter)^2))
lm_fit <- lm(hipcenter ~ ., seatpos_train)
sqrt(mean((predict(lm_fit, seatpos_test) - seatpos_test$hipcenter)^2))
But in different simulation, Ridge regression is better than OLS Linear regression (the code in the Appendix):
set.seed(666666)
trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
seatpos_train <- seatpos[trainid,]
seatpos_test <- seatpos[-trainid,]
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0,50,0.001))
ypred <- ridge_fit$ym +
scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
ridge_fit$coef[, which.min(ridge_fit$GCV)]
lm_fit <- lm(hipcenter ~ ., seatpos_train)
data.frame(RMSE_Ridge = sqrt(mean((ypred - seatpos_test$hipcenter)^2)),
RMSE_Linear=sqrt(mean((predict(lm_fit, seatpos_test)
- seatpos_test$hipcenter)^2))) %>%
kbl() %>% kable_styling(full_width = F)
RMSE_Ridge | RMSE_Linear |
---|---|
41.48973 | 57.39749 |
So to conclude how good Ridge Regression is as compare to Linear Regression. We create a loop to run simulation 1000 times and then compare means of RMSE from two models (see the code in the Appendix). Here is the result:
RMSE_Ridge <- rep(NA,1000)
RMSE_Linear <- rep(NA,1000)
for (i in 1:1000) {
trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
seatpos_train <- seatpos[trainid,]
seatpos_test <- seatpos[-trainid,]
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0,50,0.001))
ypred <- ridge_fit$ym +
scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
ridge_fit$coef[, which.min(ridge_fit$GCV)]
lm_fit <- lm(hipcenter ~ ., seatpos_train)
RMSE_Ridge[i] <- sqrt(mean((ypred - seatpos_test$hipcenter)^2))
RMSE_Linear[i] <- sqrt(mean((predict(lm_fit, seatpos_test)
- seatpos_test$hipcenter)^2))
}
data.frame("Mean of RMSE in Ridge Regression" = mean(RMSE_Ridge),
"Mean of RMSE in Linear Regression"=mean(RMSE_Linear)) %>%
kbl() %>% kable_styling(full_width = F)
Mean.of.RMSE.in.Ridge.Regression | Mean.of.RMSE.in.Linear.Regression |
---|---|
37.92238 | 44.61189 |
As we can see, RMSE in Ridge regression in general is smaller than Linear Regression.