A ridge regression model for the seatpos data, with hipcenter as the response and all other variables as predictors.

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know where different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers with 9 variables: Age, Weight, HtShoes (height in shoes in cm), Ht (bare-foot height in cm), Seated (seated height in cm), Arm (lower arm length in cm), Thigh (thigh length in cm), Leg (lower leg length in cm), and hipcenter (horizontal distance of the midpoint of the hips from a fixed location in the car in mm).

data("seatpos")
summary(seatpos) %>%
  kbl() %>%
  kable_styling(font_size = 10)
          Age      Weight   HtShoes   Ht       Seated    Arm     Thigh   Leg     hipcenter
Min.      19.00    100.0    152.8     150.2     79.40    26.00   31.00   30.20   -279.15
1st Qu.   22.25    131.8    165.7     163.6     85.20    29.50   35.73   33.80   -203.09
Median    30.00    153.5    171.9     169.5     89.40    32.00   38.55   36.30   -174.84
Mean      35.26    155.6    171.4     169.1     88.95    32.22   38.66   36.26   -164.88
3rd Qu.   46.75    174.0    177.6     175.7     91.62    34.48   41.30   38.33   -119.92
Max.      72.00    293.0    201.2     198.4    101.60    39.60   45.50   43.10    -30.95

The response hipcenter is negative because of how the distance is measured. Without loss of generality, we flip its sign so that the coefficient estimates and the correlations with the predictors are positive; the model's performance does not change.

seatpos$hipcenter <- -seatpos$hipcenter  # flip the sign of the response
corrplot(cor(seatpos), tl.col = "black", tl.srt = 45, order = "AOE")
Figure 15: Correlation matrix of seatpos

We first take a look at the correlation matrix (Figure 15). There are several large pairwise correlations, both among the predictors and between the predictors and the response (hipcenter). Next we check the eigendecomposition:

e <- eigen(t(data.matrix(seatpos)) %*% data.matrix(seatpos))$val  # eigenvalues of X'X
sqrt(e[1]/e)  # condition numbers
## [1]   1.000000   8.107987  16.130477  25.240545 132.800428 190.210314 242.337636
## [8] 302.014039 807.424129

There is a wide range in the eigenvalues and several condition numbers are large, which means the problems are caused by more than just one linear combination. The variance inflation factors (VIFs) of the predictors show substantial variance inflation:

vif(seatpos[,-9])
##        Age     Weight    HtShoes         Ht     Seated        Arm      Thigh 
##   1.997931   3.647030 307.429378 333.137832   8.951054   4.496368   2.762886 
##        Leg 
##   6.694291

The largest VIF is 333.14, for Ht (bare-foot height): its standard error is \(\sqrt{333.14} \approx 18.25\) times larger than it would have been without collinearity.
One cure for collinearity is amputation: too many variables are trying to do the same job of explaining the response. When several highly correlated variables are each associated with the response, we must take care not to conclude that the variables we drop have nothing to do with the response. If we want to keep all the variables in the model, we should consider an alternative method of estimation: ridge regression.
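
As a quick illustration of this interpretation, the standard-error inflation for every predictor is just the square root of its VIF (a small check, not part of the original output):

sqrt(vif(seatpos[, -9]))  # factor by which collinearity inflates each standard error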

Ridge regression assumes that the regression coefficients (after normalization) are not likely to be very large, so the idea of shrinkage is embedded in the method. It is appropriate when the design matrix is collinear and the usual least squares estimates of \(\beta\) appear to be unstable. Before fitting a ridge regression, the predictors need to be centered by their means and scaled by their standard deviations, and the response needs to be centered. The lm.ridge() function takes care of all that, so we do not need to standardize the variables manually.
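
To make the standardization and the shrinkage explicit, here is a minimal hand-rolled sketch of the ridge estimate \(\hat{\beta}_\lambda = (Z^\top Z + \lambda I)^{-1} Z^\top y_c\) for one arbitrary \(\lambda\); lm.ridge() uses a slightly different internal scaling convention, so its numbers can differ a little:

Z <- scale(data.matrix(seatpos[, -9]))               # center and scale the predictors
yc <- seatpos$hipcenter - mean(seatpos$hipcenter)    # center the response
lambda <- 10                                         # an arbitrary illustrative value
solve(t(Z) %*% Z + lambda * diag(ncol(Z)), t(Z) %*% yc)  # ridge coefficients for the scaled predictors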

Picking the range of \(\lambda\) requires some experimentation. We try small values of \(\lambda\) first (Figure 16):

ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,2,0.001))
matplot(ridge_fit$lambda, t(ridge_fit$coef), type='l',lty=1,
        xlab=expression(lambda),ylab=expression(hat(beta)))
Figure 16: Coefficients with \(\lambda\) from 0 to 2

We can see that the coefficients shrink as \(\lambda\) grows larger, so we try a larger range of \(\lambda\), up to a maximum of 20. The \(\lambda\) selected by the smallest GCV is:

ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,20,0.001))
MASS::select(ridge_fit)
## modified HKB estimator is 5.425415 
## modified L-W estimator is 3.589434 
## smallest value of GCV  at 20

The smallest generalized cross-validation (GCV) value occurs at \(\lambda = 20\), the upper end of the range, so we might find an even smaller GCV with a larger \(\lambda\). This time we extend the range of \(\lambda\) to 50 (Figure 17). The \(\lambda\) selected by the smallest GCV is:

ridge_fit <- lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0,50,0.001))
MASS::select(ridge_fit)
## modified HKB estimator is 5.425415 
## modified L-W estimator is 3.589434 
## smallest value of GCV  at 22.301

Now the smallest GCV occurs at \(\lambda = 22.301\) (Figure 18), which is around the middle of the range of \(\lambda\)'s, so the range of shrinkage is appropriate as long as its upper bound is greater than 22.3.

ridge_summary <- tidy(ridge_fit)
ridge_g <- glance(ridge_fit)
ggplot(ridge_summary, aes(lambda, estimate, color = term)) + 
  geom_line() +
  geom_vline(xintercept = ridge_g$lambdaGCV, lty=2)
Figure 17: Ridge trace plot. The generalized cross-validation choice of \(\lambda\) is shown as a vertical line.

ggplot(ridge_summary, aes(lambda, GCV)) + 
  geom_line() +
  geom_vline(xintercept = ridge_g$lambdaGCV, col = "red", lty = 2)
Figure 18: Plot of generalized cross-validation (GCV) versus \(\lambda\)

The coefficients for the standardized predictors at \(\lambda = 22.301\) are:

ridge_fit$coef[, which.min(ridge_fit$GCV)]
##       Age    Weight   HtShoes        Ht    Seated       Arm     Thigh       Leg 
## -7.329603  3.687670  8.186684  8.186005  5.816297  4.528778  4.998163 10.625503
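
Note that these are coefficients on the standardized scale. If coefficients on the original scale of the predictors (with an intercept) are wanted, MASS provides a coef() method for ridgelm objects that undoes the scaling; a minimal sketch:

coef(ridge_fit)[which.min(ridge_fit$GCV), ]  # original-scale coefficients at the selected lambda
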
Use the model to predict the response

This ridge regression both centers and scales the predictors, so we need to do the same when computing a prediction, and we need to add back the mean of the response because of the centering (see the code in the Appendix). So for a driver with Age=64.8, Weight=263.7, HtShoes=181.08, Ht=178.56, Seated=91.44, Arm=35.64, Thigh=40.95, and Leg=38.79, the predicted horizontal distance of the midpoint of the hips from the fixed location in the car is 194.77 mm.

(ypred <- ridge_fit$ym +  # add back the mean of the response
  scale(t(data.frame(x=c(64.800,263.700,181.080,178.560,91.440,35.640,40.950,38.790))),
        center=ridge_fit$xm, scale=ridge_fit$scales) %*%  # standardize the new values with the training means and scales
  ridge_fit$coef[, which.min(ridge_fit$GCV)])
##       [,1]
## x 194.7716
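
For comparison (an extra check, not part of the original analysis), the corresponding least squares prediction for the same driver could be obtained with predict() on an lm fit; lm_all below is a hypothetical name for an OLS fit on the full data:

lm_all <- lm(hipcenter ~ ., seatpos)  # hypothetical OLS fit for comparison
predict(lm_all, data.frame(Age=64.8, Weight=263.7, HtShoes=181.08, Ht=178.56,
                           Seated=91.44, Arm=35.64, Thigh=40.95, Leg=38.79))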

Root Mean Square Error (RMSE)

To evaluate how well our ridge regression performs, we divide the data into a training set (80%) and a test set (20%) and compute the RMSE on the test set. As a benchmark, we compare it with the RMSE of ordinary linear regression (i.e. least squares). The RMSE of ridge regression is 27 and the RMSE of linear regression is 25.85, so on this particular split linear regression does slightly better.

set.seed(551775328)
trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
seatpos_train <- seatpos[trainid,]
seatpos_test <- seatpos[-trainid,]
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0,50,0.001))

ypred <- ridge_fit$ym +
  scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
  ridge_fit$coef[, which.min(ridge_fit$GCV)]
sqrt(mean((ypred - seatpos_test$hipcenter)^2))

lm_fit <- lm(hipcenter ~ ., seatpos_train)
sqrt(mean((predict(lm_fit, seatpos_test) - seatpos_test$hipcenter)^2))

But with a different random split, ridge regression is better than OLS linear regression (the code is in the Appendix):

set.seed(666666)
trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
seatpos_train <- seatpos[trainid,]
seatpos_test <- seatpos[-trainid,]
ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0,50,0.001))

ypred <- ridge_fit$ym +
  scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
  ridge_fit$coef[, which.min(ridge_fit$GCV)]
lm_fit <- lm(hipcenter ~ ., seatpos_train)

data.frame(RMSE_Ridge = sqrt(mean((ypred - seatpos_test$hipcenter)^2)),
           RMSE_Linear=sqrt(mean((predict(lm_fit, seatpos_test) 
                                  - seatpos_test$hipcenter)^2))) %>% 
  kbl() %>% kable_styling(full_width = F)
 RMSE_Ridge   RMSE_Linear
   41.48973      57.39749

To draw a firmer conclusion about how ridge regression compares with linear regression, we run the simulation 1000 times in a loop and compare the mean RMSE of the two models (see the code in the Appendix). Here is the result:

RMSE_Ridge <- rep(NA, 1000)
RMSE_Linear <- rep(NA, 1000)
for (i in 1:1000) {
  trainid <- sample(1:nrow(seatpos), 0.8*nrow(seatpos))
  seatpos_train <- seatpos[trainid,]
  seatpos_test <- seatpos[-trainid,]
  ridge_fit <- lm.ridge(hipcenter ~ ., seatpos_train, lambda = seq(0, 50, 0.001))

  ypred <- ridge_fit$ym +
    scale(seatpos_test[,-9], center=ridge_fit$xm, scale=ridge_fit$scales) %*%
    ridge_fit$coef[, which.min(ridge_fit$GCV)]
  lm_fit <- lm(hipcenter ~ ., seatpos_train)

  RMSE_Ridge[i] <- sqrt(mean((ypred - seatpos_test$hipcenter)^2))
  RMSE_Linear[i] <- sqrt(mean((predict(lm_fit, seatpos_test)
                               - seatpos_test$hipcenter)^2))
}

data.frame("Mean of RMSE in Ridge Regression" = mean(RMSE_Ridge),
           "Mean of RMSE in Linear Regression"=mean(RMSE_Linear)) %>%
    kbl() %>% kable_styling(full_width = F)
 Mean of RMSE in Ridge Regression   Mean of RMSE in Linear Regression
                         37.92238                            44.61189

As we can see, the RMSE of ridge regression is, on average, smaller than that of linear regression.
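
If we wanted to quantify "on average" a bit further, one simple extra summary (not part of the original output) is the fraction of the 1000 splits on which ridge regression achieves the lower RMSE:

mean(RMSE_Ridge < RMSE_Linear)  # proportion of splits where ridge beats linear regression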