mypackages = c("leaps", "glmnet")
tmp = setdiff(mypackages, rownames(installed.packages()))
if (length(tmp) > 0) install.packages(tmp)
library(leaps) # regsubsets
library(glmnet) # glmnet for lasso
This example is from simulation example 1 of Zhao and Yu (2006), which shows that LASSO could fail to pick the correct variable set when the so-called Irrepresentable Condition is violated.
Consider a simple linear regression model with three predictors and , and the response variable is modeled as
# set random seed in case you want to reproduce the result
set.seed(542)
n = 1000
x1 = rnorm(n)
x1 = rnorm(n)
x2 = rnorm(n)
e = rnorm(n)
x3 = 2/3 * x1 + 2/3 * x2 + 1/3 * e
epsilon = rnorm(n)
beta = c(2, 3)
y = beta[1] * x1 + beta[2] * x2 + epsilon
myData = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
IC = regsubsets(y ~ ., myData, method = "exhaustive")
sumIC = summary(IC)
sumIC$bic
## [1] -1488.766 -2581.920 -2575.242
sumIC
## Subset selection object
## Call: regsubsets.formula(y ~ ., myData, method = "exhaustive")
## 3 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## x1 x2 x3
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) "*" "*" " "
## 3 ( 1 ) "*" "*" "*"
msize = apply(sumIC$which, 1, sum)
AIC = n * log(sumIC$rss/n) + 2 * msize
BIC = n * log(sumIC$rss/n) + log(n) * msize
AIC; BIC
## 1 2 3
## 1113.79020 15.72812 17.49869
## 1 2 3
## 1123.60571 30.45139 37.12971
The lowest AIC/BIC is given by , that is, these two procedures choose the correct sub-model.
Next, we use LASSO with lambda.min and lambda.1se to check if it can select the correct model.
mylasso.cv = cv.glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))
plot(mylasso.cv)
coef(mylasso.cv, s = 'lambda.1se')
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.05015789
## x1 1.60443649
## x2 2.69940979
## x3 0.36282687
coef(mylasso.cv, s = 'lambda.min')
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.04624025
## x1 1.87702903
## x2 2.97065701
## x3 0.09587310
Neither approach selects the correct model. The following code provides two path plots of LASSO: the x-coordinate is the L1-norm of the coefficients in the “norm” plot and the x-coorindate is log-lambda in the “lambda” plot. You can see that won’t be dropped out of the model unless log-lambda is going to (or equivalently ). Thus it is impossible for LASSO to select the correct model.
mylasso = glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))
par(mfrow = c(2, 1))
plot(mylasso, label=TRUE, xvar = "norm")
plot(mylasso, label=TRUE, xvar = "lambda")
Next we can compare the corresponding prediction error on a test set.
N = 1000
mytestData = matrix(0, N, 4)
colnames(mytestData) = c("x1", "x2", "x3", "y")
mytestData = as.data.frame(mytestData)
mytestData$x1 = rnorm(N)
mytestData$x2 = rnorm(N)
mytestData$x3 = 2/3 * x1 + 2/3 * x2 + 1/3 * rnorm(N)
mytestData$y = beta[1] * mytestData$x1 +
beta[2] * mytestData$x2 + rnorm(N)
For Lasso, you can form prediction using one of the following approaches:
In this example, lambda.1se and lambda.min select the same model, the full model.
There are two commands for Lasso prediction: one takes the output from cv.glmnet
as input (where you can use lambda.min and lambda.1se), and the other one takes the output from glmnet
as input.
The results below indicate that for this example, the test error from the Refit procedure is not bad even though the model is wrong.
?predict.cv.glmnet
?predict.glmnet
tmp = predict(mylasso.cv, s="lambda.min",
newx=data.matrix(mytestData)[, -4])
mean((mytestData$y - tmp)^2)
## [1] 1.065004
tmp = predict(mylasso.cv, s="lambda.1se",
newx=data.matrix(mytestData)[, -4])
mean((mytestData$y - tmp)^2)
## [1] 1.367122
myfit.full = lm(y ~ ., myData)
tmp = predict(myfit.full, newdata=mytestData)
mean((mytestData$y - tmp)^2)
## [1] 1.057659
myfit.AIC = lm(y ~ x1 + x2, myData)
tmp = predict(myfit.AIC, newdata=mytestData)
mean((mytestData$y - tmp)^2)
## [1] 1.060607