Most code is from Lab 7.8 in ISLR.
library(ISLR)
attach(Wage)
dim(Wage)
## [1] 3000 11
names(Wage)
## [1] "year" "age" "maritl" "race" "education"
## [6] "region" "jobclass" "health" "health_ins" "logwage"
## [11] "wage"
range(age)
## [1] 18 80
poly( , 3)
returns an n-by-3 matrix: each column has mean zero and sample variance 1, and they are orthogonal to each other. The 1st column is a linear combination of age and intercept, the 2nd column is a linear combination of age^2, age, and intercept, and the 3rd column is a linear combination of age^3, age^2, age, and intercept.
tmp = poly(age, 3)
dim(tmp)
## [1] 3000 3
colMeans(tmp)
## 1 2 3
## 3.959e-19 1.722e-19 -1.923e-19
round(t(tmp) %*% tmp, dig=4)
## 1 2 3
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
fit = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7 0.729 153.211 0.000
## poly(age, 3)1 447.1 39.933 11.195 0.000
## poly(age, 3)2 -478.3 39.933 -11.978 0.000
## poly(age, 3)3 125.5 39.933 3.143 0.002
Alternatively we can use the default design matrix where the j-th column corresponds to age^j.
fit2 = lm(wage ~ age + I(age^2) + I(age^3), data=Wage)
round(summary(fit2)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.244 22.184 -3.392 0.001
## age 10.190 1.605 6.348 0.000
## I(age^2) -0.168 0.037 -4.559 0.000
## I(age^3) 0.001 0.000 3.143 0.002
# The default design matrix can also be generated by poly
# with option raw = TRUE.
# fit3 should return the same set of coefficients as fit2
# fit3 = lm(wage ~ poly(age, 3, raw = T), data = Wage)
# round(summary(fit2)$coef, dig = 3)
Note that although the coefficients from fit
and the ones from fit2
are different, the t-value and p-value for the last predictor are always the same.
Different ways to fit a polynomial regression model in R. The coefficients might not be the same but the fitted curves are the same.
Predict the wage at age = 82. fit and fit2 should give you the same answer.
predict(fit, newdata = list(age=82))
## 1
## 98.87
predict(fit2, newdata = list(age=82))
## 1
## 98.87
The fitted curve from the all three models should be the same.
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se=TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col="darkgrey")
title("Degree -3 Polynomial ")
lines(age.grid, preds$fit, lwd=2, col="blue")
Forward selection for the polynomial order d based on the significance of the coefficient of the highest order starting with quardratic polynomial function, and we finally pick d=3.
summary(lm(wage ~ poly(age, 2), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7 0.7302 152.98 0.000e+00
## poly(age, 2)1 447.1 39.9926 11.18 1.878e-28
## poly(age, 2)2 -478.3 39.9926 -11.96 3.077e-32
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7 0.7291 153.211 0.000e+00
## poly(age, 3)1 447.1 39.9335 11.195 1.571e-28
## poly(age, 3)2 -478.3 39.9335 -11.978 2.512e-32
## poly(age, 3)3 125.5 39.9335 3.143 1.687e-03
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70 0.7287 153.283 0.000e+00
## poly(age, 4)1 447.07 39.9148 11.201 1.485e-28
## poly(age, 4)2 -478.32 39.9148 -11.983 2.356e-32
## poly(age, 4)3 125.52 39.9148 3.145 1.679e-03
## poly(age, 4)4 -77.91 39.9148 -1.952 5.104e-02
Back selection for the polynomial order d based on the significance of the coefficient of the highest order starting with d=6, and we finally pick d=3. For this data, the forward and the backward approaches happen to pick the same d value, but in general, the two choices (backward or forward) for d could differ.
summary(lm(wage ~ poly(age, 6), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70 0.7286 153.3156 0.000e+00
## poly(age, 6)1 447.07 39.9063 11.2029 1.448e-28
## poly(age, 6)2 -478.32 39.9063 -11.9860 2.290e-32
## poly(age, 6)3 125.52 39.9063 3.1454 1.675e-03
## poly(age, 6)4 -77.91 39.9063 -1.9524 5.099e-02
## poly(age, 6)5 -35.81 39.9063 -0.8974 3.696e-01
## poly(age, 6)6 62.71 39.9063 1.5714 1.162e-01
summary(lm(wage ~ poly(age, 5), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70 0.7288 153.2780 0.000e+00
## poly(age, 5)1 447.07 39.9161 11.2002 1.491e-28
## poly(age, 5)2 -478.32 39.9161 -11.9830 2.368e-32
## poly(age, 5)3 125.52 39.9161 3.1446 1.679e-03
## poly(age, 5)4 -77.91 39.9161 -1.9519 5.105e-02
## poly(age, 5)5 -35.81 39.9161 -0.8972 3.697e-01
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70 0.7287 153.283 0.000e+00
## poly(age, 4)1 447.07 39.9148 11.201 1.485e-28
## poly(age, 4)2 -478.32 39.9148 -11.983 2.356e-32
## poly(age, 4)3 125.52 39.9148 3.145 1.679e-03
## poly(age, 4)4 -77.91 39.9148 -1.952 5.104e-02
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7 0.7291 153.211 0.000e+00
## poly(age, 3)1 447.1 39.9335 11.195 1.571e-28
## poly(age, 3)2 -478.3 39.9335 -11.978 2.512e-32
## poly(age, 3)3 125.5 39.9335 3.143 1.687e-03