Most code is from Lab 7.8 in ISLR.

library(ISLR) 
attach(Wage)
dim(Wage)
## [1] 3000   11
names(Wage)
##  [1] "year"       "age"        "maritl"     "race"       "education" 
##  [6] "region"     "jobclass"   "health"     "health_ins" "logwage"   
## [11] "wage"
range(age)
## [1] 18 80

Fit a Polynomial Regression Model

poly( , 3) returns an n-by-3 matrix: each column has mean zero and sample variance 1, and they are orthogonal to each other. The 1st column is a linear combination of age and intercept, the 2nd column is a linear combination of age^2, age, and intercept, and the 3rd column is a linear combination of age^3, age^2, age, and intercept.

tmp = poly(age, 3)
dim(tmp)
## [1] 3000    3
colMeans(tmp)
##          1          2          3 
##  3.959e-19  1.722e-19 -1.923e-19
round(t(tmp) %*% tmp, dig=4)
##   1 2 3
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
fit = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit)$coef, dig = 3)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)      111.7      0.729 153.211    0.000
## poly(age, 3)1    447.1     39.933  11.195    0.000
## poly(age, 3)2   -478.3     39.933 -11.978    0.000
## poly(age, 3)3    125.5     39.933   3.143    0.002

Alternatively we can use the default design matrix where the j-th column corresponds to age^j.

fit2 = lm(wage ~ age + I(age^2) + I(age^3), data=Wage)
round(summary(fit2)$coef, dig = 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -75.244     22.184  -3.392    0.001
## age           10.190      1.605   6.348    0.000
## I(age^2)      -0.168      0.037  -4.559    0.000
## I(age^3)       0.001      0.000   3.143    0.002
# The default design matrix can also be generated by poly
# with option raw = TRUE. 
# fit3 should return the same set of coefficients as fit2

# fit3 = lm(wage ~ poly(age, 3, raw = T), data = Wage)
# round(summary(fit2)$coef, dig = 3)

Note that although the coefficients from fit and the ones from fit2 are different, the t-value and p-value for the last predictor are always the same.

Different ways to fit a polynomial regression model in R. The coefficients might not be the same but the fitted curves are the same.

Predict the wage at age = 82. fit and fit2 should give you the same answer.

predict(fit, newdata = list(age=82))
##     1 
## 98.87
predict(fit2, newdata = list(age=82))
##     1 
## 98.87

The fitted curve from the all three models should be the same.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se=TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col="darkgrey")
title("Degree -3 Polynomial ")
lines(age.grid, preds$fit, lwd=2, col="blue")

Forward Selection on d

Forward selection for the polynomial order d based on the significance of the coefficient of the highest order starting with quardratic polynomial function, and we finally pick d=3.

summary(lm(wage ~ poly(age, 2), data=Wage))$coef
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      111.7     0.7302  152.98 0.000e+00
## poly(age, 2)1    447.1    39.9926   11.18 1.878e-28
## poly(age, 2)2   -478.3    39.9926  -11.96 3.077e-32
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      111.7     0.7291 153.211 0.000e+00
## poly(age, 3)1    447.1    39.9335  11.195 1.571e-28
## poly(age, 3)2   -478.3    39.9335 -11.978 2.512e-32
## poly(age, 3)3    125.5    39.9335   3.143 1.687e-03
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)     111.70     0.7287 153.283 0.000e+00
## poly(age, 4)1   447.07    39.9148  11.201 1.485e-28
## poly(age, 4)2  -478.32    39.9148 -11.983 2.356e-32
## poly(age, 4)3   125.52    39.9148   3.145 1.679e-03
## poly(age, 4)4   -77.91    39.9148  -1.952 5.104e-02

Backward Selection on d

Back selection for the polynomial order d based on the significance of the coefficient of the highest order starting with d=6, and we finally pick d=3. For this data, the forward and the backward approaches happen to pick the same d value, but in general, the two choices (backward or forward) for d could differ.

summary(lm(wage ~ poly(age, 6), data=Wage))$coef
##               Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)     111.70     0.7286 153.3156 0.000e+00
## poly(age, 6)1   447.07    39.9063  11.2029 1.448e-28
## poly(age, 6)2  -478.32    39.9063 -11.9860 2.290e-32
## poly(age, 6)3   125.52    39.9063   3.1454 1.675e-03
## poly(age, 6)4   -77.91    39.9063  -1.9524 5.099e-02
## poly(age, 6)5   -35.81    39.9063  -0.8974 3.696e-01
## poly(age, 6)6    62.71    39.9063   1.5714 1.162e-01
summary(lm(wage ~ poly(age, 5), data=Wage))$coef
##               Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)     111.70     0.7288 153.2780 0.000e+00
## poly(age, 5)1   447.07    39.9161  11.2002 1.491e-28
## poly(age, 5)2  -478.32    39.9161 -11.9830 2.368e-32
## poly(age, 5)3   125.52    39.9161   3.1446 1.679e-03
## poly(age, 5)4   -77.91    39.9161  -1.9519 5.105e-02
## poly(age, 5)5   -35.81    39.9161  -0.8972 3.697e-01
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)     111.70     0.7287 153.283 0.000e+00
## poly(age, 4)1   447.07    39.9148  11.201 1.485e-28
## poly(age, 4)2  -478.32    39.9148 -11.983 2.356e-32
## poly(age, 4)3   125.52    39.9148   3.145 1.679e-03
## poly(age, 4)4   -77.91    39.9148  -1.952 5.104e-02
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      111.7     0.7291 153.211 0.000e+00
## poly(age, 3)1    447.1    39.9335  11.195 1.571e-28
## poly(age, 3)2   -478.3    39.9335 -11.978 2.512e-32
## poly(age, 3)3    125.5    39.9335   3.143 1.687e-03