Задание 1.4. Написать на R функцию stepCV, аналогичную stepAIC, но использующую кросс-валидацию для проверки значимости признака. Нужно следовать при этом принципу иерархии — нельзя выкидывать признаки более низкого порядка, если есть признаки более высокого. Функция должна работать со всеми методами, с которыми работает stepAIC.
Hint: Можно расковырять оригинальную функцию stepAIC, а кросс-валидацию взять из e1071.
library(MASS)
library(e1071)
source("stepCV.R")
Пример посложнее
Advertising <- read.csv("data/Advertising.csv")[, - c(1)]
l <- lm(Sales~. , Advertising)
stepCV(l, trace = TRUE, direction="backward")
## Start: CV.performance=2.95
## Sales ~ TV + Radio + Newspaper
##
## Df rank CV
## - TV -15.919825 3 18.8590
## - Radio -7.107173 3 10.0463
## <none> 4 2.9391
## - Newspaper 0.035548 3 2.9036
##
## Step: CV.performance=18.67
## Sales ~ Radio + Newspaper
##
## Df rank CV
## - Radio -7.603313 2 26.125
## <none> 3 18.522
## - Newspaper 0.040492 2 18.481
##
## Step: CV.performance=26.29
## Sales ~ Newspaper
##
## Df rank CV
## <none> 2 26.453
## - Newspaper 26.453 1 0.000
##
## Call:
## lm(formula = Sales ~ Newspaper, data = Advertising)
##
## Coefficients:
## (Intercept) Newspaper
## 12.35141 0.05469
stepCV((lm(Sales~1, Advertising)), trace = TRUE, direction="forward", scope=Sales~I(Radio*Newspaper) + Radio + TV )
## Start: CV.performance=0
## Sales ~ 1
##
## Df rank CV
## + I(Radio * Newspaper) -23.175 2 23.175
## + Radio -18.663 2 18.663
## + TV -10.760 2 10.760
## <none> 1 0.000
##
## Step: CV.performance=23.04
## Sales ~ I(Radio * Newspaper)
##
## Df rank CV
## <none> 2 23.0136
## + Radio 3.4019 3 19.6117
## + TV 16.1870 3 6.8267
##
## Call:
## lm(formula = Sales ~ I(Radio * Newspaper), data = Advertising)
##
## Coefficients:
## (Intercept) I(Radio * Newspaper)
## 12.114256 0.002314
stepCV((lm(Sales~TV+Radio, Advertising)), trace = TRUE, direction="both", scope=list(lower=Sales~1, upper=Sales~I(Radio*Newspaper) + Radio + TV))$anova
## Start: CV.performance=2.85
## Sales ~ TV + Radio
##
## Df rank CV
## - TV -15.695533 2 18.6138
## - Radio -8.033049 2 10.9513
## + I(Radio * Newspaper) -0.025309 4 2.9713
## <none> 3 2.9183
##
## Step: CV.performance=18.23
## Sales ~ Radio
##
## Df rank CV
## + I(Radio * Newspaper) -0.37409 3 18.8667
## <none> 2 18.7553
## + TV 15.62982 3 2.8628
## - Radio 18.75534 1 0.0000
##
## Step: CV.performance=19.17
## Sales ~ Radio + I(Radio * Newspaper)
##
## Df rank CV
## - Radio -3.7722 2 22.6757
## <none> 3 18.9035
## - I(Radio * Newspaper) 0.3772 2 18.5263
## + TV 16.0099 4 3.0819
##
## Step: CV.performance=22.88
## Sales ~ I(Radio * Newspaper)
##
## Df rank CV
## <none> 2 23.3504
## + Radio 4.1621 3 18.9031
## + TV 15.9881 3 7.0771
## - I(Radio * Newspaper) 23.3504 1 0.0000
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Sales ~ TV + Radio
##
## Final Model:
## Sales ~ I(Radio * Newspaper)
##
##
## Step performance rank CV
## 1 2.845712 3 2.845712
## 2 - TV 18.230915 2 18.230915
## 3 + I(Radio * Newspaper) 19.168760 3 19.168760
## 4 - Radio 22.875072 2 22.875072
Можно сравнить результат с stepAIC:
stepAIC(l, trace = TRUE)
## Start: AIC=212.79
## Sales ~ TV + Radio + Newspaper
##
## Df Sum of Sq RSS AIC
## - Newspaper 1 0.09 556.9 210.82
## <none> 556.8 212.79
## - Radio 1 1361.74 1918.6 458.20
## - TV 1 3058.01 3614.8 584.90
##
## Step: AIC=210.82
## Sales ~ TV + Radio
##
## Df Sum of Sq RSS AIC
## <none> 556.9 210.82
## - Radio 1 1545.6 2102.5 474.52
## - TV 1 3061.6 3618.5 583.10
##
## Call:
## lm(formula = Sales ~ TV + Radio, data = Advertising)
##
## Coefficients:
## (Intercept) TV Radio
## 2.92110 0.04575 0.18799
Результаты получаются разные.