This Box uses a simulated data set. Create file with 20 observations for 4 variables (Y, X1 and X2) with specific correlation matrix. This step was done once for the example in Box 8.7. If you run it, you’ll get different answers. To reproduce the example, use the data file collin.csv
library(mvtnorm)
sigma <- matrix (c(1,0.7,0.1,0.7,1,0.2,0.1,0.2,1), nrow=3)
qk <-rmvnorm(20, mean=c(0,0,0), sigma=sigma, method="chol")
var(qk)
[,1] [,2] [,3]
[1,] 1.4215899 0.67379244 -0.46784912
[2,] 0.6737924 0.97008421 -0.08796963
[3,] -0.4678491 -0.08796963 0.98389681
qk
[,1] [,2] [,3]
[1,] -2.6753310 -0.82462451 2.30765041
[2,] 0.5915112 0.04514730 -1.31632800
[3,] -0.9358386 0.11533014 1.36677476
[4,] 1.8176974 1.00587842 0.59996750
[5,] -0.2784551 -2.01151610 0.96326073
[6,] -0.8520420 0.70164582 -1.18420067
[7,] 0.4600557 -0.02974052 -0.59040370
[8,] 1.4838772 1.12072224 0.56206369
[9,] 1.5836442 0.26863959 -0.54506209
[10,] -0.7504159 -1.32211293 -0.42953936
[11,] 1.7503685 -0.02984474 -1.39288058
[12,] -0.1720523 -0.50683380 -0.49632457
[13,] -0.2660846 -0.66137776 -0.81303339
[14,] -0.1216650 -1.52886761 0.03911104
[15,] 0.8541191 0.20971200 0.20127394
[16,] 0.9446586 0.88989046 1.29312858
[17,] 1.2724257 1.77190214 -0.17236910
[18,] 0.3745455 -0.26300192 0.80397442
[19,] -1.3176221 -0.18688775 0.96112434
[20,] -1.0517918 -1.54368965 0.24067730
Create new variable, X2a in excel by rearranging values in x2 to be highly correlated with x1
Read in data (collin.csv) and check correlations and VIFs and fit linear models
library(car)
collin <- read.csv("../data/collin.csv")
cor(collin[,c('y','x1','x2','x2a')])
y x1 x2 x2a
y 1.0000000 0.6620854 0.1427197 0.5945663
x1 0.6620854 1.0000000 0.2640239 0.9473994
x2 0.1427197 0.2640239 1.0000000 0.1381989
x2a 0.5945663 0.9473994 0.1381989 1.0000000
collin1.lm <- lm(y~x1+x2, data=collin)
vif(collin1.lm)
x1 x2
1.074932 1.074932
summary(collin1.lm)
Call:
lm(formula = y ~ x1 + x2, data = collin)
Residuals:
Min 1Q Median 3Q Max
-0.9309 -0.4058 -0.0693 0.3073 1.3526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.06903 0.93790 3.272 0.00449 **
x1 0.49138 0.13783 3.565 0.00238 **
x2 -0.03098 0.16907 -0.183 0.85680
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6387 on 17 degrees of freedom
Multiple R-squared: 0.4395, Adjusted R-squared: 0.3735
F-statistic: 6.664 on 2 and 17 DF, p-value: 0.007297
collin2.lm <- lm(y~x1+x2a, data=collin)
vif(collin2.lm)
x1 x2a
9.762339 9.762339
summary(collin2.lm)
Call:
lm(formula = y ~ x1 + x2a, data = collin)
Residuals:
Min 1Q Median 3Q Max
-0.92204 -0.41660 -0.03692 0.24698 1.30171
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2195 0.8302 3.878 0.00121 **
x1 0.7061 0.4119 1.714 0.10466
x2a -0.2866 0.5053 -0.567 0.57794
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6334 on 17 degrees of freedom
Multiple R-squared: 0.4488, Adjusted R-squared: 0.3839
F-statistic: 6.921 on 2 and 17 DF, p-value: 0.006327