This example continues from Box 9.1 and earlier boxes.
A regression tree for the data from Loyn (1987) related the abundance of forest birds in 56 forest fragments to log patch area, log distance to nearest patch, grazing intensity, altitude and year of isolation. Transformations of predictors to improve linearity are unnecessary for regression trees (and made almost no difference to this specific analysis), but we kept these predictors transformed to match our previous analyses of these data (Boxes 8.2 and 9.1). We used the anova method, which chooses each split to maximize the between-groups SS, and kept the default cp value of 0.01, so a split was retained only if it increased r² by at least 0.01. No other tree-building constraints were imposed beyond the rpart defaults. The residual plot from fitting the regression tree did not reveal any strong variance heterogeneity or outliers.
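The anova criterion described above can be sketched on toy data: a candidate split is scored by the proportion of the parent node's SS that it converts into between-groups SS. The helper names `ss` and `split_improve` and the six toy observations are illustrative assumptions, not part of the Loyn analysis or of rpart itself.

```r
# Score one candidate split the way an anova-type regression tree does:
# improvement = 1 - (SS_left + SS_right) / SS_parent.
# ss(), split_improve() and the toy vectors are hypothetical, for illustration only.
ss <- function(v) sum((v - mean(v))^2)

split_improve <- function(y, x, cut) {
  left  <- y[x <  cut]   # observations sent to the left child
  right <- y[x >= cut]   # observations sent to the right child
  1 - (ss(left) + ss(right)) / ss(y)
}

y <- c(1, 2, 3, 10, 11, 12)   # toy response
x <- 1:6                      # toy predictor
improve <- split_improve(y, x, cut = 3.5)
improve
```

For the root node this proportion is the increase in r² produced by the split. For deeper nodes the same ratio is reported relative to the node being split, so it is not directly comparable with cp, which is scaled by the root SS.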
First, load the required packages (rpart, randomForest, Metrics, gbm, dismo and caret).
library(rpart)
library(randomForest)
library(Metrics)
library(gbm)
library(dismo)
library(caret)
Import the loyn data file (loyn.csv) and log10-transform area and dist.
loyn <- read.csv("../data/loyn.csv")
head(loyn,10)
loyn$logarea <- log10(loyn$area)
loyn$logdist <- log10(loyn$dist)
loyn.rpart1 <- rpart(abund~ logarea+logdist+graze+alt+yearisol, data=loyn, method="anova")
plot(predict(loyn.rpart1),residuals(loyn.rpart1))
loyn.rpart1
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 6337.9290 19.514290
  2) graze>=4.5 13 277.2892 6.292308 *
  3) graze< 4.5 43 3100.8840 23.511630
    6) logarea< 1.145017 24 1327.1980 18.308330
      12) yearisol>=1964 13 644.3108 15.138460 *
      13) yearisol< 1964 11 397.8873 22.054550 *
    7) logarea>=1.145017 19 303.1253 30.084210 *
summary(loyn.rpart1)
Call:
rpart(formula = abund ~ logarea + logdist + graze + alt + yearisol,
data = loyn, method = "anova")
n= 56
CP nsplit rel error xerror xstd
1 0.46699093 0 1.0000000 1.0320747 0.13421804
2 0.23202543 1 0.5330091 0.7706083 0.12636285
3 0.04496742 2 0.3009836 0.5479940 0.09493545
4 0.01000000 3 0.2560162 0.4129901 0.06721846
Variable importance
graze yearisol logarea alt logdist
37 29 20 8 5
Node number 1: 56 observations, complexity param=0.4669909
mean=19.51429, MSE=113.1773
left son=2 (13 obs) right son=3 (43 obs)
Primary splits:
graze < 4.5 to the right, improve=0.46699090, (0 missing)
logarea < 1.020696 to the left, improve=0.45988770, (0 missing)
yearisol < 1924.5 to the left, improve=0.38557730, (0 missing)
alt < 162.5 to the left, improve=0.21282890, (0 missing)
logdist < 2.594916 to the left, improve=0.05050402, (0 missing)
Surrogate splits:
yearisol < 1924.5 to the left, agree=0.964, adj=0.846, (0 split)
alt < 95 to the left, agree=0.821, adj=0.231, (0 split)
logarea < 0.150515 to the left, agree=0.804, adj=0.154, (0 split)
Node number 2: 13 observations
mean=6.292308, MSE=21.32994
Node number 3: 43 observations, complexity param=0.2320254
mean=23.51163, MSE=72.11359
left son=6 (24 obs) right son=7 (19 obs)
Primary splits:
logarea < 1.145017 to the left, improve=0.47423910, (0 missing)
graze < 1.5 to the right, improve=0.15699760, (0 missing)
alt < 162.5 to the left, improve=0.08926499, (0 missing)
yearisol < 1964.5 to the right, improve=0.06537171, (0 missing)
logdist < 1.596562 to the left, improve=0.06466948, (0 missing)
Surrogate splits:
graze < 1.5 to the right, agree=0.767, adj=0.474, (0 split)
logdist < 2.391258 to the left, agree=0.698, adj=0.316, (0 split)
yearisol < 1973.5 to the left, agree=0.605, adj=0.105, (0 split)
Node number 6: 24 observations, complexity param=0.04496742
mean=18.30833, MSE=55.29993
left son=12 (13 obs) right son=13 (11 obs)
Primary splits:
yearisol < 1964 to the right, improve=0.2147383000, (0 missing)
alt < 165 to the left, improve=0.1357326000, (0 missing)
logarea < 0.5395906 to the left, improve=0.1179568000, (0 missing)
logdist < 2.144102 to the right, improve=0.0133225500, (0 missing)
graze < 2.5 to the left, improve=0.0002803762, (0 missing)
Surrogate splits:
logarea < 0.7385606 to the left, agree=0.708, adj=0.364, (0 split)
graze < 2.5 to the left, agree=0.667, adj=0.273, (0 split)
alt < 115 to the right, agree=0.667, adj=0.273, (0 split)
logdist < 2.144102 to the right, agree=0.625, adj=0.182, (0 split)
Node number 7: 19 observations
mean=30.08421, MSE=15.95396
Node number 12: 13 observations
mean=15.13846, MSE=49.56237
Node number 13: 11 observations
mean=22.05455, MSE=36.17157
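The CP and rel error columns in the summary can be recovered from the node deviances in the tree printout. The following is a minimal sketch using numbers transcribed from the loyn.rpart1 output above; the object names are illustrative.

```r
# Deviances (within-node SS) transcribed from the loyn.rpart1 printout.
root_dev     <- 6337.9290
terminal_dev <- c(node2 = 277.2892, node12 = 644.3108,
                  node13 = 397.8873, node7 = 303.1253)

# Relative error of the three-split tree: residual SS / total SS.
rel_error <- sum(terminal_dev) / root_dev
r_squared <- 1 - rel_error

# CP of each split: SS removed by the split, scaled by the root SS.
cp_split <- c(
  graze    = root_dev  - (277.2892 + 3100.8840),
  logarea  = 3100.8840 - (1327.1980 + 303.1253),
  yearisol = 1327.1980 - (644.3108 + 397.8873)
) / root_dev

round(c(rel_error = rel_error, r_squared = r_squared), 4)
round(cp_split, 4)
```

These values agree with the CP table to rounding. Note that rel error is calculated on the training data; xerror is the cross-validated equivalent and changes from run to run because the folds are chosen at random.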
plotcp(loyn.rpart1)
plot(loyn.rpart1)
text(loyn.rpart1, use.n=TRUE, all=TRUE)
varImp(loyn.rpart1)
loyn.rpart2 <- rpart(abund~ logarea+logdist+graze+alt+yearisol, data=loyn, method="anova", cp=0.005)
plot(predict(loyn.rpart2),residuals(loyn.rpart2))
summary(loyn.rpart2)
Call:
rpart(formula = abund ~ logarea + logdist + graze + alt + yearisol,
data = loyn, method = "anova", cp = 0.005)
n= 56
CP nsplit rel error xerror xstd
1 0.46699093 0 1.0000000 1.0301639 0.13181018
2 0.23202543 1 0.5330091 0.7220963 0.10566076
3 0.04496742 2 0.3009836 0.4215349 0.07673234
4 0.00500000 3 0.2560162 0.3481076 0.06449313
Variable importance
graze yearisol logarea alt logdist
37 29 20 8 5
Node number 1: 56 observations, complexity param=0.4669909
mean=19.51429, MSE=113.1773
left son=2 (13 obs) right son=3 (43 obs)
Primary splits:
graze < 4.5 to the right, improve=0.46699090, (0 missing)
logarea < 1.020696 to the left, improve=0.45988770, (0 missing)
yearisol < 1924.5 to the left, improve=0.38557730, (0 missing)
alt < 162.5 to the left, improve=0.21282890, (0 missing)
logdist < 2.594916 to the left, improve=0.05050402, (0 missing)
Surrogate splits:
yearisol < 1924.5 to the left, agree=0.964, adj=0.846, (0 split)
alt < 95 to the left, agree=0.821, adj=0.231, (0 split)
logarea < 0.150515 to the left, agree=0.804, adj=0.154, (0 split)
Node number 2: 13 observations
mean=6.292308, MSE=21.32994
Node number 3: 43 observations, complexity param=0.2320254
mean=23.51163, MSE=72.11359
left son=6 (24 obs) right son=7 (19 obs)
Primary splits:
logarea < 1.145017 to the left, improve=0.47423910, (0 missing)
graze < 1.5 to the right, improve=0.15699760, (0 missing)
alt < 162.5 to the left, improve=0.08926499, (0 missing)
yearisol < 1964.5 to the right, improve=0.06537171, (0 missing)
logdist < 1.596562 to the left, improve=0.06466948, (0 missing)
Surrogate splits:
graze < 1.5 to the right, agree=0.767, adj=0.474, (0 split)
logdist < 2.391258 to the left, agree=0.698, adj=0.316, (0 split)
yearisol < 1973.5 to the left, agree=0.605, adj=0.105, (0 split)
Node number 6: 24 observations, complexity param=0.04496742
mean=18.30833, MSE=55.29993
left son=12 (13 obs) right son=13 (11 obs)
Primary splits:
yearisol < 1964 to the right, improve=0.2147383000, (0 missing)
alt < 165 to the left, improve=0.1357326000, (0 missing)
logarea < 0.5395906 to the left, improve=0.1179568000, (0 missing)
logdist < 2.144102 to the right, improve=0.0133225500, (0 missing)
graze < 2.5 to the left, improve=0.0002803762, (0 missing)
Surrogate splits:
logarea < 0.7385606 to the left, agree=0.708, adj=0.364, (0 split)
graze < 2.5 to the left, agree=0.667, adj=0.273, (0 split)
alt < 115 to the right, agree=0.667, adj=0.273, (0 split)
logdist < 2.144102 to the right, agree=0.625, adj=0.182, (0 split)
Node number 7: 19 observations
mean=30.08421, MSE=15.95396
Node number 12: 13 observations
mean=15.13846, MSE=49.56237
Node number 13: 11 observations
mean=22.05455, MSE=36.17157
print(loyn.rpart2)
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 6337.9290 19.514290
  2) graze>=4.5 13 277.2892 6.292308 *
  3) graze< 4.5 43 3100.8840 23.511630
    6) logarea< 1.145017 24 1327.1980 18.308330
      12) yearisol>=1964 13 644.3108 15.138460 *
      13) yearisol< 1964 11 397.8873 22.054550 *
    7) logarea>=1.145017 19 303.1253 30.084210 *
printcp(loyn.rpart2)
Regression tree:
rpart(formula = abund ~ logarea + logdist + graze + alt + yearisol,
data = loyn, method = "anova", cp = 0.005)
Variables actually used in tree construction:
[1] graze logarea yearisol
Root node error: 6337.9/56 = 113.18
n= 56
CP nsplit rel error xerror xstd
1 0.466991 0 1.00000 1.03016 0.131810
2 0.232025 1 0.53301 0.72210 0.105661
3 0.044967 2 0.30098 0.42153 0.076732
4 0.005000 3 0.25602 0.34811 0.064493
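One common way to read a CP table is the one-SE rule: choose the smallest tree whose cross-validated error is within one standard error of the minimum. The source does not state that this rule was used; the sketch below simply applies it to the table transcribed from printcp(loyn.rpart2), and the object names are illustrative.

```r
# CP table transcribed from printcp(loyn.rpart2).
cptab <- data.frame(
  CP     = c(0.466991, 0.232025, 0.044967, 0.005000),
  nsplit = 0:3,
  xerror = c(1.03016, 0.72210, 0.42153, 0.34811),
  xstd   = c(0.131810, 0.105661, 0.076732, 0.064493)
)

# One-SE rule: smallest tree whose xerror is within one SE of the minimum.
best      <- which.min(cptab$xerror)
threshold <- cptab$xerror[best] + cptab$xstd[best]
chosen    <- cptab[min(which(cptab$xerror <= threshold)), ]
chosen
```

Because xerror depends on the random cross-validation folds, a different run could move the threshold enough to favour a smaller tree.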
plot(loyn.rpart2)
text(loyn.rpart2, use.n=TRUE, all=TRUE)
loyn.rpart3 <- rpart(abund~ area+dist+graze+alt+yearisol, data=loyn, method="anova")
plot(predict(loyn.rpart3),residuals(loyn.rpart3))
summary(loyn.rpart3)
Call:
rpart(formula = abund ~ area + dist + graze + alt + yearisol,
data = loyn, method = "anova")
n= 56
CP nsplit rel error xerror xstd
1 0.46699093 0 1.0000000 1.0367970 0.13348951
2 0.23202543 1 0.5330091 0.8387713 0.12886575
3 0.04496742 2 0.3009836 0.4665100 0.08564323
4 0.01000000 3 0.2560162 0.4591197 0.08497119
Variable importance
graze yearisol area alt dist
37 29 20 8 5
Node number 1: 56 observations, complexity param=0.4669909
mean=19.51429, MSE=113.1773
left son=2 (13 obs) right son=3 (43 obs)
Primary splits:
graze < 4.5 to the right, improve=0.46699090, (0 missing)
area < 10.5 to the left, improve=0.45988770, (0 missing)
yearisol < 1924.5 to the left, improve=0.38557730, (0 missing)
alt < 162.5 to the left, improve=0.21282890, (0 missing)
dist < 393.5 to the left, improve=0.05050402, (0 missing)
Surrogate splits:
yearisol < 1924.5 to the left, agree=0.964, adj=0.846, (0 split)
alt < 95 to the left, agree=0.821, adj=0.231, (0 split)
area < 1.5 to the left, agree=0.804, adj=0.154, (0 split)
Node number 2: 13 observations
mean=6.292308, MSE=21.32994
Node number 3: 43 observations, complexity param=0.2320254
mean=23.51163, MSE=72.11359
left son=6 (24 obs) right son=7 (19 obs)
Primary splits:
area < 14 to the left, improve=0.47423910, (0 missing)
graze < 1.5 to the right, improve=0.15699760, (0 missing)
alt < 162.5 to the left, improve=0.08926499, (0 missing)
yearisol < 1964.5 to the right, improve=0.06537171, (0 missing)
dist < 39.5 to the left, improve=0.06466948, (0 missing)
Surrogate splits:
graze < 1.5 to the right, agree=0.767, adj=0.474, (0 split)
dist < 246.5 to the left, agree=0.698, adj=0.316, (0 split)
yearisol < 1973.5 to the left, agree=0.605, adj=0.105, (0 split)
Node number 6: 24 observations, complexity param=0.04496742
mean=18.30833, MSE=55.29993
left son=12 (13 obs) right son=13 (11 obs)
Primary splits:
yearisol < 1964 to the right, improve=0.2147383000, (0 missing)
alt < 165 to the left, improve=0.1357326000, (0 missing)
area < 3.5 to the left, improve=0.1179568000, (0 missing)
dist < 139.5 to the right, improve=0.0133225500, (0 missing)
graze < 2.5 to the left, improve=0.0002803762, (0 missing)
Surrogate splits:
area < 5.5 to the left, agree=0.708, adj=0.364, (0 split)
graze < 2.5 to the left, agree=0.667, adj=0.273, (0 split)
alt < 115 to the right, agree=0.667, adj=0.273, (0 split)
dist < 139.5 to the right, agree=0.625, adj=0.182, (0 split)
Node number 7: 19 observations
mean=30.08421, MSE=15.95396
Node number 12: 13 observations
mean=15.13846, MSE=49.56237
Node number 13: 11 observations
mean=22.05455, MSE=36.17157
print(loyn.rpart3)
n= 56
node), split, n, deviance, yval
* denotes terminal node
1) root 56 6337.9290 19.514290
  2) graze>=4.5 13 277.2892 6.292308 *
  3) graze< 4.5 43 3100.8840 23.511630
    6) area< 14 24 1327.1980 18.308330
      12) yearisol>=1964 13 644.3108 15.138460 *
      13) yearisol< 1964 11 397.8873 22.054550 *
    7) area>=14 19 303.1253 30.084210 *
loyn.rpart3$variable.importance
graze yearisol area alt dist
3734.0638 2944.2044 2029.5440 760.7478 516.2058
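The raw importance scores above and the percentages in the summary output are the same information on different scales. A short sketch, using the values transcribed from loyn.rpart3$variable.importance, rescales them to sum to 100:

```r
# Raw variable importance transcribed from loyn.rpart3$variable.importance.
imp <- c(graze = 3734.0638, yearisol = 2944.2044, area = 2029.5440,
         alt = 760.7478, dist = 516.2058)

# Rescale to percentages of the total and round to whole numbers.
pct <- round(100 * imp / sum(imp))
pct
```

The result matches the "Variable importance" line of the summary. Predictors that never appear as a primary split (alt and dist here) still receive some importance through their role as surrogate splits.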
plot(loyn.rpart3)
text(loyn.rpart3, use.n=TRUE, all=TRUE)
loyn.bag <- randomForest(abund~ logarea+logdist+graze+alt+yearisol, mtry=5, data=loyn)
print(loyn.bag)
Call:
randomForest(formula = abund ~ logarea + logdist + graze + alt + yearisol, data = loyn, mtry = 5)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 44.975
% Var explained: 60.26
plot(loyn.bag)
loyn.forest <- randomForest(abund~ logarea+logdist+graze+alt+yearisol, mtry=2, data=loyn)
print(loyn.forest)
Call:
randomForest(formula = abund ~ logarea + logdist + graze + alt + yearisol, data = loyn, mtry = 2)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 46.64511
% Var explained: 58.79
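The two fits above differ only in mtry: with mtry=5 every predictor is considered at every split, which is bagging, whereas mtry=2 gives a random forest. The "% Var explained" values can be reproduced from the out-of-bag mean squared residuals and the total variance of abundance (the root-node MSE from the rpart summaries). This sketch uses only numbers transcribed from the output; the object names are illustrative.

```r
# Total variance of abund (SS / n), i.e. the root-node MSE in the rpart output.
root_mse <- 113.1773

# Out-of-bag mean of squared residuals from the two randomForest fits.
oob_mse <- c(bagging = 44.975, forest = 46.64511)

# Pseudo r-squared, expressed as a percentage.
pct_var <- 100 * (1 - oob_mse / root_mse)
round(pct_var, 2)
```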
p1<-partialPlot(loyn.forest,loyn, x.var="logarea", plot=FALSE)
plot(p1)
loyn.rpart1a <- rpart(abund~ logarea+logdist+graze+alt+yearisol, data=loyn, method="anova", cp=0)
summary(loyn.rpart1a)
Call:
rpart(formula = abund ~ logarea + logdist + graze + alt + yearisol,
data = loyn, method = "anova", cp = 0)
n= 56
CP nsplit rel error xerror xstd
1 0.46699093 0 1.0000000 1.0825597 0.14255613
2 0.23202543 1 0.5330091 0.8500160 0.11139568
3 0.04496742 2 0.3009836 0.4513146 0.08382967
4 0.00000000 3 0.2560162 0.3856313 0.07389802
Variable importance
graze yearisol logarea alt logdist
37 29 20 8 5
Node number 1: 56 observations, complexity param=0.4669909
mean=19.51429, MSE=113.1773
left son=2 (13 obs) right son=3 (43 obs)
Primary splits:
graze < 4.5 to the right, improve=0.46699090, (0 missing)
logarea < 1.020696 to the left, improve=0.45988770, (0 missing)
yearisol < 1924.5 to the left, improve=0.38557730, (0 missing)
alt < 162.5 to the left, improve=0.21282890, (0 missing)
logdist < 2.594916 to the left, improve=0.05050402, (0 missing)
Surrogate splits:
yearisol < 1924.5 to the left, agree=0.964, adj=0.846, (0 split)
alt < 95 to the left, agree=0.821, adj=0.231, (0 split)
logarea < 0.150515 to the left, agree=0.804, adj=0.154, (0 split)
Node number 2: 13 observations
mean=6.292308, MSE=21.32994
Node number 3: 43 observations, complexity param=0.2320254
mean=23.51163, MSE=72.11359
left son=6 (24 obs) right son=7 (19 obs)
Primary splits:
logarea < 1.145017 to the left, improve=0.47423910, (0 missing)
graze < 1.5 to the right, improve=0.15699760, (0 missing)
alt < 162.5 to the left, improve=0.08926499, (0 missing)
yearisol < 1964.5 to the right, improve=0.06537171, (0 missing)
logdist < 1.596562 to the left, improve=0.06466948, (0 missing)
Surrogate splits:
graze < 1.5 to the right, agree=0.767, adj=0.474, (0 split)
logdist < 2.391258 to the left, agree=0.698, adj=0.316, (0 split)
yearisol < 1973.5 to the left, agree=0.605, adj=0.105, (0 split)
Node number 6: 24 observations, complexity param=0.04496742
mean=18.30833, MSE=55.29993
left son=12 (13 obs) right son=13 (11 obs)
Primary splits:
yearisol < 1964 to the right, improve=0.2147383000, (0 missing)
alt < 165 to the left, improve=0.1357326000, (0 missing)
logarea < 0.5395906 to the left, improve=0.1179568000, (0 missing)
logdist < 2.144102 to the right, improve=0.0133225500, (0 missing)
graze < 2.5 to the left, improve=0.0002803762, (0 missing)
Surrogate splits:
logarea < 0.7385606 to the left, agree=0.708, adj=0.364, (0 split)
graze < 2.5 to the left, agree=0.667, adj=0.273, (0 split)
alt < 115 to the right, agree=0.667, adj=0.273, (0 split)
logdist < 2.144102 to the right, agree=0.625, adj=0.182, (0 split)
Node number 7: 19 observations
mean=30.08421, MSE=15.95396
Node number 12: 13 observations
mean=15.13846, MSE=49.56237
Node number 13: 11 observations
mean=22.05455, MSE=36.17157
loyn.rpart1a.predict <- xpred.rpart(loyn.rpart1a)
rmsqe1a <- rmse(loyn$abund, loyn.rpart1a.predict)
rmsqe1a
[1] 8.70802
loyn.rpart1.predict <- xpred.rpart(loyn.rpart1)
rmsqe1 <- rmse(loyn$abund, loyn.rpart1.predict)
rmsqe1
[1] 9.409801
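xpred.rpart returns a matrix with one row per observation and one column per cp value, so a single rmse() call on the whole matrix appears to pool the squared errors across every tree size, including the root-only tree. The sketch below uses a small hypothetical matrix (not the Loyn predictions) to show the difference between a pooled RMSE and one RMSE per column.

```r
# Toy stand-ins: y is the observed response and pred is shaped like an
# xpred.rpart() result (rows = observations, columns = cp values).
# All values are hypothetical.
y    <- c(1, 2, 3, 4)
pred <- cbind(cp_large = rep(mean(y), 4),        # root-only tree predicts the mean
              cp_small = y + c(1, -1, 1, -1))    # a larger tree

# RMSE for each column, i.e. one value per cp.
rmse_by_cp <- sqrt(colMeans((pred - y)^2))

# RMSE pooled over every cell of the matrix.
rmse_pooled <- sqrt(mean((y - pred)^2))

rmse_by_cp
rmse_pooled
```

Applied to the real objects, `sqrt(colMeans((loyn.rpart1.predict - loyn$abund)^2))` would give the cross-validated RMSE separately for each cp value, and the last column would correspond to the full tree.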
loyn.gbm <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.01, tree.complexity=2)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 89.3083
now adding trees...
100 72.9216
150 63.0143
200 57.4608
250 54.4914
300 53.0836
350 52.3106
400 52.2914
450 52.3397
500 52.211
550 52.2842
600 52.4289
650 52.5697
700 52.488
750 52.4548
800 53.1704
850 53.2053
900 53.1251
950 53.504
1000 53.747
1050 53.8292
1100 53.8188
1150 54.0263
mean total deviance = 113.177
mean residual deviance = 29.257
estimated cv deviance = 52.211 ; se = 6.729
training data correlation = 0.865
cv correlation = 0.714 ; se = 0.059
elapsed time - 0.01 minutes
print(loyn.gbm)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
500 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
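gbm.step adds trees in batches of 50 and reports the cross-validated deviance after each batch; the final model uses the number of trees at which that deviance is smallest. The sketch below reproduces that choice from the trace transcribed from the loyn.gbm output; the object names are illustrative.

```r
# Cross-validated deviance trace transcribed from the gbm.step output for loyn.gbm.
ntrees <- seq(50, 1150, by = 50)
cv_dev <- c(89.3083, 72.9216, 63.0143, 57.4608, 54.4914, 53.0836, 52.3106,
            52.2914, 52.3397, 52.2110, 52.2842, 52.4289, 52.5697, 52.4880,
            52.4548, 53.1704, 53.2053, 53.1251, 53.5040, 53.7470, 53.8292,
            53.8188, 54.0263)
stopifnot(length(ntrees) == length(cv_dev))

# Number of trees with the smallest cross-validated deviance.
best_ntrees <- ntrees[which.min(cv_dev)]

# For a gaussian family the deviance is a mean squared error, so its
# square root is a cross-validated RMSE.
cv_rmse <- sqrt(min(cv_dev))

best_ntrees
cv_rmse
```

The minimum falls at 500 trees, which matches the "500 iterations were performed" line printed for loyn.gbm.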
loyn.gbm1 <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.005, tree.complexity=2)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 98.2031
now adding trees...
100 85.2911
150 76.154
200 69.2657
250 64.0741
300 60.2071
350 56.9562
400 54.6287
450 52.802
500 51.5601
550 50.3817
600 49.6615
650 49.2383
700 48.9079
750 48.6573
800 48.3074
850 48.0518
900 47.9464
950 47.8624
1000 47.8151
1050 47.601
1100 47.3847
1150 47.2743
1200 47.1666
1250 47.234
1300 47.163
1350 47.0592
1400 47.0506
1450 47.0226
1500 47.1315
1550 46.9732
1600 47.0785
1650 47.1399
1700 47.1056
1750 47.0855
1800 47.0324
1850 47.0915
1900 47.086
1950 46.9991
2000 47.068
2050 46.9666
mean total deviance = 113.177
mean residual deviance = 24.499
estimated cv deviance = 46.967 ; se = 11.725
training data correlation = 0.886
cv correlation = 0.725 ; se = 0.083
elapsed time - 0.01 minutes
print(loyn.gbm1)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
2050 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
loyn.gbm1 is the chosen model; calculate its cross-validated RMSE.
rmsqe2 <- sqrt(min(loyn.gbm1$cv.values))
rmsqe2
[1] 6.853216
gbm.plot(loyn.gbm1, variable.no=0)
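The two gbm.step fits above differ only in learning.rate, and the slower rate needed many more trees (2050 rather than 500). A minimal base-R sketch of boosting with single-split trees shows why: each tree is fitted to the current residuals and only a fraction (the shrinkage) of its prediction is added. This is a simplified illustration on simulated data, not the algorithm as implemented in gbm (there is no bag fraction or cross-validation here), and every name in it is hypothetical.

```r
# Toy data: a smooth signal plus noise (simulated, for illustration only).
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.2)

# Fit a single-split regression tree (stump) to residuals r by least squares.
fit_stump <- function(x, r) {
  ux   <- sort(unique(x))
  cuts <- (head(ux, -1) + tail(ux, -1)) / 2   # midpoints between neighbours
  sse  <- sapply(cuts, function(cut) {
    left <- r[x < cut]; right <- r[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x < best]), right = mean(r[x >= best]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

# Boost: start from the mean, then repeatedly fit a stump to the residuals
# and add a shrunken version of its prediction.
boost <- function(x, y, shrinkage, n_trees) {
  pred      <- rep(mean(y), length(y))
  train_mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    s    <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * predict_stump(s, x)
    train_mse[m] <- mean((y - pred)^2)
  }
  train_mse
}

mse_fast <- boost(x, y, shrinkage = 0.10, n_trees = 200)
mse_slow <- boost(x, y, shrinkage = 0.05, n_trees = 200)

# Training MSE after 50 trees under each learning rate.
c(fast = mse_fast[50], slow = mse_slow[50])
```

Training error can only fall as trees are added, which is why gbm.step judges the number of trees by cross-validated deviance instead.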
Try a simpler tree (tree.complexity=1).
loyn.gbm2 <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.005, tree.complexity=1)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 99.5682
now adding trees...
100 86.426
150 76.865
200 69.142
250 63.7006
300 59.5133
350 56.3968
400 53.8642
450 52.0525
500 50.8655
550 49.9187
600 49.0016
650 48.3989
700 47.8276
750 47.5905
800 47.4264
850 47.3482
900 47.1419
950 47.2081
1000 46.9313
1050 46.6394
1100 46.6729
1150 46.6331
1200 46.6716
1250 46.6434
1300 46.5478
1350 46.6361
1400 46.7768
1450 46.7217
1500 46.7073
1550 46.7598
1600 46.9478
1650 46.9548
1700 46.9556
1750 46.8729
mean total deviance = 113.177
mean residual deviance = 27.461
estimated cv deviance = 46.548 ; se = 10.273
training data correlation = 0.872
cv correlation = 0.745 ; se = 0.083
elapsed time - 0.01 minutes
print(loyn.gbm2)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
1300 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
Try a more complex tree (tree.complexity=3).
loyn.gbm3 <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.005, tree.complexity=3)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 98.884
now adding trees...
100 86.0671
150 76.4561
200 69.0808
250 63.9834
300 59.8556
350 56.6837
400 54.6266
450 52.715
500 51.432
550 50.515
600 50.0033
650 49.3465
700 49.2181
750 49.2457
800 49.1708
850 49.2536
900 49.1821
950 49.167
1000 49.0252
1050 48.8508
1100 48.9484
1150 48.8864
1200 48.8409
1250 48.9574
1300 48.9999
1350 48.9106
1400 48.7983
1450 48.812
1500 48.761
1550 48.7211
1600 48.8229
1650 48.9358
1700 49.0386
1750 49.1653
1800 49.1264
mean total deviance = 113.177
mean residual deviance = 26.393
estimated cv deviance = 48.721 ; se = 11.369
training data correlation = 0.877
cv correlation = 0.732 ; se = 0.073
elapsed time - 0.01 minutes
print(loyn.gbm3)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
1550 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
Try a more complex tree (tree.complexity=4).
loyn.gbm4 <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.005, tree.complexity=4)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 97.7243
now adding trees...
100 85.1598
150 76.2338
200 69.122
250 63.8355
300 60.2532
350 57.1225
400 54.8718
450 53.1288
500 51.8464
550 50.7832
600 50.1007
650 49.7548
700 49.4518
750 49.1935
800 48.8942
850 49.0452
900 49.0204
950 48.9536
1000 48.93
1050 49.0093
1100 49.0488
1150 48.8286
1200 48.7518
1250 48.783
1300 48.8788
1350 48.8562
1400 48.8175
1450 48.7362
1500 48.948
1550 49.0853
1600 49.0755
1650 48.9991
1700 49.2345
mean total deviance = 113.177
mean residual deviance = 26.246
estimated cv deviance = 48.736 ; se = 10.179
training data correlation = 0.879
cv correlation = 0.788 ; se = 0.042
elapsed time - 0.01 minutes
print(loyn.gbm4)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
1450 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
Try a more complex tree (tree.complexity=5).
loyn.gbm5 <- gbm.step(gbm.y=1, gbm.x=c(3,6:9), data=loyn, family="gaussian", bag.fraction=0.5, learning.rate=0.005, tree.complexity=5)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for abund and using a family of gaussian
Using 56 observations and 5 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 113.1773
tolerance is fixed at 0.1132
ntrees resid. dev.
50 98.6754
now adding trees...
100 86.3589
150 76.8216
200 69.7461
250 64.3986
300 60.4982
350 57.4546
400 54.9405
450 52.8936
500 51.6109
550 50.6183
600 49.8874
650 49.2615
700 48.8659
750 48.6131
800 48.4315
850 48.4276
900 48.3025
950 48.2503
1000 48.1359
1050 48.1406
1100 47.8787
1150 47.8485
1200 47.7282
1250 47.7871
1300 47.7232
1350 47.8267
1400 47.9109
1450 48.0164
1500 47.8552
1550 47.8792
1600 48.0081
1650 48.0077
1700 48.186
1750 48
1800 48.0696
mean total deviance = 113.177
mean residual deviance = 27.523
estimated cv deviance = 47.723 ; se = 10.452
training data correlation = 0.872
cv correlation = 0.777 ; se = 0.04
elapsed time - 0.01 minutes
print(loyn.gbm5)
gbm::gbm(formula = y.data ~ ., distribution = as.character(family),
data = x.data, weights = site.weights, var.monotone = var.monotone,
n.trees = target.trees, interaction.depth = tree.complexity,
shrinkage = learning.rate, bag.fraction = bag.fraction, verbose = FALSE)
A gradient boosted model with gaussian loss function.
1300 iterations were performed.
There were 5 predictors of which 5 had non-zero influence.
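The five fits with learning.rate=0.005 can be compared side by side. The sketch below collects the number of trees, the estimated cv deviance and its se from the gbm.step output above and converts the deviance to an RMSE; the data frame and column names are illustrative.

```r
# Cross-validation results transcribed from the gbm.step output
# (learning.rate = 0.005 in every case).
cv_summary <- data.frame(
  model           = c("loyn.gbm2", "loyn.gbm1", "loyn.gbm3", "loyn.gbm4", "loyn.gbm5"),
  tree.complexity = 1:5,
  ntrees          = c(1300, 2050, 1550, 1450, 1300),
  cv_deviance     = c(46.548, 46.967, 48.721, 48.736, 47.723),
  cv_se           = c(10.273, 11.725, 11.369, 10.179, 10.452)
)

# Gaussian deviance is a mean squared error, so its square root is an RMSE.
cv_summary$cv_rmse <- sqrt(cv_summary$cv_deviance)

# Range of cv deviance across the five models, for comparison with the SEs.
spread <- diff(range(cv_summary$cv_deviance))

cv_summary[order(cv_summary$cv_deviance), ]
spread
```

The spread in cv deviance across tree complexities is much smaller than any of the standard errors, so these runs give little basis for preferring one complexity over another, and the ordering could change with different cross-validation folds.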
loyn.gbm6 <- gbm(abund~ logarea+logdist+graze+alt+yearisol, data=loyn, n.trees=2000, distribution="gaussian", interaction.depth=2, bag.fraction=0.5, shrinkage=0.005, cv.folds=10)
summary(loyn.gbm6)
print(loyn.gbm6)
gbm(formula = abund ~ logarea + logdist + graze + alt + yearisol,
distribution = "gaussian", data = loyn, n.trees = 2000, interaction.depth = 2,
shrinkage = 0.005, bag.fraction = 0.5, cv.folds = 10)
A gradient boosted model with gaussian loss function.
2000 iterations were performed.
The best cross-validation iteration was 1410.
There were 5 predictors of which 5 had non-zero influence.
loyn.gbm6
gbm(formula = abund ~ logarea + logdist + graze + alt + yearisol,
distribution = "gaussian", data = loyn, n.trees = 2000, interaction.depth = 2,
shrinkage = 0.005, bag.fraction = 0.5, cv.folds = 10)
A gradient boosted model with gaussian loss function.
2000 iterations were performed.
The best cross-validation iteration was 1410.
There were 5 predictors of which 5 had non-zero influence.
Draw partial dependence plots one predictor at a time. partialPlot() does not accept gbm objects, so these plots are drawn with plot.gbm(), which takes slightly different arguments. n.trees is the number of trees used to generate the plot; we left it out and used the package default.
#p2 <- partialPlot(loyn.gbm6, x.var=c("graze"), n.trees=984)
#plot(p2)
plot.gbm(loyn.gbm6, i.var=c("graze"))
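# variable.no is an argument of dismo::gbm.plot (for gbm.step fits), not plot.gbm;
# with plot.gbm, loop over the predictors using i.var instead
for (v in c("logarea", "logdist", "graze", "alt", "yearisol")) print(plot.gbm(loyn.gbm6, i.var=v))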
gbm.perf(loyn.gbm6, method="cv")
[1] 1410