Cabanellas-Reboredo et al. (2019) studied the spread of a disease in a large bivalve (Pinna nobilis) in the Mediterranean caused by a protozoan endoparasite. They collated observations of dead or unwell bivalves from many sites using information from scientific surveys and citizen science contributions. They only used observations from sites that their dispersal models indicated the disease could have spread to. They focused on relating the presence of the disease at a site to salinity and temperature.
You can find a nice picture of Pinna nobilis here, but to get an idea what these bivalves look like, hereâs the smaller Pinna bicolor from southern Australia, which âonlyâ grows to 45 or 50 cm.
The original paper is here. The supplementary data in the paper is a superset of that used for the GAM. The data can be obtained from the paper, but Dr Cabanellas-Reboredo has kindly provided the script to create the right subset, and the easiest option is to use the version here.
Cabanellas-Reboredo, M., et al. (2019). Tracking a mass mortality outbreak of pen shell Pinna nobilis populations: A collaborative effort of scientists and citizens. Scientific Reports, 9, 13355.
First, load the required packages (mgcv, gratia, statmod, lmtest, car, performance)
Import cabanellas data file (cabanellas.csv)
cabanellas <- read.csv("../data/cabanellas.csv")
head(cabanellas)
Note: there are two ways to get these data. From the original paper, the data are available as a text file of all 421 observations (Table 1). The analysis used a subset of the observations from sites of potential parasite infection, excluding sites outside the current range. If you work from this full file, Dr Miguel Cabanellas Rebredo has kindly provided the script to select the required 271 observations. It is below as a code chunk that is not evaluated. Youâll need to modify the markdown to run this chunk.
# Import Parasite Data
df=read.table("data/PinnaParasite.txt", header=TRUE, sep=" ",
na.strings = "NA", check.names = FALSE) #data frame with all 421 records
# Subset the records to fit the GAM
df.gam <- subset(df, df$GAM=="GAM")
# This subset respond to records where the parasite was detected (Parasite==1) and #
# records where the parasite was no detected (Parasite==0), but given the expasion #
# of the parasite, such regions should be infected #
#Change comma per dot
df.gam$Lat=as.numeric(gsub(",", ".", df.gam$Lat))
df.gam$Lon=as.numeric(gsub(",", ".", df.gam$Lon))
df.gam$Sdepth=as.numeric(gsub(",", ".", df.gam$Sdepth))
df.gam$Stemp=as.numeric(gsub(",", ".", df.gam$Stemp))
df.gam$Ss=as.numeric(gsub(",", ".", df.gam$Ss))
cabanellas<-df
Do simple plots for cont predictors
plot(parasite~stemp, data=cabanellas)
plot(parasite~ssal, data=cabanellas)
Check boxplots for continuous predictors
boxplot(cabanellas$stemp~cabanellas$parasite)
boxplot(cabanellas$ssal~cabanellas$parasite)
cabanellas.glm <- glm(parasite ~ stemp+ssal,data=cabanellas,family=binomial)
summary(cabanellas.glm)
Call:
glm(formula = parasite ~ stemp + ssal, family = binomial, data = cabanellas)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -33.6417 11.0805 -3.04 0.0024 **
stemp 0.2671 0.0488 5.47 4.4e-08 ***
ssal 0.8069 0.2850 2.83 0.0046 **
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 223.30 on 270 degrees of freedom
Residual deviance: 182.28 on 268 degrees of freedom
AIC: 188.3
Number of Fisher Scoring iterations: 5
AIC(cabanellas.glm)
[1] 188
Do LR test for each predictor
First salinity
cabanellas1.glm <- glm(parasite ~ stemp,data=cabanellas,family=binomial)
lrtest(cabanellas.glm, cabanellas1.glm)
Likelihood ratio test
Model 1: parasite ~ stemp + ssal
Model 2: parasite ~ stemp
#Df LogLik Df Chisq Pr(>Chisq)
1 3 -91.1
2 2 -95.4 -1 8.6 0.0034 **
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
now temp
cabanellas2.glm <- glm(parasite ~ ssal,data=cabanellas,family=binomial)
lrtest(cabanellas.glm, cabanellas2.glm)
Likelihood ratio test
Model 1: parasite ~ stemp + ssal
Model 2: parasite ~ ssal
#Df LogLik Df Chisq Pr(>Chisq)
1 3 -91.1
2 2 -110.3 -1 38.3 6e-10 ***
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
Check quantile residuals
qcabanellas <- qresid(cabanellas.glm)
qcabanellas
[1] -0.40313 -0.30422 0.33180 0.09460 -0.63911 1.25151 1.20434 -1.14281 -0.42991 -0.49495 -0.25921 1.26833 0.47465 -0.85709 -0.10641 0.96244 0.44227 -0.88051 0.44007
[20] 0.25698 -0.07261 0.16749 2.42897 0.03253 -1.11428 1.47830 0.47139 0.70403 -0.78479 1.06543 0.28420 1.03370 0.39925 0.54743 -0.46757 -0.26124 0.60700 -0.66616
[39] -0.11568 -0.22926 0.34336 0.27991 -0.10402 -0.66445 0.76617 -0.00792 -0.58623 0.51479 -1.27058 -0.99945 -0.37564 -0.37721 0.30051 -1.26232 -1.04184 -0.70837 1.29316
[58] 0.26248 1.02381 0.73451 0.41744 0.50461 1.01896 -0.20594 0.72770 1.37634 -0.03643 -1.23103 1.39274 -0.29697 1.60184 0.49233 0.08120 -0.03616 0.52278 -1.76860
[77] -0.10484 1.99746 0.47434 -1.44314 -0.86578 0.28645 -0.83325 0.57586 1.81449 0.18323 -0.43856 0.76490 -0.92196 0.59575 -0.75375 1.42132 -0.72030 -0.94219 0.09922
[96] -0.34198 -0.49154 0.65635 0.85633 0.73874 -0.19887 0.40279 -0.28002 -0.04166 0.52125 1.60290 1.09920 -0.14610 0.39050 -0.25452 -0.05895 0.79965 -1.15344 0.64385
[115] -0.01396 -0.95448 0.64671 -0.40076 0.16948 0.30203 -0.99727 -0.05458 0.34812 0.38298 0.37694 -0.63495 -0.13959 -0.10757 -0.38948 -0.28518 -0.67903 1.05571 0.55265
[134] -0.94333 0.37362 1.02365 0.10154 0.50238 1.32671 0.20886 1.56888 1.21026 0.78380 0.12766 -0.46803 0.14113 0.13364 0.94233 0.57344 -0.19797 0.35259 0.31483
[153] -0.00461 0.41535 1.93694 -0.14246 0.81996 -0.12780 0.54048 1.09854 0.93747 -0.88863 -0.45590 0.35058 0.98939 0.30907 0.26463 -0.01065 1.48457 0.16849 -0.03581
[172] -0.88966 0.66349 -0.36578 0.16750 0.47043 -1.43671 -0.48716 1.21732 -0.18921 2.21199 0.59037 -0.09174 -0.82590 0.68059 1.21202 0.73081 0.12415 0.18358 -0.27112
[191] 0.42802 -0.79156 0.31385 1.05118 0.37230 -0.33336 0.80380 0.52988 -0.12962 -0.04303 0.25123 0.95101 1.23303 -1.46902 -0.90545 -0.82741 -0.50803 -0.81576 -1.74895
[210] 0.51808 0.47626 0.38197 1.14403 -0.68678 -0.37021 -0.77463 0.18357 0.55144 1.00319 -0.21452 0.73689 1.29817 -0.17753 1.34874 -0.00754 0.83893 1.75044 -0.31308
[229] -0.43434 1.26617 0.54181 -0.37159 -0.41105 -0.53745 -0.68492 -0.55139 -0.56934 -0.70354 -0.56415 -0.85456 -1.26616 -1.28318 -1.74176 -0.80577 -0.40009 -1.50215 -1.08400
[248] -0.48690 -0.62087 -1.26891 -1.39338 -2.94547 -2.02620 -1.88745 -1.55250 -2.86287 -1.80083 -1.36152 -0.61854 -1.48299 -0.32591 -0.26022 -0.37531 -0.98227 0.03610 -0.39792
[267] -3.83660 -3.91358 -2.12489 -0.89430 -1.07294
plot(qcabanellas~cabanellas.glm$fitted.values)
Check deviance residuals
residualPlots(cabanellas.glm, type="deviance")
Test stat Pr(>|Test stat|)
stemp 36.9 1.2e-09 ***
ssal 83.4 < 2e-16 ***
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
Check for overdispersion
c(deviance(cabanellas.glm), df.residual(cabanellas.glm))
[1] 182 268
Check collinearity
vif(lm(parasite ~ stemp+ssal,data=cabanellas))
stemp ssal
1.02 1.02
cor(cabanellas$stemp,cabanellas$ssal)
[1] -0.136
get H-L test and tjur r2
performance_hosmer(cabanellas.glm, n_bins=10)
# Hosmer-Lemeshow Goodness-of-Fit Test
Chi-squared: 32.836
df: 8
p-value: 0.000
Summary: model does not fit well.
r2_tjur(cabanellas.glm)
Tjur's R2
0.23
Plot predicted values
plot(cabanellas.glm$fitted.values~cabanellas$stemp)
plot(cabanellas.glm$fitted.values~cabanellas$ssal)
cabanellas1.gam <- gam(parasite ~ s(stemp)+s(ssal),data=cabanellas,family=binomial(link=logit))
summary(cabanellas1.gam)
Family: binomial
Link function: logit
Formula:
parasite ~ s(stemp) + s(ssal)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.67 0.50 5.34 9.2e-08 ***
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(stemp) 4.90 5.88 18.1 0.00477 **
s(ssal) 2.91 3.59 22.6 0.00076 ***
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
R-sq.(adj) = 0.755 Deviance explained = 71.1%
UBRE = -0.69724 Scale est. = 1 n = 271
AIC(cabanellas1.gam)
[1] 82
Check diagnostics although residuals difficult to interpret with binary data
(quantile residuals not available in mcgv) edf not too close to k although index is < 1
gam.check(cabanellas1.gam, type="deviance")
Method: UBRE Optimizer: outer newton
full convergence after 5 iterations.
Gradient range [-2.82e-09,3.25e-09]
(score -0.697 & scale 1).
Hessian positive definite, eigenvalue range [0.00286,0.00523].
Model rank = 19 / 19
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(stemp) 9.00 4.90 0.82 0.010 **
s(ssal) 9.00 2.91 0.82 0.055 .
---
Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
qq-plot looks OK
appraise(cabanellas1.gam, type="deviance")
Plot smoother
plot(cabanellas1.gam, resid=TRUE, shade=TRUE, cex=0.5, pch=1)
Use gratia to get nicer plots
source("../R/appearance.R")
p<-draw(cabanellas1.gam, select=c(1,2), residuals=FALSE, ci_level=0.95,rug=FALSE)&
theme_qk() &
theme(
title=element_blank()
)
p
cabanellas2.gam <- gam(parasite ~ s(stemp, bs="cr")+s(ssal, bs="cr"),data=cabanellas,family=binomial(link=logit))
summary(cabanellas2.gam)
AIC(cabanellas2.gam)
Check diagnostics although residuals difficult to interpret with binary data (quantile residuals not available in mcgv)
gam.check(cabanellas2.gam, type="deviance")
appraise(cabanellas2.gam, type="deviance")
Plot smoother
plot(cabanellas2.gam, resid=TRUE, shade=TRUE, cex=0.5, pch=1)
draw(cabanellas2.gam, select=c(1,2), residuals=TRUE, ci_level=0.95,rug=TRUE)
Try visreg, as used by Cabanellas
library(visreg)
library(ggplot2)
plotStemp=visreg(cabanellas1.gam, "stemp", rug=0, xlab="Temperature (ÂșC)", ylab="Partial Effect",
gg=TRUE, line=list(col=lc),
fill=list(fill="grey80", alpha=0.5),
points=list(size=1, pch=1, col=lc)) + theme_qk()+
scale_x_continuous(breaks=seq(8,30,by=2), expand = c(0.01, 0.01))
plotStemp
p1<-plotStemp+ theme_qk()
#--------
plotSs=visreg(cabanellas1.gam, "ssal", rug=0, xlab="Salinity (psu)", ylab="Partial Effect",
gg=TRUE, line=list(col=lc),
fill=list(fill="grey80", alpha=0.5),
points=list(size=1, pch=1, col=lc))+
scale_x_continuous(breaks=seq(35,43,by=1), expand = c(0.01, 0.01))
plotSs
p2<-plotSs + theme_qk()
p1
p2
p1+p2