These exercises will push you a bit to fit GLMs to various kinds of data. You’ll need to pay attention to the nature of the predictors and think about the response variables and their likely distributions.
Uefune et al. (2020) studied the use of synthetic herbivory-induced plant volatiles (HIPVs) to attract larval parasitoid wasps (Plutella xylostella) to control diamondback moths (DBM: Cotesia vestalis), a global pest of cruciferous vegetables in greenhouses growing mizuna (Japanese mustard) in Japan. They used two groups of greenhouses, the treated group having dispensers for the HIPVs as well as honeyfeeders to attract C. vestalis) and a second untreated group. In each greenhouse, a single sticky trap, replaced weekly over 6 months, was used to catch both DBMs and C. vestalis and the numbers of both counted. While greenhouse ID could have been included as a random effect in a mixed model analysis, the available data did not record individual greenhouses. We will model numbers of C. vestalis against numbers of DBM and treatment using each trap as the units of analysis. The study was done in 2006 and 2008 but we only analyze the 2008 data.
The data are available on Dryad, and you’ll want the Excel file Fig3.xls, and within it, the sheet labelled 2008. Before using it in analysis, you’ll need to tidy up the first few rows. Alternatively, we provide a simplified version, with three columns, treatment, moth, and parasitoid (uefune.csv).
Import the data and check the properties of the response variable.
#Import uefune data file (uefune.csv)
uefune <- read.csv("../data/uefunex.csv")
head(uefune,10)
## treatment moth parasitoid
## 1 treated 0 0
## 2 treated 0 0
## 3 treated 0 0
## 4 treated 0 0
## 5 treated 0 0
## 6 treated 0 0
## 7 treated 0 1
## 8 treated 0 0
## 9 treated 0 0
## 10 treated 0 0
Hint: Try graphical presentation (possibly including a transformation) and look at summary statistics
What do you conclude about the response variable?
You should have decided that the data are too skewed to fit a linear model assuming a normal distribution, and a poisson is a strong candidate.
A look at the means and standard deviations suggests that the variance was much larger than the mean, suggesting some overdispersion. Let’s check for it, and see if we should be concerned.
Hint: here’s an R code chunk to do that:
presid <- resid(uefune.glm, type=“pearson”)
ssize <- nrow(uefune1)
params <- length(coef(uefune.glm))
disp <- sum(presid^2)/(ssize-params)
disp
Conclusions?
How does your solution work?
For interest(!). you could also compare the results to the “old” way of transforming response variables closer to normality and fitting an OLS linear model
Try linear with sqrt transform
Caudill and Rice (2016) examined the effectiveness of methods for making agriculture and habitat preservation more compatible, specifically assessing whether protocols for “Biodiversity-Friendly” coffee growing resulted in positive outcomes for mammals. They compared four habitats, forest, Bird Friendly® shade, conventional shade, and sun coffee, and used a combination of Sherman traps and camera traps to count mammals and assess species richness. Each habitat was represented by multiple sites, with 23 total sites monitored. At each site, they also recorded a series of plant habitat variables, such as cover at canopy, mid- and lower-strata, and ground level, tree basal area. They were interested primarily in differences between habitats, but also the role of vegetation characteristics in influencing mammal diversity. We focus on this latter relationship.
The data were provided as an Excel file through Dryad, which needs a little tidying to use in R; two additional header rows, and some variable names quite long. Use the tidied caudill.csv file for convenience.
Potential explanatory habitat variables are
Caudill & Rice examined three response variables, Small Mammal “species density”, which was the number of small mammal species recorded, medium-large density, and total species density, which combined small and med-large.
df <- read.csv("../data/caudill.csv")
head(df,10)
## site habitat trapnights individuals specdens_small rrapnights_camera
## 1 1 BF 500 3 1 10
## 2 2 BF 500 14 4 10
## 3 3 BF 480 6 3 10
## 4 4 BF 480 13 4 10
## 5 5 BF 480 16 3 10
## 6 6 BF 500 21 5 10
## 7 7 Forest 500 6 2 10
## 8 8 Forest 500 16 3 10
## 9 9 Forest 340 2 2 10
## 10 10 Forest 480 4 1 10
## num_images specdens_ml totspecdens canopy basalarea midstrata lowstrata
## 1 8 2 3 90.9 37.9 35.7 30.3
## 2 10 5 9 81.4 20.7 70.3 86.5
## 3 9 3 6 92.7 34.0 58.6 21.4
## 4 24 3 7 77.3 22.9 54.5 67.7
## 5 3 1 4 91.7 22.9 64.5 54.3
## 6 3 2 7 82.3 22.9 35.8 70.6
## 7 5 2 4 89.0 12.0 76.8 53.8
## 8 5 2 5 96.6 18.4 18.3 65.1
## 9 12 1 3 89.6 38.6 66.7 51.7
## 10 5 3 4 96.0 26.8 44.6 39.2
## groundcov treecount elevation treerichness treeheight coffeeheight X
## 1 62.2 84 723 10 20.4 2.5 NA
## 2 2.6 59 711 12 15.1 2.7 NA
## 3 53.4 80 713 13 11.7 2.8 NA
## 4 3.2 41 660 9 18.4 2.7 NA
## 5 65.9 48 641 5 12.9 2.7 NA
## 6 7.1 35 755 9 19.4 2.4 NA
## 7 28.4 171 487 NA NA NA NA
## 8 9.4 112 496 NA NA NA NA
## 9 24.3 35 650 NA NA NA NA
## 10 17.8 203 667 NA NA NA NA
#Forest habitat has no coffee trees, and low tree richness, so habitat variables not all present. Create subset of this data file
df2<-filter(df, habitat != 'Forest')
Look at the three response variables. What kind of distribution is most likely for them?
Would you analyse all three response variables? Why?
What do you conclude about habitat influences on small mammals?
This exercise expands on the example introduced in Chapter 13, where Sinclair and Arcese (1995) were interested in the causes (predation or other) of death of wildebeeste, more specifically whether predation varied with sex and health of animals. They addressed the question by examining carcasses, and they cross-classified 226 wildebeest carcasses from the Serengeti by three variables: sex (male, female), cause of death (predation, non-predation) and bone marrow type (SWF: solid white fatty; OG: opaque gelatinous; TG: translucent gelatinous; with the first indicating a healthy animal which is not undernourished).
We could look at these data in two ways; as we discussed in the chapter, it is sometimes difficult to identify a clear response variable in situations like this. At other times, one variable is a clear candidate.
The data were extracted from the original paper and used in our first edition. The data file is here.
#Import sinclair data file (sinclair.csv)
sinclair <- read.csv("../data/sinclair.csv")
head(sinclair,10)
## death sex marrow gnus
## 1 Predation F SWF 26
## 2 Predation F OG 32
## 3 Predation F TG 8
## 4 Predation M SWF 14
## 5 Predation M OG 43
## 6 Predation M TG 10
## 7 Other F SWF 6
## 8 Other F OG 26
## 9 Other F TG 16
## 10 Other M SWF 7
factor(sinclair$sex) #use these later when running glm
## [1] F F F M M M F F F M M M
## Levels: F M
factor(sinclair$marrow)
## [1] SWF OG TG SWF OG TG SWF OG TG SWF OG TG
## Levels: OG SWF TG
factor(sinclair$death)
## [1] Predation Predation Predation Predation Predation Predation Other
## [8] Other Other Other Other Other
## Levels: Other Predation
If you’re working with R we’d recommend that you normally just run our small script libraries.R:
In this case, you’ll need to load two other packages, epitools and vcd
The 226 carcasses each could be placed into one of 12 cells of a 2 x 2 x 3 table, and we could model the cell count.
We could make the cause of death our focus, recording its death as predation or other, and record values for the other two values.
What is your first step in fitting the model to these data?
What are your next steps?
What do you conclude about the lives of wildebeeste?
There are two kinds of data files that can be used in this case. We could work with a long file in which every gnu carcass is a record, or we could use a summary file, where one column of the file contains a cell count. This second format is what we used for fitting the log-linear model of part a, so we’ll stay with it.
To run these data with a binary response, we need to do two things:
Use a weights option, so the variable gnus in this case indicates that a record like this occurs several times, equal to the value of gnus for this record.
running a glm may require the death variable to be numerical, so we’ll define a new variable, pred with values 1 if Predation, 0 if Other. We can do that using ifelse.
Don’t do a full analysis here; just start by looking at the fit of the initial complex model, and see how your results compare to those you obtained in part a
One of the consequences of inbreeding in animal populations is that it is thought to increase the susceptibility of individuals to disease. To assess this, Townsend et al. (2018) did genetic analysis of 178 crows found across California, and assessed their level of homozygosity (homozygosity by locus - HL), a measure of how inbred they were (higher levels of HL = more inbred). They also took blood samples from the same crows to check for the presence of avian malaria (Plasmodium), a common disease. They wanted to test whether higher amounts of inbreeding (higher HL) were associated with increased probability of having a Plasmodium infection.
The data file is here
#Import data file
df <- read.csv("../data/townsend.csv")
head(df,10)
## Plasmodium HL
## 1 0 27.2
## 2 0 27.0
## 3 1 27.2
## 4 0 22.6
## 5 0 34.6
## 6 0 32.4
## 7 0 26.0
## 8 0 20.5
## 9 0 26.2
## 10 1 31.8
What kind of response variable do we have here?
Because we are doing a binomial response a simple scatterplot to check assumptions here is a little pointless, but just in case you do want to see it, have a crack
Think about how you might check how well the model fits
Examine the results from fitting the GLM and testing the null hypothesis that the slope of the response to predictor equals zero.
Determine and interpret the following:
y-intercept
slope of the regression
z value for main H0
P-value for main H0
Obtain the 95% confidence intervals on the model estimates
We can actually use this model to determine what the probability of a bird having avian malaria is if its homozygosity level is 50%
What is that probability?
We might want to something a bit more meaningful about the relationship of homozygosity level to incidence of Plasmodium and express this in terms of the odds of a bird having avian malaria. Remember that the binomial model uses a logit link function so that it models the response in terms of log odds (remember the odds are the probability of having Plasmodium divided by the probability of not having it). To compute how the odds of a bird having Plasmodium varies with homozygosity level in R, us exp(coef(townsend.glm)). You can also generate 95% confidence intervals on these odds ratios using exp(confint.default(townsend.glm))
What do you interpret from these results about the odds of a bird having Plasmodium as you increase the homozygosity level?
You might find it helpful to plot the data as well as examining the output
Caudill SA, Rice RA (2016) Do Bird Friendly® Coffee Criteria Benefit Mammals? Assessment of Mammal Diversity in Chiapas, Mexico. PLoS ONE 11(11): e0165662. doi:10.1371/journal.pone.0165662
Sinclair, A. R. E. & Arcese, P. (1995). Population consequences of predation-sensitive foraging: the Serengeti wildebeeste. Ecology, 76, 882-91. doi.org/10.2307/1939353
Townsend, A. K., Taff, C. C., Wheeler, S. S., Weis, A. M., Hinton, M. G., Jones, M. L., Logsdon, R. M., Reisen, W. K., Freund, D., Sehgal, R. N. M., Saberi, M., Suh, Y. H., Hurd, J., and Boyce, W. M.. 2018. Low heterozygosity is associated with vector-borne disease in crows. Ecosphere 9( 10):e02407. 10.1002/ecs2.2407
Uefune M, Abe J, Shiojiri K, Urano S, Nagasaka K, Takabayashi J. 2020 Targeting diamondback moths in greenhouses by attracting specific native parasitoids with herbivory-induced plant volatiles. R. Soc. Open Sci. 7: 201592. http://dx.doi.org/10.1098/rsos.201592