The aim of these exercises is to get experience with multivariate data sets. In this chapter, you’ll focus on analyses based on associations among variables.
You’ll want the packages vegan, mvabund, and DAAG.
In earlier chapters, we’ve used the sequence we’ve used previously as a guide to working through an analysis, but for this chapter and the next, this sequence doesn’t work as well, particularly step 4. Keep using the sequence, but skip over steps that aren’t appropriate for a particular question.
What is the biological question and what are the response and predictor variables?
What distribution do you expect the response variable to follow?
Is each predictor
Write out the linear model corresponding to the biological question.
What are the assumptions behind the statistical model you’ll fit?
Fit the model
How will you assess whether the model fits well?
Can you detect an effect of the fixed predictors?
How much of the variance is attributable to each random predictor?
What do you conclude (including any cautions)
Posthumus, Koprowski, and Steidl (2015) were interested in the ecological role of red squirrels in eastern North America. These territorial animals hoard pine cones for food, and create large middens - piles of plant debris - when they return to eat these pine cones, etc (see their Figure 1). These middens are potentially important habitat changes for other vertebrates of these forests. Postumus and colleagues focused on two aspects of this relationship:
How do middens influence birds and other mammals? What habitat conditions influence the use of a site by squirrels?
We’ll look at this second aspect of their study - the relationship between habitat variables at each midden location and squirrel residency. They measured six habitat variables: 1. logvol: volume of downed logs > 20 cm diameter 2. cancov: canopy cover 3. basal area of large trees 4. slope (°) 5. aspect (°) 6. lsnags: number of large snags > 40 cm dbh 7. shannon: tree diversity 8. ltrees: number of large trees > 40 cm dbh
The data are available as Supplementary file S1. It’s an Excel file, from which you will need the first sheet (Habitat and Squirrel Features), and when you read it in, skip the first 3 rows. The package readxl makes this import easy. You might also want to rename the variables to something a bit more tractable. As usual, we’ve a tidied version of the file that can be used.
posthumus <- read_csv("../data/posthumus.csv")
## Rows: 100 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): cancov, shannon, logvol, lsnags, ltrees, basarea, slope, aspect, r...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(posthumus,10)
## # A tibble: 10 × 11
## cancov shannon logvol lsnags ltrees basarea slope aspect rspres coneind
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 78.5 0.623 102. 21.9 132. 455418. -10 285 0 0
## 2 56.7 0.708 117. 21.9 87.7 305798. -8 240 0 0
## 3 83.9 0.381 31.1 0 87.7 528042. -22 5 0 0
## 4 78.2 1.1 44.3 21.9 175. 636345. -5 180 1 4
## 5 58.5 0.892 154. 0 21.9 319288. -14 250 0 0
## 6 71.7 0.861 101. 0 43.9 520032. -12 50 0 0
## 7 65.1 0.699 125. 21.9 110. 584455. -10 220 0 0
## 8 62.3 0.814 174. 0 65.8 440843. -4 282 1 1
## 9 65.8 1.02 74.4 21.9 43.9 628073. -8 300 0 0
## 10 73.2 0.248 92.9 65.8 132. 928825. -14 274 0 0
## # ℹ 1 more variable: resprop <dbl>
Start by looking at a scatterplot matrix to see if there any issues.
summary(posthumus)
## cancov shannon logvol lsnags
## Min. :16.64 Min. :0.2480 Min. : 13.95 Min. : 0.00
## 1st Qu.:68.21 1st Qu.:0.7953 1st Qu.: 48.73 1st Qu.: 0.00
## Median :75.13 Median :0.9765 Median : 67.89 Median : 21.93
## Mean :72.53 Mean :0.9687 Mean : 83.14 Mean : 30.26
## 3rd Qu.:81.08 3rd Qu.:1.1643 3rd Qu.:110.13 3rd Qu.: 43.86
## Max. :90.90 Max. :1.6400 Max. :320.77 Max. :131.58
## ltrees basarea slope aspect
## Min. : 21.93 Min. : 187271 Min. :-41.0 Min. : 3.0
## 1st Qu.: 109.65 1st Qu.: 509582 1st Qu.:-18.0 1st Qu.:127.8
## Median : 230.26 Median : 607795 Median :-12.0 Median :252.0
## Mean : 464.70 Mean : 645352 Mean :-13.7 Mean :213.8
## 3rd Qu.: 794.96 3rd Qu.: 779580 3rd Qu.: -8.0 3rd Qu.:310.5
## Max. :1907.91 Max. :1722517 Max. : -2.0 Max. :359.0
## rspres coneind resprop
## Min. :0.0 Min. :0.00 Min. :0.0000
## 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.0000
## Median :0.0 Median :0.00 Median :0.1667
## Mean :0.2 Mean :0.69 Mean :0.3278
## 3rd Qu.:0.0 3rd Qu.:0.00 3rd Qu.:0.6667
## Max. :1.0 Max. :4.00 Max. :1.0000
# correlation matrix - correlations low
cor(posthumus[,c('cancov','shannon','logvol','lsnags','ltrees','basarea','slope','aspect')])
## cancov shannon logvol lsnags ltrees
## cancov 1.00000000 -0.221248776 -0.31355638 0.048869189 0.403682264
## shannon -0.22124878 1.000000000 0.04018764 0.138911782 -0.259423437
## logvol -0.31355638 0.040187644 1.00000000 0.021759489 -0.043456584
## lsnags 0.04886919 0.138911782 0.02175949 1.000000000 0.003282012
## ltrees 0.40368226 -0.259423437 -0.04345658 0.003282012 1.000000000
## basarea 0.44989087 0.099267088 -0.14381036 0.518924557 0.115393734
## slope 0.02907149 -0.009026509 0.20302971 -0.163293434 -0.034250151
## aspect -0.13755352 0.096014126 0.07617264 0.157538616 0.011613852
## basarea slope aspect
## cancov 0.44989087 0.029071491 -0.13755352
## shannon 0.09926709 -0.009026509 0.09601413
## logvol -0.14381036 0.203029711 0.07617264
## lsnags 0.51892456 -0.163293434 0.15753862
## ltrees 0.11539373 -0.034250151 0.01161385
## basarea 1.00000000 -0.153839072 -0.05530765
## slope -0.15383907 1.000000000 0.07750172
## aspect -0.05530765 0.077501717 1.00000000
scatterplotMatrix(~cancov+shannon+logvol+lsnags+ltrees+basarea+slope+aspect,data=posthumus,smooth = FALSE, diagonal=list(method='boxplot'))
Let’s transform the habitat variables, as was done in the original paper, and look at the scatterplot matrix again. Canopy cover was squared, and the other 5 were log-transformed. Note that the slope values in the original data file are negative, so we reversed the sign before transforming. Large snags has zero values, so we’ll add 1 to these values (1 being the smallest possible value)
# transform as in original paper
posthumus <-posthumus %>%
dplyr::mutate(
llogvol = log10(logvol),
llsnags = log10(lsnags+1),
lltrees = log10(ltrees),
lbasarea = log10(basarea),
lslope = log10(-1*slope),
scancov = (cancov^2)
)
Relook at correlations
cor(posthumus[,c('scancov','shannon','llogvol','llsnags','lltrees','lbasarea','lslope','aspect')])
## scancov shannon llogvol llsnags lltrees
## scancov 1.00000000 -0.21123659 -0.20567115 0.04219617 0.483626340
## shannon -0.21123659 1.00000000 0.02915727 0.06251862 -0.157744470
## llogvol -0.20567115 0.02915727 1.00000000 0.10576105 -0.005254380
## llsnags 0.04219617 0.06251862 0.10576105 1.00000000 0.096833086
## lltrees 0.48362634 -0.15774447 -0.00525438 0.09683309 1.000000000
## lbasarea 0.52857281 0.04163735 -0.13116006 0.33998513 0.256871410
## lslope 0.03581867 0.04122565 -0.19877262 0.22144833 0.039356635
## aspect -0.14110271 0.09601413 0.08804067 0.12680025 0.006370239
## lbasarea lslope aspect
## scancov 0.52857281 0.03581867 -0.141102709
## shannon 0.04163735 0.04122565 0.096014126
## llogvol -0.13116006 -0.19877262 0.088040668
## llsnags 0.33998513 0.22144833 0.126800254
## lltrees 0.25687141 0.03935663 0.006370239
## lbasarea 1.00000000 0.15031820 -0.108194103
## lslope 0.15031820 1.00000000 -0.041692492
## aspect -0.10819410 -0.04169249 1.000000000
scatterplotMatrix(~scancov+shannon+llogvol+llsnags+lltrees+lbasarea+lslope+aspect,data=posthumus, smooth = FALSE, diagonal=list(method='boxplot'))
Would you use a correlation or a covariance matrix
Does the PCA help with simplifying the habitat variables?
How much variation do they explain?
How do the original variables contribute to these components?
While recording habitat variables, each location was also surveyed 21 times, and the research team recorded the number of times Red Squirrels were present. With standardised surveying effort, we could look at the number of surveys with squirrels, but they also calculated a proportion of surveys with squirrels present (resprop).
Now that you have some nice independent predictors (we hope ;-)), you can explore whether squirrel presence can be predicted using them. This should be quick, as you know the predictors are independent!
Hint: If you saved your PCA analysis into a file, you can easily access the scores. If you coerce them into a dataframe, using as.data.frame(posthumus.pca$scores), you’re good to go.
Hutto and Barrett (2021) were concerned with effects of urbanization and the potential for urban open spaces to mediate those effects. They focused on amphibians as a group of concern, and surveyed 51 ponds in South Carolina, where they recorded a suite of habitat variables relevant to amphibians. The sites were classified into three groups, highly urbanized, low urbanization, and urban open spaces.
At each pond, they also characterized the frog and toad assemblage, using a mixture of dip-net surveys, calls, and deployment of retreats for tree frogs.
The data are in three supplementary files, linked through the paper. You’ll be using huttoenv.csv and huttoamph.
huttoenv <- read_csv("../data/huttoenv.csv")
## Rows: 51 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (10): site, group, ph, cond, wdepth, omdepth, cancov, area, rivernear, w...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(huttoenv, 10)
## # A tibble: 10 × 11
## site type group ph cond wdepth omdepth cancov area rivernear
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low 1 6.54 35.2 1.2 9.49 78 0.29 0
## 2 2 High 2 6.32 26.5 1.2 2.5 1 0.22 307.
## 3 3 Low 1 7.04 157. 0.33 4.55 1 0.24 21.1
## 4 4 High 2 7.8 36.2 1.2 2.82 1 0.14 449.
## 5 5 Open Space 3 7.18 47.7 1.2 3.28 0 0.06 21.2
## 6 6 Open Space 3 6.42 35.9 1.2 4.17 0 0.04 26.0
## 7 7 Open Space 3 6.84 101. 1.2 2.42 6 1.23 0
## 8 8 Open Space 3 6.63 49.3 1.2 8.27 19 0.18 0
## 9 9 Open Space 3 6.89 43.3 1.2 0.92 9 0.27 0
## 10 10 Open Space 3 6.24 44.4 1.2 5.62 33 0.1 1.58
## # ℹ 1 more variable: wetlandnear <dbl>
The variables of interest for our analysis are:
Now have a quick look at the anuran data.
huttoamph <- read_csv("../data/huttoamph.csv")
## Rows: 50 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (13): Site, ACRE, BAME, BFOW, GCAR, HCIN, HVER, LCAT, LCLA, LPAL, LSPH, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(huttoamph,10)
## # A tibble: 10 × 14
## Site type ACRE BAME BFOW GCAR HCIN HVER LCAT LCLA LPAL LSPH PCRU
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low 9 0 5 0 0 0 1 4 0 1 10
## 2 2 High 0 0 0 0 4 2 0 4 0 2 5
## 3 3 Low 0 0 0 3 9 6 2 2 0 1 10
## 4 4 High 0 0 0 0 0 0 8 0 0 2 1
## 5 5 Open… 0 0 0 0 0 0 0 1 0 0 0
## 6 6 Open… 1 0 2 0 0 0 4 0 0 1 0
## 7 7 Open… 3 0 5 0 3 0 1 2 0 1 1
## 8 8 Open… 3 0 0 0 1 0 1 2 0 1 1
## 9 9 Open… 0 0 0 0 0 0 0 0 0 0 0
## 10 10 Open… 0 0 0 0 0 0 3 6 0 2 1
## # ℹ 1 more variable: PFER <dbl>
The frog and toad data were essentially standardized by the authors when they combined the different survey methods, with each species receiving an abundance score between 0 and 10.
The anurans were:
They ranged from species present around nearly every pond (L. clamitans) to those only present in a few ponds (P. feriarum).
Think about the distributions of the habitat variables:
Are there correlations between them?
Do any variables need transformation before PCA
Should you standardise the data?
Important hint: One pond has no anuran data, but has environmental data. You’ll need to exclude pond 17 from the huttoenv data at some stage
Militão et al. (2014) were interested in how easy it was to distinguish between two closely-related shearwaters, the Yelkouan and the Balearctic. Genetic analyses suggested that these birds diverged relatively recently, and can be hard to identify visually, and they overlap in parts of their ranges in the western Mediterranean. Both are caught as by-catch in commercial longline fisheries, but they differ in conservation status, with the Balearctic being critically endangered and the Yelkouan vulnerable. This difference in conservation status makes it important to identify by-catch, particularly as they may differ in vulnerability to fishing.
Militao and colleagues used several approaches, from biometric measurements to Stable Isotope Analysis to plumage colour to bill morphology, to identify a simple method for identifying bycatch, without having to resort to costly and time-consuming genetic analysis. They applied these approaches to a sample of birds of known species identity.
We’ll focus on the biometric data, because this information can be recorded easily while on a fishing vessel, and requires little training. The question of interest is:
Their data are in supplement S1 of the paper. They’re an Excel file, and you’ll need to skip the first 6 rows to get to the variable names in row 7. If you’ve copied the file into your downloads folder, the next line of code will read in this data frame:
militao <- read_excel(“~/Downloads/pone.0115650.s003.xlsx”, skip = 6)
We’ve used a tidied up version (militao.csv)
The variables of relevance are:
militao <- read_csv("../data/militao.csv")
## Rows: 194 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): dataset, sex, spp, sppsex
## dbl (8): birdid, year, bdb, bdn, bl, mhl, tl, wl
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(militao,10)
## # A tibble: 10 × 12
## dataset birdid sex spp sppsex year bdb bdn bl mhl tl wl
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a 1 m ys ysm 2002 10.5 7.55 36.4 85.2 46.0 237
## 2 a 20 f bs bsf 2002 11.5 8.71 38.2 89.4 51.0 251
## 3 a 21 f bs bsf 2002 11.7 8.74 38.0 88.6 48.2 253
## 4 a 23 f bs bsf 2002 11.4 8.4 38.9 90.0 50.9 263
## 5 a 138 f bs bsf 2004 11.3 8.02 38.2 87.8 49.2 250
## 6 a 140 m bs bsm 2004 12.7 9.22 39.0 91.8 51.7 261
## 7 a 141 f bs bsf 2004 10.8 7.95 37.7 88.4 49.9 253
## 8 a 142 m u um 2004 11.4 8.59 39.0 91.5 49.1 242.
## 9 a 143 f bs bsf 2004 12.0 9.1 35.7 86.5 47.8 257
## 10 a 214 f ys ysf 2005 10.3 6.77 35.2 82.0 46.6 225
Analysis in the paper uses data from this file, but excludes birds that could not be assigned to species (labelled unknown), and for the LDA, uses a subset of the data (“a”).
Militao et al. also standardised their biometric variables - “translation” - to remove sexual differences - subtracted sex-species mean for each variable.
militao1 <- filter(militao, dataset=='a' & spp %in% c('ys','bs'))
#Centre the morphological variables
militao1 <- militao1 %>%
dplyr::group_by(sppsex) %>%
mutate(
bdbm = mean(bdb),
bdnm = mean(bdn),
blm = mean(bl),
mhlm = mean(mhl),
tlm = mean(tl),
wlm = mean(wl)
) %>%
ungroup() %>%
mutate(
bdbc = bdb - bdbm,
bdnc = bdn - bdnm,
blc = bl - blm,
mhlc = mhl - mhlm,
tlc = tl - tlm,
wlc = wl - wlm
)
head(militao1)
The paper’s authors chose to adjust for bird sex by subtracting the mean value for that bird species/sex from each measurement, before running their LDA.
Did that decision improve the discriminatory capacity, i.e. what would have happened if raw data were used?
Can you think of another way of adjusting for sex of bird?