Chapter 15

The aim of these exercises is to get experience with multivariate data sets. In this chapter, you’ll focus on analyses based on associations among variables.

You’ll want the packages vegan, mvabund, and DAAG.

In earlier chapters, we’ve used the sequence we’ve used previously as a guide to working through an analysis, but for this chapter and the next, this sequence doesn’t work as well, particularly step 4. Keep using the sequence, but skip over steps that aren’t appropriate for a particular question.

  1. What is the biological question and what are the response and predictor variables?

  2. What distribution do you expect the response variable to follow?

  3. Is each predictor

    1. Continuous or categorical?
    2. Fixed or random?
  4. Write out the linear model corresponding to the biological question.

  5. What are the assumptions behind the statistical model you’ll fit?

    1. Are those assumptions satisfied?
  6. Fit the model

    1. How will you assess whether the model fits well?

    2. Can you detect an effect of the fixed predictors?

    3. How much of the variance is attributable to each random predictor?

  7. What do you conclude (including any cautions)

A Unconstrained multivariate analysis: PCA and Principal Components Regression

Posthumus, Koprowski, and Steidl (2015) were interested in the ecological role of red squirrels in eastern North America. These territorial animals hoard pine cones for food, and create large middens - piles of plant debris - when they return to eat these pine cones, etc (see their Figure 1). These middens are potentially important habitat changes for other vertebrates of these forests. Postumus and colleagues focused on two aspects of this relationship:

How do middens influence birds and other mammals? What habitat conditions influence the use of a site by squirrels?

We’ll look at this second aspect of their study - the relationship between habitat variables at each midden location and squirrel residency. They measured six habitat variables: 1. logvol: volume of downed logs > 20 cm diameter 2. cancov: canopy cover 3. basal area of large trees 4. slope (°) 5. aspect (°) 6. lsnags: number of large snags > 40 cm dbh 7. shannon: tree diversity 8. ltrees: number of large trees > 40 cm dbh

The data are available as Supplementary file S1. It’s an Excel file, from which you will need the first sheet (Habitat and Squirrel Features), and when you read it in, skip the first 3 rows. The package readxl makes this import easy. You might also want to rename the variables to something a bit more tractable. As usual, we’ve a tidied version of the file that can be used.

posthumus <- read_csv("../data/posthumus.csv")
## Rows: 100 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): cancov, shannon, logvol, lsnags, ltrees, basarea, slope, aspect, r...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(posthumus,10)
## # A tibble: 10 × 11
##    cancov shannon logvol lsnags ltrees basarea slope aspect rspres coneind
##     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl>   <dbl>
##  1   78.5   0.623  102.    21.9  132.  455418.   -10    285      0       0
##  2   56.7   0.708  117.    21.9   87.7 305798.    -8    240      0       0
##  3   83.9   0.381   31.1    0     87.7 528042.   -22      5      0       0
##  4   78.2   1.1     44.3   21.9  175.  636345.    -5    180      1       4
##  5   58.5   0.892  154.     0     21.9 319288.   -14    250      0       0
##  6   71.7   0.861  101.     0     43.9 520032.   -12     50      0       0
##  7   65.1   0.699  125.    21.9  110.  584455.   -10    220      0       0
##  8   62.3   0.814  174.     0     65.8 440843.    -4    282      1       1
##  9   65.8   1.02    74.4   21.9   43.9 628073.    -8    300      0       0
## 10   73.2   0.248   92.9   65.8  132.  928825.   -14    274      0       0
## # ℹ 1 more variable: resprop <dbl>

Start by looking at a scatterplot matrix to see if there any issues.

summary(posthumus)
##      cancov         shannon           logvol           lsnags      
##  Min.   :16.64   Min.   :0.2480   Min.   : 13.95   Min.   :  0.00  
##  1st Qu.:68.21   1st Qu.:0.7953   1st Qu.: 48.73   1st Qu.:  0.00  
##  Median :75.13   Median :0.9765   Median : 67.89   Median : 21.93  
##  Mean   :72.53   Mean   :0.9687   Mean   : 83.14   Mean   : 30.26  
##  3rd Qu.:81.08   3rd Qu.:1.1643   3rd Qu.:110.13   3rd Qu.: 43.86  
##  Max.   :90.90   Max.   :1.6400   Max.   :320.77   Max.   :131.58  
##      ltrees           basarea            slope           aspect     
##  Min.   :  21.93   Min.   : 187271   Min.   :-41.0   Min.   :  3.0  
##  1st Qu.: 109.65   1st Qu.: 509582   1st Qu.:-18.0   1st Qu.:127.8  
##  Median : 230.26   Median : 607795   Median :-12.0   Median :252.0  
##  Mean   : 464.70   Mean   : 645352   Mean   :-13.7   Mean   :213.8  
##  3rd Qu.: 794.96   3rd Qu.: 779580   3rd Qu.: -8.0   3rd Qu.:310.5  
##  Max.   :1907.91   Max.   :1722517   Max.   : -2.0   Max.   :359.0  
##      rspres       coneind        resprop      
##  Min.   :0.0   Min.   :0.00   Min.   :0.0000  
##  1st Qu.:0.0   1st Qu.:0.00   1st Qu.:0.0000  
##  Median :0.0   Median :0.00   Median :0.1667  
##  Mean   :0.2   Mean   :0.69   Mean   :0.3278  
##  3rd Qu.:0.0   3rd Qu.:0.00   3rd Qu.:0.6667  
##  Max.   :1.0   Max.   :4.00   Max.   :1.0000
# correlation matrix - correlations low
cor(posthumus[,c('cancov','shannon','logvol','lsnags','ltrees','basarea','slope','aspect')])
##              cancov      shannon      logvol       lsnags       ltrees
## cancov   1.00000000 -0.221248776 -0.31355638  0.048869189  0.403682264
## shannon -0.22124878  1.000000000  0.04018764  0.138911782 -0.259423437
## logvol  -0.31355638  0.040187644  1.00000000  0.021759489 -0.043456584
## lsnags   0.04886919  0.138911782  0.02175949  1.000000000  0.003282012
## ltrees   0.40368226 -0.259423437 -0.04345658  0.003282012  1.000000000
## basarea  0.44989087  0.099267088 -0.14381036  0.518924557  0.115393734
## slope    0.02907149 -0.009026509  0.20302971 -0.163293434 -0.034250151
## aspect  -0.13755352  0.096014126  0.07617264  0.157538616  0.011613852
##             basarea        slope      aspect
## cancov   0.44989087  0.029071491 -0.13755352
## shannon  0.09926709 -0.009026509  0.09601413
## logvol  -0.14381036  0.203029711  0.07617264
## lsnags   0.51892456 -0.163293434  0.15753862
## ltrees   0.11539373 -0.034250151  0.01161385
## basarea  1.00000000 -0.153839072 -0.05530765
## slope   -0.15383907  1.000000000  0.07750172
## aspect  -0.05530765  0.077501717  1.00000000
scatterplotMatrix(~cancov+shannon+logvol+lsnags+ltrees+basarea+slope+aspect,data=posthumus,smooth = FALSE, diagonal=list(method='boxplot'))

Let’s transform the habitat variables, as was done in the original paper, and look at the scatterplot matrix again. Canopy cover was squared, and the other 5 were log-transformed. Note that the slope values in the original data file are negative, so we reversed the sign before transforming. Large snags has zero values, so we’ll add 1 to these values (1 being the smallest possible value)

# transform as in original paper

posthumus <-posthumus %>%
  dplyr::mutate(
    llogvol = log10(logvol),
    llsnags = log10(lsnags+1),
    lltrees = log10(ltrees),
    lbasarea = log10(basarea),
    lslope = log10(-1*slope),
    scancov = (cancov^2)
    )

Relook at correlations

cor(posthumus[,c('scancov','shannon','llogvol','llsnags','lltrees','lbasarea','lslope','aspect')])
##              scancov     shannon     llogvol    llsnags      lltrees
## scancov   1.00000000 -0.21123659 -0.20567115 0.04219617  0.483626340
## shannon  -0.21123659  1.00000000  0.02915727 0.06251862 -0.157744470
## llogvol  -0.20567115  0.02915727  1.00000000 0.10576105 -0.005254380
## llsnags   0.04219617  0.06251862  0.10576105 1.00000000  0.096833086
## lltrees   0.48362634 -0.15774447 -0.00525438 0.09683309  1.000000000
## lbasarea  0.52857281  0.04163735 -0.13116006 0.33998513  0.256871410
## lslope    0.03581867  0.04122565 -0.19877262 0.22144833  0.039356635
## aspect   -0.14110271  0.09601413  0.08804067 0.12680025  0.006370239
##             lbasarea      lslope       aspect
## scancov   0.52857281  0.03581867 -0.141102709
## shannon   0.04163735  0.04122565  0.096014126
## llogvol  -0.13116006 -0.19877262  0.088040668
## llsnags   0.33998513  0.22144833  0.126800254
## lltrees   0.25687141  0.03935663  0.006370239
## lbasarea  1.00000000  0.15031820 -0.108194103
## lslope    0.15031820  1.00000000 -0.041692492
## aspect   -0.10819410 -0.04169249  1.000000000
scatterplotMatrix(~scancov+shannon+llogvol+llsnags+lltrees+lbasarea+lslope+aspect,data=posthumus, smooth = FALSE, diagonal=list(method='boxplot'))

Use a PCA to simplify the habitat variables

Would you use a correlation or a covariance matrix

Does the PCA help with simplifying the habitat variables?

Summarize the first three PCs

How much variation do they explain?

How do the original variables contribute to these components?

Does this matter for squirrels?

While recording habitat variables, each location was also surveyed 21 times, and the research team recorded the number of times Red Squirrels were present. With standardised surveying effort, we could look at the number of surveys with squirrels, but they also calculated a proportion of surveys with squirrels present (resprop).

Now that you have some nice independent predictors (we hope ;-)), you can explore whether squirrel presence can be predicted using them. This should be quick, as you know the predictors are independent!

Hint: If you saved your PCA analysis into a file, you can easily access the scores. If you coerce them into a dataframe, using as.data.frame(posthumus.pca$scores), you’re good to go.

B. Constrained multivariate analysis (RDA) - amphibians in urban ponds

Hutto and Barrett (2021) were concerned with effects of urbanization and the potential for urban open spaces to mediate those effects. They focused on amphibians as a group of concern, and surveyed 51 ponds in South Carolina, where they recorded a suite of habitat variables relevant to amphibians. The sites were classified into three groups, highly urbanized, low urbanization, and urban open spaces.

At each pond, they also characterized the frog and toad assemblage, using a mixture of dip-net surveys, calls, and deployment of retreats for tree frogs.

The data are in three supplementary files, linked through the paper. You’ll be using huttoenv.csv and huttoamph.

huttoenv <- read_csv("../data/huttoenv.csv")
## Rows: 51 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): type
## dbl (10): site, group, ph, cond, wdepth, omdepth, cancov, area, rivernear, w...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(huttoenv, 10)
## # A tibble: 10 × 11
##     site type       group    ph  cond wdepth omdepth cancov  area rivernear
##    <dbl> <chr>      <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>     <dbl>
##  1     1 Low            1  6.54  35.2   1.2     9.49     78  0.29      0   
##  2     2 High           2  6.32  26.5   1.2     2.5       1  0.22    307.  
##  3     3 Low            1  7.04 157.    0.33    4.55      1  0.24     21.1 
##  4     4 High           2  7.8   36.2   1.2     2.82      1  0.14    449.  
##  5     5 Open Space     3  7.18  47.7   1.2     3.28      0  0.06     21.2 
##  6     6 Open Space     3  6.42  35.9   1.2     4.17      0  0.04     26.0 
##  7     7 Open Space     3  6.84 101.    1.2     2.42      6  1.23      0   
##  8     8 Open Space     3  6.63  49.3   1.2     8.27     19  0.18      0   
##  9     9 Open Space     3  6.89  43.3   1.2     0.92      9  0.27      0   
## 10    10 Open Space     3  6.24  44.4   1.2     5.62     33  0.1       1.58
## # ℹ 1 more variable: wetlandnear <dbl>

The variables of interest for our analysis are:

  • type: Type of area in which the pond occurred (3 categories)
  • area: wetland area (ha)
  • ph: pH
  • cond: Conductivity (µS/m)
  • wdepth: wetland depth (m)
  • omdepth: mean depth of the organic layer (cm)
  • rivernear: linear distance to nearest river (m)
  • wetlandnear: linear distance to wetland

Now have a quick look at the anuran data.

huttoamph <- read_csv("../data/huttoamph.csv")
## Rows: 50 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): type
## dbl (13): Site, ACRE, BAME, BFOW, GCAR, HCIN, HVER, LCAT, LCLA, LPAL, LSPH, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(huttoamph,10)
## # A tibble: 10 × 14
##     Site type   ACRE  BAME  BFOW  GCAR  HCIN  HVER  LCAT  LCLA  LPAL  LSPH  PCRU
##    <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1 Low       9     0     5     0     0     0     1     4     0     1    10
##  2     2 High      0     0     0     0     4     2     0     4     0     2     5
##  3     3 Low       0     0     0     3     9     6     2     2     0     1    10
##  4     4 High      0     0     0     0     0     0     8     0     0     2     1
##  5     5 Open…     0     0     0     0     0     0     0     1     0     0     0
##  6     6 Open…     1     0     2     0     0     0     4     0     0     1     0
##  7     7 Open…     3     0     5     0     3     0     1     2     0     1     1
##  8     8 Open…     3     0     0     0     1     0     1     2     0     1     1
##  9     9 Open…     0     0     0     0     0     0     0     0     0     0     0
## 10    10 Open…     0     0     0     0     0     0     3     6     0     2     1
## # ℹ 1 more variable: PFER <dbl>

The frog and toad data were essentially standardized by the authors when they combined the different survey methods, with each species receiving an abundance score between 0 and 10.

The anurans were:

  • Acris crepitans (ACRE)
  • Anaxyrus fowleri (BFOW)
  • Anaxyrus americana (BAME)
  • Gastrophryne carolinensis (GCAR)
  • Dryophytes cinerea (HCIN)
  • D. versicolor (HVER)
  • Lithobates clamitans (LCLA)
  • L. palustris (LPAL)
  • L. sphenocephalus (LSPH)
  • L. catesbeianus (LCAT)
  • Pseudacris feriarum (PFER)
  • P. crucifer (PCRU)

They ranged from species present around nearly every pond (L. clamitans) to those only present in a few ponds (P. feriarum).

Try using the environmental variables to constrain the ordination, rather than the habitat categories. Does a clearer pattern emerge?

Think about the distributions of the habitat variables:

Are there correlations between them?

Do any variables need transformation before PCA

Should you standardise the data?

Now do the RDA, once you’re happy with the data

Important hint: One pond has no anuran data, but has environmental data. You’ll need to exclude pond 17 from the huttoenv data at some stage

C. LDA: Using biometric data to distinguish bird species

Militão et al. (2014) were interested in how easy it was to distinguish between two closely-related shearwaters, the Yelkouan and the Balearctic. Genetic analyses suggested that these birds diverged relatively recently, and can be hard to identify visually, and they overlap in parts of their ranges in the western Mediterranean. Both are caught as by-catch in commercial longline fisheries, but they differ in conservation status, with the Balearctic being critically endangered and the Yelkouan vulnerable. This difference in conservation status makes it important to identify by-catch, particularly as they may differ in vulnerability to fishing.

Militao and colleagues used several approaches, from biometric measurements to Stable Isotope Analysis to plumage colour to bill morphology, to identify a simple method for identifying bycatch, without having to resort to costly and time-consuming genetic analysis. They applied these approaches to a sample of birds of known species identity.

We’ll focus on the biometric data, because this information can be recorded easily while on a fishing vessel, and requires little training. The question of interest is:

Can we use biometric measurements to reliably classify a bird to species?

Their data are in supplement S1 of the paper. They’re an Excel file, and you’ll need to skip the first 6 rows to get to the variable names in row 7. If you’ve copied the file into your downloads folder, the next line of code will read in this data frame:

militao <- read_excel(“~/Downloads/pone.0115650.s003.xlsx”, skip = 6)

We’ve used a tidied up version (militao.csv)

The variables of relevance are:

  • sex
  • spp: species
  • sppsex: a composite variable combining sex and species to a single variable with four values
  • year: when bird was collected (not used)
  • bdb: bill depth at base
  • bdn: bill depth at nostril
  • bl: bill length
  • mhl: maximum head length
  • tl: tarsus length
  • wl: wing length
militao <- read_csv("../data/militao.csv")
## Rows: 194 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): dataset, sex, spp, sppsex
## dbl (8): birdid, year, bdb, bdn, bl, mhl, tl, wl
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(militao,10)
## # A tibble: 10 × 12
##    dataset birdid sex   spp   sppsex  year   bdb   bdn    bl   mhl    tl    wl
##    <chr>    <dbl> <chr> <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 a            1 m     ys    ysm     2002  10.5  7.55  36.4  85.2  46.0  237 
##  2 a           20 f     bs    bsf     2002  11.5  8.71  38.2  89.4  51.0  251 
##  3 a           21 f     bs    bsf     2002  11.7  8.74  38.0  88.6  48.2  253 
##  4 a           23 f     bs    bsf     2002  11.4  8.4   38.9  90.0  50.9  263 
##  5 a          138 f     bs    bsf     2004  11.3  8.02  38.2  87.8  49.2  250 
##  6 a          140 m     bs    bsm     2004  12.7  9.22  39.0  91.8  51.7  261 
##  7 a          141 f     bs    bsf     2004  10.8  7.95  37.7  88.4  49.9  253 
##  8 a          142 m     u     um      2004  11.4  8.59  39.0  91.5  49.1  242.
##  9 a          143 f     bs    bsf     2004  12.0  9.1   35.7  86.5  47.8  257 
## 10 a          214 f     ys    ysf     2005  10.3  6.77  35.2  82.0  46.6  225

Analysis in the paper uses data from this file, but excludes birds that could not be assigned to species (labelled unknown), and for the LDA, uses a subset of the data (“a”).

Militao et al. also standardised their biometric variables - “translation” - to remove sexual differences - subtracted sex-species mean for each variable.

militao1 <- filter(militao, dataset=='a' & spp %in% c('ys','bs'))
#Centre the morphological variables
militao1 <- militao1 %>% 
  dplyr::group_by(sppsex) %>%
  mutate(
    bdbm = mean(bdb),
    bdnm = mean(bdn),
    blm = mean(bl),
    mhlm = mean(mhl),
    tlm = mean(tl),
    wlm = mean(wl)
    ) %>%
  ungroup() %>%
  mutate(
    bdbc = bdb - bdbm,
    bdnc = bdn - bdnm,
    blc = bl - blm,
    mhlc = mhl - mhlm,
    tlc = tl - tlm,
    wlc = wl - wlm
  )
head(militao1)

Use linear discriminant analysis (LDA) to decide whether these morphological measurements are enough to distinguish these shearwater species

Sensitivity

The paper’s authors chose to adjust for bird sex by subtracting the mean value for that bird species/sex from each measurement, before running their LDA.

Did that decision improve the discriminatory capacity, i.e. what would have happened if raw data were used?

Can you think of another way of adjusting for sex of bird?

References

Hutto, David, and Kyle Barrett. 2021. “Do Urban Open Spaces Provide Refugia for Frogs in Urban Environments?” Plos One 16 (1). https://doi.org/gr2r9n.
Militão, Teresa, Elena Gómez-Díaz, Antigoni Kaliontzopoulou, and Jacob González-Solís. 2014. “Comparing Multiple Criteria for Species Identification in Two Recently Diverged Seabirds.” Edited by Robert Guralnick. PLoS ONE 9 (12): e115650. https://doi.org/f6zccr.
Posthumus, Erin E., John L. Koprowski, and Robert J. Steidl. 2015. “Red Squirrel Middens Influence Abundance but Not Diversity of Other Vertebrates.” Plos One 10 (4). https://doi.org/f7jwtd.