The aim of these exercises is to continue making sure you’re comfortable with handling multivariate data. In this chapter’s exercises, you’ll focus on analyses based on dissimilarities/distances, including fitting linear models to these kinds of response variables.
You’ll need the packages vegan and mvabund.
For each of the examples below, you should follow the sequence we’ve used previously, as far as it’s sensible:
What is the biological question and what are the response and predictor variables?
What distribution do you expect the response variable to follow?
Is each predictor
Write out the linear model corresponding to the biological question.
What are the assumptions behind the statistical model you’ll fit?
Fit the model
How will you assess whether the model fits well?
Can you detect an effect of the fixed predictors?
How much of the variance is attributable to each random predictor?
What do you conclude (including any cautions)
Dixon et al. (2018) focused on assemblages of reptiles in forests and woodlands of southeastern Australia, with a particular interest in relationships between these assemblages and the fire history of the landscape. They identified 81 sites that varied in time since fire, from 6 months to >96 y. Rather than a continuum, three categories of time since fire were used, 0.5-2y, 6-12y, and >96y. Sites were also classified according to habitat.
Reptile assemblages were sampled with a range of methods, including visual surveys and camera traps, to give counts of 20 reptiles, whose abundance ranged from 0-1 to 0-126, depending on species.
Data are available from dryad, as Rep_abund.csv. You’ll want to focus on the columns with reptile numbers, and the one with the fire history (tsf). For convenience, we’ve extracted those data as dixonbiota.csv
Start by having a quick look at the data.
df <- read_csv("../data/dixonbiota.csv")
## Rows: 79 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): site, tsf
## dbl (20): adup, aplat, amur, amac, aram, dcor, ecun, esax, eul, htal, ldel, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df, 10)
## # A tibble: 10 × 22
## site tsf adup aplat amur amac aram dcor ecun esax eul htal ldel
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A01 long 1 0 4 2 0 1 0 0 0 1 1
## 2 A02 long 0 0 4 2 0 0 0 0 1 1 0
## 3 A04 long 0 3 0 1 0 1 0 0 0 0 0
## 4 A05 long 2 0 0 1 0 0 2 0 1 0 2
## 5 A06 long 0 1 1 4 1 0 0 1 0 0 5
## 6 A07 long 0 0 0 1 0 1 0 0 0 1 2
## 7 A08 long 3 1 0 1 0 0 2 0 1 0 2
## 8 A09 long 1 1 1 2 0 0 0 0 1 0 3
## 9 A10 long 0 2 0 0 0 0 1 2 2 0 0
## 10 B01 short 0 2 0 0 0 0 0 0 0 0 0
## # ℹ 9 more variables: lgui <dbl>, lwhi <dbl>, ppor <dbl>, pent <dbl>,
## # pspe <dbl>, ptex <dbl>, rdie <dbl>, tnig <dbl>, vros <dbl>
The file is pretty straightforward, with tsf being the fire category, and columns 3-22 each representing one reptile taxon.
What would you conclude about the infuence of time since fire now?
Dixon and their colleagues also recorded a range of habitat variables, which we’ve collected for you in dixonenv.csv
dixonenv <- read_csv("../data/dixonenv.csv")
## Rows: 79 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): site, tsf, veg, aspect
## dbl (10): lcwd, shrcov, grcov, litcov, litdep, rocks, elev, warm, cold, twi
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dixonenv, 10)
## # A tibble: 10 × 14
## site tsf lcwd veg shrcov grcov litcov litdep rocks aspect elev warm
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 A01 long 9.72 SAW 0.077 0.65 0.35 9.9 4.68 N 1051. 25.0
## 2 A02 long 11.2 SAW 0.125 0.63 0.465 13.4 0 S 970. 25.6
## 3 A04 long 11.1 SAW 0.012 0.77 0.32 10.1 1.14 N 1073. 24.8
## 4 A05 long 10.8 SAW 0.059 0.345 0.636 9.2 18.9 N 1299. 22.6
## 5 A06 long 11.4 DS 0.084 0.685 0.585 23.9 24.3 W 1504. 20.9
## 6 A07 long 11.8 DS 0.144 0.9 0.665 22.1 7.66 E 1532. 20.6
## 7 A08 long 12.0 DS 0.04 0.725 0.41 10.8 0 W 1201. 23.6
## 8 A09 long 12.1 DS 0.079 0.74 0.465 17.1 2.86 E 1468. 21.2
## 9 A10 long 11.5 DS 0.057 0.575 0.635 21.2 26.1 S 1449. 21.3
## 10 B01 short 9.23 DS 0.31 0.175 0.375 8.2 1.37 W 739. 27.3
## # ℹ 2 more variables: cold <dbl>, twi <dbl>
They recorded Coarse Woody Debris, which was long-transformed, litter cover (litcov), % cover of groundcover (grcov), and % cover of shrubs (shrcov), and rock cover. In the original paper, Dixon et al. were comfortable that these predictors weren’t correlated.
The previous analyses showed differences in dispersion between tsf groups which may affect our permanova interpretations. An alternative is to use mvabund based on a poisson or negative binomial distribution.
Let’s return to the Hutto and Barrett (2021) example from the Chapter 15 exercises. There, we used analyses based on associations between variables to explore the relationship between frog assemblages (RDA) in ponds and the degree of urbanization surrounding. This seems a question that could just as easily be examined using distances or dissimilarities.
Return to the data and assess whether frog assemblages (using the standardized abundance scale or presence-absence) differ with urbanization.
To answer this question, you’ll need to exclude a couple of ponds that had no frogs.
df <- read_csv("../data/huttoamph.csv")
## Rows: 50 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (13): Site, ACRE, BAME, BFOW, GCAR, HCIN, HVER, LCAT, LCLA, LPAL, LSPH, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df,10)
## # A tibble: 10 × 14
## Site type ACRE BAME BFOW GCAR HCIN HVER LCAT LCLA LPAL LSPH PCRU
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low 9 0 5 0 0 0 1 4 0 1 10
## 2 2 High 0 0 0 0 4 2 0 4 0 2 5
## 3 3 Low 0 0 0 3 9 6 2 2 0 1 10
## 4 4 High 0 0 0 0 0 0 8 0 0 2 1
## 5 5 Open… 0 0 0 0 0 0 0 1 0 0 0
## 6 6 Open… 1 0 2 0 0 0 4 0 0 1 0
## 7 7 Open… 3 0 5 0 3 0 1 2 0 1 1
## 8 8 Open… 3 0 0 0 1 0 1 2 0 1 1
## 9 9 Open… 0 0 0 0 0 0 0 0 0 0 0
## 10 10 Open… 0 0 0 0 0 0 3 6 0 2 1
## # ℹ 1 more variable: PFER <dbl>
#ponds 9 and 40 had no frogs. Remove for distance analysis
df <- df %>%
filter(Site!= 9 & Site !=40)
df
## # A tibble: 48 × 14
## Site type ACRE BAME BFOW GCAR HCIN HVER LCAT LCLA LPAL LSPH PCRU
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Low 9 0 5 0 0 0 1 4 0 1 10
## 2 2 High 0 0 0 0 4 2 0 4 0 2 5
## 3 3 Low 0 0 0 3 9 6 2 2 0 1 10
## 4 4 High 0 0 0 0 0 0 8 0 0 2 1
## 5 5 Open… 0 0 0 0 0 0 0 1 0 0 0
## 6 6 Open… 1 0 2 0 0 0 4 0 0 1 0
## 7 7 Open… 3 0 5 0 3 0 1 2 0 1 1
## 8 8 Open… 3 0 0 0 1 0 1 2 0 1 1
## 9 10 Open… 0 0 0 0 0 0 3 6 0 2 1
## 10 11 Open… 0 1 8 0 0 0 5 1 0 2 0
## # ℹ 38 more rows
## # ℹ 1 more variable: PFER <dbl>
The explanation of the variables is in the previous chapter’s exercises.
Griffen et al. (2019) examined the oral microbiome of humans, with a focus on the effects of HIV and AntiRetroviral therapy (ART) on this microbiome. They described the microbiome of 341 patients, who fell into one of two categories: HIV-, and HIV+ with ART. They also matched their samples as far as possible for sex, along with other conditions that can influence oral microbiomes (Candida infection, current smoking). We’ll use their records for these other categories as well. There were more HIV+ than - subjects, but within these groups, approximately equal numbers of two sexes, two Candida infection status, and two smoking categories.
The bacteriome was recorded separately using 16S RNA sequencing, which covered just over 600 taxa.
The data are available from their paper, as two Excel files in the supplementary information. The metadata gives HIV status, and a raft of demographic information. We’ll just work with HIV, sex, Candida, and current smoking. A second sheet has the bacteriome. The code chunk below reads this file. It also screens the file for any bacterial taxa that were not present in this particular study, which reduces the bacterial taxa to 599.
library(readxl)
#Read in metadata. Lots of information we don't need, so a couple of iterations to get down to a few columns we want - HIV status, candida, current smoking, gender
df <- read_excel("../data/41598_2019_55703_MOESM3_ESM.xlsx",
range = "A1:S342", na = "NA")
df<-dplyr::select(df, HIV, candida, gender, smoking_current)
head(df)
## # A tibble: 6 × 4
## HIV candida gender smoking_current
## <chr> <chr> <chr> <chr>
## 1 Yes No Female No
## 2 Yes Yes Male No
## 3 Yes No Female No
## 4 Yes No Male No
## 5 Yes No Male No
## 6 Yes No Male Yes
#df1 is the bacterial data
df1 <- read_excel("../data/41598_2019_55703_MOESM2_ESM.xlsx")
## New names:
## • `` -> `...1`
head(df1,10)
## # A tibble: 10 × 636
## ...1 `Abiotrophia defectiva` Acetobacter pasteuria…¹ Achromobacter xyloso…²
## <dbl> <dbl> <dbl> <dbl>
## 1 14102. 14 0 0
## 2 14104. 10 0 0
## 3 14108. 38 0 0
## 4 14109. 10 0 0
## 5 14112. 0 0 0
## 6 14113. 0 0 0
## 7 14115. 0 0 0
## 8 14117. 1 0 0
## 9 14121. 0 0 0
## 10 14122. 0 0 0
## # ℹ abbreviated names: ¹`Acetobacter pasteurianus`,
## # ²`Achromobacter xylosoxidans`
## # ℹ 632 more variables: `Acidovorax cattleyae` <dbl>,
## # `Acinetobacter baumanii` <dbl>, `Acinetobacter calcoaceticus` <dbl>,
## # `Acinetobacter guillouiae` <dbl>, `Acinetobacter gyllenbergii` <dbl>,
## # `Acinetobacter haemolyticus` <dbl>, `Acinetobacter JEL30` <dbl>,
## # `Acinetobacter johnsonii` <dbl>, `Acinetobacter junii` <dbl>, …
df1 <- df1 %>%
dplyr::select(where( ~ is.numeric(.x) && sum(.x) != 0)) #Drops any bacterial taxon not recorded in this study (i.e. where it's all zeroes)
df1 <- df1[,-1] #remove col1, which is sample ID
Do these factors act independently on the bacteriome?
Which effects are largest (use some graphical methods)?
First steps: Think about standardization and which distance measure you’ll use. Bray-Curtis seems common for these kind of data, and counts of different taxa vary widely.
Could you fit a simpler model to the data?