Chapter 12

The aim of these exercises is to start getting familiar with mixed models, and in these exercises we’ll be focused on models with correlated data.

For each of the examples below, you should follow the sequence we’ve used previously:

What is the biological question?
Is the predictor continuous or categorical?
Write out the linear model corresponding to this question.
What distribution do you expect the response variable to follow?
What are the assumptions behind the statistical model you’ll fit?
1. Are those assumptions satisfied?
Fit the model
1. How will you assess whether the model fits well?
2. Can you detect an effect of the predictor?
3. How do you measure the effect?
What do you conclude (including any cautions)

Things to look out for

In fitting these models, we’ll need to make the initial checklist more complex:

as well as classifying a predictor as continuous or categorical, think clearly about whether it is random or fixed
think about the structure of the data; are all observations of the response variable independent of each other, or are they linked in some way? If they are linked, how do de account for that linkage in specifying and fitting a statistical model?
When there are random effects, assumptions can be more extensive. We’re usually interested in the fixed effects, and different fixed effects can have different assumptions. This is particularly the case with nested designs.
When there are correlated effects, we might need to be aware of additional assumptions.

A. Straightforward, with a chance to be complex

Let’s start with an example we mentioned briefly in Chapter 12, and work through the data analysis. Allen and Marshall (2014) studied the effects of egg size on larval characteristics of a marine tubeworm. They allocated eggs from a group of large or small female tubeworms to ten jars for each egg size. These eggs were fertilized, and after the embryos hatched, the proportion of larvae that had settled and metamorphosed was recorded for each jar every three days for 15 days. The question here is about maternal effects (using size as a proxy for energy reserves available) on offspring performance (time to settle and initiate metamorphosis from larval form to adult form).

Start by following the steps we’ve used for other chapters, so you’re clear on the questions the details of the design, and the applicable statistical model.

We’ll start by reading in the data and taking a quick look at it.

## setting contr.sum globally: options(contrasts=c('contr.sum', 'contr.poly'))

## # A tibble: 10 × 5
##    sizeclass jar   day    day1  perc
##    <chr>     <fct> <ord> <dbl> <dbl>
##  1 b         1     3         3 0.052
##  2 b         2     3         3 0.096
##  3 b         3     3         3 0.052
##  4 b         4     3         3 0.078
##  5 b         5     3         3 0.081
##  6 b         6     3         3 0.058
##  7 b         7     3         3 0.091
##  8 b         8     3         3 0.064
##  9 b         9     3         3 0.07 
## 10 b         10    3         3 0.071

In this data file:

sizeclass describes the mothers, b(ig) or s(mall)
jar is numbered 1-20, but could as easily be a, b, c, etc. or any arbitrary label. We treated it as a character variable just so we don’t be fooled into thinking the numbers mean anything quantitative.
day represents sampling times. We’ve made it categorical, ready for some analyses.
day1 represents sampling times, but as a continuous variable
perc is the proportion of larvae settled and undergoing or completed metamorphosis
lperc is log-transformed perc. Allen & Marshall used this transformation.

Start by taking a close look at the data and thinking about assumptions

If you have concerns about the raw data, would those concerns be eased if you used log-transformed responses?

Let’s start with a straightforward, conventional model

First analyse with day as categorical factor (using the predictor day)

Think of a couple of ways you could look at this predictor

What do you conclude about maternal effects? Illustrate your results graphically

You may (hopefully you did!) have noticed some quite different variances between groups. One way we could deal with this is to allow group-specific variances. For interest, let’s use this approach and see if this model fits better

Do your conclusions change?

Extend yourself: treat time as a continuous predictor (use day1)

Think about how you’d specify the model. Describe the random slopes and random intercepts models, and then fit them.

Is the random intercepts approach needed?

Last, if you want to extend yourself a bit more, Let’s look at accounting for some of the heterogeneity in variances by allowing covariances to differ (we’ll use AR1)

Now run as mixed effects random intercept with nlme - use day1 for continuous (not factor) and allow covariances to differ.

Does this approach give us a model that fits better?

B. Organic matter breakdown in streams - a messy example

Kiffer et al. (2018) were interested in the nutritional value of leaves from native and introduced trees to “shredders” – invertebrate larvae in streams that help to break down organic material. They focused on one shredder species, the caddisfly Triplectides gracilis and its relationship between four native species and the blue gum Eucalyptus globulus, an introduced species grown in plantations. They exposed individual caddisflies to one of these species, and recorded their growth; their Figure 1 shows the experimental setup. They measured the tibial length of each larva at weekly intervals, and used a conversion equation to estimate the animal’s biomass at this time.

What was their biological question?

What factors are in the design?

Which of these variables would you include:

Tree species (S)
Plantation type (P)
Caddis flies (C)
Time (T)

Which terms answer the biological question?

What checks would you do before fitting a model to these data?

Now let’s get hands-on.

First, the housekeeping - get the data file. The original data are in supplementary file S3, which you can get by following the link above to the original paper. S3 is an xls file, and the experimental data are in the sheet Larvae growth. If you want to import it to R, you’ll need to skip the first two rows; the variable names are on row 3 and data start on row 4. Name the frame kiffer. Alternatively, we have it for you:

## Rows: 93 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): species
## dbl (6): subject, week1, week2, week3, week4, week5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 6 × 7
##   subject species       week1 week2 week3 week4 week5
##     <dbl> <chr>         <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       1 Styrax pohlii  0.04 -0.02  0.02 -0.62 -0.95
## 2       2 Styrax pohlii  0.03  0.01 -0.01 -0.05  0.02
## 3       3 Styrax pohlii -0.19  0    -0.17  0.18 NA   
## 4       4 Styrax pohlii  0.3   0.19  0.31  0.19 NA   
## 5       5 Styrax pohlii  0.17 -0.15  0.94  0.93 NA   
## 6       6 Styrax pohlii  0.35  0     0.12 -0.08  0.18

Is there anything notable about this data set?

If you’re not sure, run a “traditional” OLS Repeated Measures analysis.

To make this analysis easier in R, let’s start by rearranging the original file with the weekly size measurements as variables to a longer file with one row for each weekly measurement

Still not sure? Use ezDesign as suggested or ezPrecis.

## Data frame dimensions: 93 rows, 7 columns

##              type missing values                 min           max
## subject    factor       0     93                   1            97
## species character       0      5 Eucalyptus globulus Styrax pohlii
## week1     numeric       0     61               -1.22          2.31
## week2     numeric       1     76               -1.65          3.52
## week3     numeric      10     73               -2.03          5.44
## week4     numeric      20     66               -2.23          7.94
## week5     numeric      31     59               -2.32          8.11

Suggest two ways of dealing with this problem (or three, if you’re really keen!).

Outline any issues with each approach

Assuming that you identified a solution involving dropping subjects and one involving a different way of treating weeks, we’ll need to create a couple of other arrangements of the data.

A version for running a “traditional” repeated measures OLS. This is a “wide” data file, with measurements at each week as columns. There are missing values, which OLS RM models don’t like, so we’ll drop any animal without a measurement in every week - kiffer_na. We’ll then convert it to a long version for running with time as a categorical predictor - kiffer1_na

“Traditional” OLS RM

Run the OLS version of the model, with time as a categorical predictor. What other checks would you do at this stage?

Do some exploratory data analysis

What do you conclude from your initial checks?

Here’s the split-plot version of this analysis

You might find it helpful to generate linear polynomials

Run the model as a linear mixed effects model

(R script provided). For easy comparison, use the kiffer1_na file, which has the same data.

Get anova results

The advantage of the mixed effects approach is that we can keep all insects in

An important decision is whether to fit a random intercepts model (as we did) or a random intercepts model, which we can do using the mixed model approach.

What do you conclude from your analysis? Would you reach different conclusions from the “traditional” and mixed model approaches?

C. (Challenging) - back to psychostimulants and flies

Let’s return to the Highfill et al. (2019) example from the Chapter 10 exercises. Recall that this was a substantial experiment designed to assess variation in susceptibility to psychosomatic drugs between genetic lines of flies, with the sexes incorporated into the design.

In the earlier discussion of this example, you should have thought about the physical structure of the experiment, in which food with and without the drug in question was offered to small groups of flies, where a group of flies was provided with capillary tubes with one of two solutions. The same group of flies was assessed three times, with feeding/recovery periods in between, to see if drug responses developed through time.

Let’s bring back the data file and look at it again. While we’re at it, we’ll make sure that Lines are treated as factors, rather than continuous predictors:

## # A tibble: 10 × 6
##    Solution Sex   Line  vial  Consumption Exposure
##    <chr>    <chr> <fct> <fct>       <dbl>    <dbl>
##  1 Cocaine  F     41    1            34.6        1
##  2 Cocaine  F     41    2            23.6        1
##  3 Cocaine  F     41    3            22.6        1
##  4 Cocaine  F     41    4            38.6        1
##  5 Cocaine  F     41    5            11.6        1
##  6 Cocaine  F     41    6            37.6        1
##  7 Cocaine  F     41    7            54.6        1
##  8 Cocaine  F     41    8            13.6        1
##  9 Cocaine  F     41    9            59.6        1
## 10 Cocaine  F     41    10           81.6        1

The design has vials as an experimental unit, which we might consider to be plots or subjects.

What other possible correlations in the data are there?

The model structure

What are the between-subjects factors?

What are within-subject(s) factors?

Have a crack at specifying the model and fitting it to the data. The worked examples of 12.4 and 12.5 will give you some help, but don’t expect it to be easy :-(

References

Allen, R. M., and D. Marshall. 2014. “Egg Size Effects Across Multiple Life-History Stages in the Marine Annelid Hydroides Diramphus.” PLoS One 9 (7): e102253. https://doi.org/gf8mjs.

Highfill, Chad A., Brandon M. Baker, Stephenie D. Stevens, Robert R. H. Anholt, and Trudy F. C. Mackay. 2019. “Genetics of Cocaine and Methamphetamine Consumption and Preference in Drosophila Melanogaster.” PLOS Genetics 15 (5). https://doi.org/gr2pgk.

Kiffer, Walace P., Flavio Mendes, Cinthia G. Casotti, Larissa C. Costa, and Marcelo S. Moretti. 2018. “Exotic Eucalyptus Leaves Are Preferred over Tougher Native Species but Affect the Growth and Survival of Shredders in an Atlantic Forest Stream (Brazil).” Edited by Luiz U. Hepp. PLOS ONE 13 (1): e0190743. https://doi.org/gr9z2g.