The aim of these exercises is to start getting familiar with mixed models, and in these exercises we’ll be focused on models with correlated data.
For each of the examples below, you should follow the sequence we’ve used previously:
What is the biological question?
Is the predictor continuous or categorical?
Write out the linear model corresponding to this question.
What distribution do you expect the response variable to follow?
What are the assumptions behind the statistical model you’ll fit?
Fit the model
How will you assess whether the model fits well?
Can you detect an effect of the predictor?
How do you measure the effect?
What do you conclude (including any cautions)
In fitting these models, we’ll need to make the initial checklist more complex:
as well as classifying a predictor as continuous or categorical, think clearly about whether it is random or fixed
think about the structure of the data; are all observations of the response variable independent of each other, or are they linked in some way? If they are linked, how do de account for that linkage in specifying and fitting a statistical model?
When there are random effects, assumptions can be more extensive. We’re usually interested in the fixed effects, and different fixed effects can have different assumptions. This is particularly the case with nested designs.
When there are correlated effects, we might need to be aware of additional assumptions.
Let’s start with an example we mentioned briefly in Chapter 12, and work through the data analysis. Allen and Marshall (2014) studied the effects of egg size on larval characteristics of a marine tubeworm. They allocated eggs from a group of large or small female tubeworms to ten jars for each egg size. These eggs were fertilized, and after the embryos hatched, the proportion of larvae that had settled and metamorphosed was recorded for each jar every three days for 15 days. The question here is about maternal effects (using size as a proxy for energy reserves available) on offspring performance (time to settle and initiate metamorphosis from larval form to adult form).
Start by following the steps we’ve used for other chapters, so you’re clear on the questions the details of the design, and the applicable statistical model.
We’ll start by reading in the data and taking a quick look at it.
## setting contr.sum globally: options(contrasts=c('contr.sum', 'contr.poly'))
## # A tibble: 10 × 5
## sizeclass jar day day1 perc
## <chr> <fct> <ord> <dbl> <dbl>
## 1 b 1 3 3 0.052
## 2 b 2 3 3 0.096
## 3 b 3 3 3 0.052
## 4 b 4 3 3 0.078
## 5 b 5 3 3 0.081
## 6 b 6 3 3 0.058
## 7 b 7 3 3 0.091
## 8 b 8 3 3 0.064
## 9 b 9 3 3 0.07
## 10 b 10 3 3 0.071
In this data file:
If you have concerns about the raw data, would those concerns be eased if you used log-transformed responses?
First analyse with day as categorical factor (using the predictor day)
Think of a couple of ways you could look at this predictor
You may (hopefully you did!) have noticed some quite different variances between groups. One way we could deal with this is to allow group-specific variances. For interest, let’s use this approach and see if this model fits better
Do your conclusions change?
Think about how you’d specify the model. Describe the random slopes and random intercepts models, and then fit them.
Is the random intercepts approach needed?
Now run as mixed effects random intercept with nlme - use day1 for continuous (not factor) and allow covariances to differ.
**Does this approach give us a model that fits better?
Kiffer et al. (2018) were interested in the nutritional value of leaves from native and introduced trees to “shredders” – invertebrate larvae in streams that help to break down organic material. They focused on one shredder species, the caddisfly Triplectides gracilis and its relationship between four native species and the blue gum Eucalyptus globulus, an introduced species grown in plantations. They exposed individual caddisflies to one of these species, and recorded their growth; their Figure 1 shows the experimental setup. They measured the tibial length of each larva at weekly intervals, and used a conversion equation to estimate the animal’s biomass at this time.
Which of these variables would you include:
Look at this list and decide which terms should be included
Now let’s get hands-on.
First, the housekeeping - get the data file. The original data are in supplementary file S3, which you can get by following the link above to the original paper. S3 is an xls file, and the experimental data are in the sheet Larvae growth. If you want to import it to R, you’ll need to skip the first two rows; the variable names are on row 3 and data start on row 4. Name the frame kiffer. Alternatively, we have it for you:
## Rows: 93 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): species
## dbl (6): subject, week1, week2, week3, week4, week5
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 7
## subject species week1 week2 week3 week4 week5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Styrax pohlii 0.04 -0.02 0.02 -0.62 -0.95
## 2 2 Styrax pohlii 0.03 0.01 -0.01 -0.05 0.02
## 3 3 Styrax pohlii -0.19 0 -0.17 0.18 NA
## 4 4 Styrax pohlii 0.3 0.19 0.31 0.19 NA
## 5 5 Styrax pohlii 0.17 -0.15 0.94 0.93 NA
## 6 6 Styrax pohlii 0.35 0 0.12 -0.08 0.18
Is there anything notable about this data set?
If you’re not sure, run a “traditional” OLS Repeated Measures analysis.
To make this analysis easier in R, let’s start by rearranging the original file with the weekly size measurements as variables to a longer file with one row for each weekly measurement
Still not sure? Use ezDesign as suggested or ezPrecis.
## Data frame dimensions: 93 rows, 7 columns
## type missing values min max
## subject factor 0 93 1 97
## species character 0 5 Eucalyptus globulus Styrax pohlii
## week1 numeric 0 61 -1.22 2.31
## week2 numeric 1 76 -1.65 3.52
## week3 numeric 10 73 -2.03 5.44
## week4 numeric 20 66 -2.23 7.94
## week5 numeric 31 59 -2.32 8.11
Outline any issues with each approach
Assuming that you identified a solution involving dropping subjects and one involving a different way of treating weeks, we’ll need to create a couple of other arrangements of the data.
Run the OLS version of the model, with time as a categorical predictor. What other checks would you do at this stage?
Do some exploratory data analysis
What do you conclude from your initial checks?
Here’s the split-plot version of this analysis
You might find it helpful to generate linear polynomials
Run the model as a linear mixed effects model
(R script provided). For easy comparison, use the kiffer1_na file, which has the same data.
## Numerical variables NOT centered on 0: weekno
## If in interactions, interpretation of lower order (e.g., main) effects difficult.
## Warning in mixed(growth ~ species * weekno + (weekno | subject), test_intercept
## = TRUE, : Cannot test intercept with Satterthwaite approximation.
## Mixed Model Anova Table (Type 3 tests, S-method)
##
## Model: growth ~ species * weekno + (weekno | subject)
## Data: kiffer1_na
## Effect df F p.value
## 1 species 4, 55.00 0.02 .999
## 2 weekno 1, 55.00 21.85 *** <.001
## 3 species:weekno 4, 55.00 12.03 *** <.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
The advantage of the mixed effects approach is that we can keep all insects in
An important decision is whether to fit a random intercepts model (as we did) or a random intercepts model, which we can do using the mixed model approach.
Let’s return to the Highfill et al. (2019) example from the Chapter 10 exercises. Recall that this was a substantial experiment designed to assess variation in susceptibility to psychosomatic drugs between genetic lines of flies, with the sexes incorporated into the design.
In the earlier discussion of this example, you should have thought about the physical structure of the experiment, in which food with and without the drug in question was offered to small groups of flies, where a group of flies was provided with capillary tubes with one of two solutions. The same group of flies was assessed three times, with feeding/recovery periods in between, to see if drug responses developed through time.
Let’s bring back the data file and look at it again. While we’re at it, we’ll make sure that Lines are treated as factors, rather than continuous predictors:
## # A tibble: 10 × 6
## Solution Sex Line vial Consumption Exposure
## <chr> <chr> <fct> <fct> <dbl> <dbl>
## 1 Cocaine F 41 1 34.6 1
## 2 Cocaine F 41 2 23.6 1
## 3 Cocaine F 41 3 22.6 1
## 4 Cocaine F 41 4 38.6 1
## 5 Cocaine F 41 5 11.6 1
## 6 Cocaine F 41 6 37.6 1
## 7 Cocaine F 41 7 54.6 1
## 8 Cocaine F 41 8 13.6 1
## 9 Cocaine F 41 9 59.6 1
## 10 Cocaine F 41 10 81.6 1
The design has vials as an experimental unit, which we might consider to be plots or subjects.
What are the between-subjects factors?
What are within-subject(s) factors?
Have a crack at specifying the model and fitting it to the data. The worked examples of 12.4 and 12.5 will give you some help, but don’t expect it to be easy :-(