The aim of these exercises is to help you get comfortable with running exploratory data analysis: taking a data set and a potential model, and evaluating whether the two are compatible. This activity will be part of most of the exercises in the later chapters. Particular things to look for are:
Does the data set have extreme values (outliers)?
How do you identify those values?
Are the observations legitimate or mistakes?
Are extreme values influential?
What statistical model is proposed for this data set?
What are the important assumptions?
How can you evaluate whether those assumptions are satisfied?
If they aren’t satisfied, what are your options?
Exploratory data analysis can be counter-intuitive. We emphasise its value as something to do before going ahead with fitting the model, assessing parameters, and testing hypotheses, but some assumptions can only be assessed by fitting the model. At this stage, we’re aiming to fit the model without looking at parameters of interest, confidence intervals, or P-values, but software doesn’t always make this easy. We’re trying to avoid the temptation of P-hacking or its equivalent: fitting a model that will give us desired results, rather than one that is appropriate for the particular data set. If you wanted to be super-rigorous about this step, you could write an R Markdown document in which the printed results of model-fitting are suppressed (using the chunk option results='hide', etc.) and you generate only diagnostics such as residual plots.
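One way to do this in plain R is to store the fitted model in an object and only ever pass that object to diagnostic functions. The sketch below is a minimal illustration rather than part of the original exercises. It uses R’s built-in cars data as a stand-in for your own data set, so the variable names dist and speed are assumptions of the example.

```r
# Stand-in example: R's built-in 'cars' data (stopping distance vs speed).
# Assigning the fit to an object prints nothing, so no estimates,
# confidence intervals, or P-values appear on screen.
fit <- lm(dist ~ speed, data = cars)

# Extract only what is needed for diagnostics.
res <- resid(fit)
fit_vals <- fitted(fit)

op <- par(mfrow = c(1, 2))
# Residuals vs fitted values: look for wedge shapes (unequal variance)
# or curvature (a poorly specified relationship).
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot of the residuals.
qqnorm(res)
qqline(res)
par(op)
```

As long as you avoid calling summary(fit) or printing fit, the parameter estimates stay out of sight until you have settled on a model.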
Continue with the Le Boeuf et al. (2000) elephant seal example from Chapter 4’s exercises.
For the linear model you’ve specified, what are the assumptions?
Open the data file and check those assumptions.
df <- read.csv("../data/leboeuf.csv")
head(df,10)
## male departwt distance FFAduration durationto durationfrom
## 1 Pop NA 534 31 18 11
## 2 Alt 973 755 89 9 8
## 3 Pro 977 1210 77 12 18
## 4 Hal 1121 NA NA NA NA
## 5 Blu NA 1297 76 19 25
## 6 Dua 996 1487 68 18 23
## 7 Rov 1100 2073 69 29 25
## 8 Ric 1068 2181 46 21 42
## 9 Ori 1097 NA NA NA NA
## 10 Jer 1199 NA NA NA NA
The data also include information on departure weight. Have a look to see if that variable might also be linked to foraging duration, and whether it might also be linked to distance travelled.
Hint: here’s your chance to use a scatterplot matrix.
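A scatterplot matrix can be drawn with base R’s pairs(). The sketch below is one possible approach, not a prescribed solution. To stay self-contained it uses only the ten rows printed by head() above, so the patterns it shows are not those of the full data set. The object names seals and vars are introduced here for illustration.

```r
# First ten rows of leboeuf.csv, copied from the head() output above.
seals <- data.frame(
  male        = c("Pop", "Alt", "Pro", "Hal", "Blu",
                  "Dua", "Rov", "Ric", "Ori", "Jer"),
  departwt    = c(NA, 973, 977, 1121, NA, 996, 1100, 1068, 1097, 1199),
  distance    = c(534, 755, 1210, NA, 1297, 1487, 2073, 2181, NA, NA),
  FFAduration = c(31, 89, 77, NA, 76, 68, 69, 46, NA, NA)
)
vars <- c("departwt", "distance", "FFAduration")

# Scatterplot matrix of the three variables; in each panel, animals with a
# missing value for either variable are simply not drawn.
pairs(seals[, vars])

# Missing values matter here: count the animals with all three variables.
n_complete <- sum(complete.cases(seals[, vars]))
n_complete
```

With the full file loaded as df, the equivalent call would be pairs(df[, vars]).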
Continue with the Peake and Quinn (1993) example from Chapter 4’s exercises, which examined how the number of individuals and the number of species (response variables) are related to mussel clump area (predictor variable).
For the linear model you’ve specified, what are the assumptions?
Open the data file and check those assumptions.
df <- read.csv("../data/peakquinn.csv")
head(df,10)
## area indiv species
## 1 516.00 18 3
## 2 469.06 60 7
## 3 462.25 57 6
## 4 938.60 100 8
## 5 1357.15 48 10
## 6 1773.66 118 9
## 7 1686.01 148 10
## 8 1786.29 214 11
## 9 3090.07 225 16
## 10 3980.12 283 9
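As a hedged sketch of how the checks might look, the code below assumes the model from Chapter 4 is a simple linear regression of number of individuals on clump area. It compares residual plots for the raw variables and for log-transformed variables, one of several options if the assumptions look doubtful. It uses only the ten rows printed above, and the names clumps, fit_raw, and fit_log are introduced for this illustration.

```r
# First ten rows of peakquinn.csv, copied from the head() output above.
clumps <- data.frame(
  area    = c(516.00, 469.06, 462.25, 938.60, 1357.15,
              1773.66, 1686.01, 1786.29, 3090.07, 3980.12),
  indiv   = c(18, 60, 57, 100, 48, 118, 148, 214, 225, 283),
  species = c(3, 7, 6, 8, 10, 9, 10, 11, 16, 9)
)

# Fit both versions without printing any estimates.
fit_raw <- lm(indiv ~ area, data = clumps)
fit_log <- lm(log10(indiv) ~ log10(area), data = clumps)

op <- par(mfrow = c(1, 2))
# Residuals vs fitted values, raw scale.
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "Fitted values", ylab = "Residuals", main = "Raw")
abline(h = 0, lty = 2)
# Residuals vs fitted values, log10 scale.
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values", ylab = "Residuals", main = "log10")
abline(h = 0, lty = 2)
par(op)
```

Swapping indiv for species in both formulas gives the same check for the second response variable.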
Continue with the Vosteen, Gershenzon, and Kunert (2016) study examining patterns of production of honeydew by different races of pea aphids (Acyrthosiphon pisum) from Chapter 4’s exercises. You should have described a model to fit to these data.
For the linear model you’ve specified, what are the assumptions?
Read in the data and check the assumptions.
df <- read.csv("../data/vosteen.csv")
head(df,10)
## clone_plant_combination honeydew clone plant hostplant
## 1 T_Vicia 1.08 T Vicia Universal
## 2 T_Vicia 2.21 T Vicia Universal
## 3 T_Vicia 2.63 T Vicia Universal
## 4 T_Vicia 1.63 T Vicia Universal
## 5 T_Vicia 3.51 T Vicia Universal
## 6 T_Vicia 2.53 T Vicia Universal
## 7 T_Vicia 2.92 T Vicia Universal
## 8 T_Vicia 0.98 T Vicia Universal
## 9 T_Vicia 2.39 T Vicia Universal
## 10 T_Vicia 2.05 T Vicia Universal
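For a model with categorical predictors, a common first check is whether the spread of the response is similar across groups and whether it changes with the group mean. The helper below, group_summary(), is a hypothetical function written for this illustration. Because the ten rows printed above all come from one group, the sketch demonstrates it on R’s built-in InsectSprays data as a stand-in, not on the aphid data.

```r
# Hypothetical helper: per-group sample size, mean, and standard deviation
# of one response variable.
group_summary <- function(data, response, group) {
  y <- data[[response]]
  g <- factor(data[[group]])
  data.frame(
    group = levels(g),
    n     = as.vector(tapply(y, g, function(v) sum(!is.na(v)))),
    mean  = as.vector(tapply(y, g, mean, na.rm = TRUE)),
    sd    = as.vector(tapply(y, g, sd, na.rm = TRUE))
  )
}

# Stand-in data: built-in InsectSprays (insect counts under six sprays).
spray_stats <- group_summary(InsectSprays, "count", "spray")

op <- par(mfrow = c(1, 2))
# Boxplots by group: compare spread and look for extreme values.
boxplot(count ~ spray, data = InsectSprays)
# Mean vs standard deviation: a rising trend suggests unequal variances.
plot(spray_stats$mean, spray_stats$sd,
     xlab = "Group mean", ylab = "Group SD")
par(op)

# With the exercise data loaded as df, the call would be:
# group_summary(df, "honeydew", "clone_plant_combination")
```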
In Chapter 4’s exercises, you also examined the work of Binning, Roche, and Layton (2013), who studied the effect of ectoparasitic isopods on the swimming ability of a tropical species of bream. They created four treatment groups in the laboratory: 8 unparasitized fish, 10 parasitized fish, 10 parasitized fish that had the parasites removed, and 10 unparasitized fish that had model parasites made of plastic added. They recorded the swimming speed (body lengths per second) and oxygen consumption (mg O2 per kg per hour) of each fish. The data are available from Dryad. Within Dryad, the dataset you want is “binning etal 2012 one way anova.txt”; in this dataset, SMR is standard metabolic rate (O2 consumption), AS is factorial aerobic scope, and Ucrit is swimming speed.
In Chapter 4’s exercises, you should have described a model to fit to these data.
What are the assumptions associated with that model?
Read in the data and check the assumptions.
df <- read.csv("../data/binning.csv")
head(df,10)
## Fish Treatment SMR MMR AS Ucrit
## 1 P10 P 110.20 535.80 4.86 3.79
## 2 P12 P 140.11 471.45 3.36 3.66
## 3 P27 P 135.84 573.08 4.22 2.47
## 4 P42 P 140.44 379.69 2.70 3.65
## 5 P15 P 120.54 561.93 4.66 3.52
## 6 P23 P 108.02 375.57 3.48 3.39
## 7 P26 P 119.23 534.75 4.49 3.74
## 8 P37 P 152.73 494.95 3.24 3.37
## 9 P72 P 100.86 429.73 4.26 3.36
## 10 P75 P 134.09 434.60 3.24 3.68
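Identifying extreme values within a group can be sketched with boxplot.stats(), which applies the same rule a boxplot uses to draw individual points. The example below is illustrative only. It looks at Ucrit for the ten fish printed above, all from treatment P, and the names ucrit_p, fish_p, and flagged are introduced here.

```r
# Ucrit (swimming speed) and fish IDs for the ten rows printed above,
# all from treatment "P".
fish_p  <- c("P10", "P12", "P27", "P42", "P15",
             "P23", "P26", "P37", "P72", "P75")
ucrit_p <- c(3.79, 3.66, 2.47, 3.65, 3.52, 3.39, 3.74, 3.37, 3.36, 3.68)

# boxplot.stats() flags values lying more than 1.5 times the hinge spread
# beyond the lower or upper hinge.
flagged <- boxplot.stats(ucrit_p)$out
flagged_fish <- fish_p[ucrit_p %in% flagged]
flagged_fish

op <- par(mfrow = c(1, 2))
boxplot(ucrit_p, ylab = "Ucrit (body lengths per second)")
# Normal Q-Q plot for the same values.
qqnorm(ucrit_p)
qqline(ucrit_p)
par(op)
```

A flagged value is only a prompt to check the record. Whether it is a mistake or a legitimate observation, and whether it is influential, still has to be judged separately.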
Kaufman et al. (2013) examined the neuroanatomy of a recently described species of sengi, which are small insectivorous mammals also known as elephant or jumping shrews. These animals are interesting, having been originally placed within the mammalian order Insectivora, along with shrews, hedgehogs, moles, etc., but this group is now known to be polyphyletic, and sengis are more appropriately grouped with elephants, dugongs, and hyraxes. They are in the order Macroscelidea, within the Afrotheria. The Afrotheria includes another order of small insectivores, the Tenrecoidea (tenrecs and golden moles). The Laurasiatheria also includes several families of small insectivores.
Small insectivores are generally thought to have small brain mass (when adjusted for overall body mass), but there has been some question of whether sengis fit this pattern, and Kaufman and colleagues were curious whether the new species, Rhynchocyon udzungwensis, fitted with other sengi. They assembled data from 56 small insectivores: 5 sengi, 14 other afrotherian species, and 37 laurasiatherians. For each species, they calculated brain mass (in mg) and total body mass (in g).
The data are all presented in Table 1 of the paper. We’ve extracted them from the paper into the file kaufman.csv.
In the exercises for this chapter, we’ll just think about brain size relative to body size, and we’ll pick this example up again in Chapter 8.
Load the data file, and look at the relationship between brain mass and body mass.
df <- read.csv("../data/kaufman.csv")
head(df,10)
## family genus species bodymass brainmass relation
## 1 Solenodontidae Solenodon paradoxus 672.0 4723 laurasiatherian
## 2 Tenrecidae Tenrec ecaudatus 852.0 2588 afrotherian
## 3 Tenrecidae Setifer setosus 237.0 1516 afrotherian
## 4 Tenrecidae Hemicentetes semispin 116.0 839 afrotherian
## 5 Tenrecidae Echinops telfairi 87.5 623 afrotherian
## 6 Tenrecidae Oryzorictes talpoides 44.2 580 afrotherian
## 7 Tenrecidae Microgale cowani 15.2 420 afrotherian
## 8 Tenrecidae Limnogale mergulus 92.0 1150 afrotherian
## 9 Tenrecidae Microgale dobsoni 31.9 557 afrotherian
## 10 Tenrecidae Microgale talazaci 48.2 766 afrotherian
## relation2
## 1 other insectivore
## 2 other insectivore
## 3 other insectivore
## 4 other insectivore
## 5 other insectivore
## 6 other insectivore
## 7 other insectivore
## 8 other insectivore
## 9 other insectivore
## 10 other insectivore
What kind of model are we intending to fit to these data?
Look at the relationship between the two variables. Are there any steps you’d recommend we take?
Note that the original researchers used a reduced major axis regression, as they considered both variables to be measured with error. See the discussion in Chapter 6 about whether to consider X random or fixed. For our purposes here, we’ll treat it as fixed.
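As a starting point, the relationship can be plotted on the raw scale and on log-log axes, followed by a residual plot from an ordinary least squares fit that treats body mass as fixed. This sketch is one possible first look, not the authors’ reduced major axis analysis. It uses only the ten rows printed above, and the names brains and fit_loglog are introduced for this illustration.

```r
# Body mass (g) and brain mass (mg) for the ten rows printed above.
brains <- data.frame(
  bodymass  = c(672.0, 852.0, 237.0, 116.0, 87.5,
                44.2, 15.2, 92.0, 31.9, 48.2),
  brainmass = c(4723, 2588, 1516, 839, 623, 580, 420, 1150, 557, 766)
)

op <- par(mfrow = c(1, 3))
# Raw scale: points tend to bunch near the origin when masses span
# orders of magnitude.
plot(brainmass ~ bodymass, data = brains, main = "Raw")
# Same data with both axes on a log scale.
plot(brainmass ~ bodymass, data = brains, log = "xy", main = "Log-log axes")

# Ordinary least squares on log10-transformed variables (X treated as fixed);
# the fit is stored without printing any estimates.
fit_loglog <- lm(log10(brainmass) ~ log10(bodymass), data = brains)
plot(fitted(fit_loglog), resid(fit_loglog),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals")
abline(h = 0, lty = 2)
par(op)
```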