Our primary objective for this case study is to go through a few full analyses for continuous outcomes. In practice, we would ideally prespecify an analysis plan rather than fitting all possible models.
library(tidyverse)
library(nlme)
library(geepack)
library(multcomp) # glht() function for linear combinations of parameters
library(ggcorrplot) # correlation heat maps
library(GGally) # scatterplot matrix
library(cowplot) # convenient for grids of plots
library(clubSandwich) # robust SE for LMM
pal2use <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pal2use2 <- c("#E69F00", "#0072B2")
These data are from a study of dental growth measurements of the distance (mm) from the center of the pituitary gland to the pteryomaxillary fissure. Measurements were obtained on 11 girls and 16 boys at ages 8, 10, 12, and 14.
(Reference: Pottho R. F. and Roy, S. W. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 1964;51:313-326)
data("Orthodont") # in package nlme
The dataset contains measurements on 27 children and includes the following 6 columns:
By default, the dataset has males as the reference group. We’ll change the reference group to females:
# relevel Sex so that the reference group is female
Orthodont$Sex <- relevel(Orthodont$Sex, ref = "Female")
Here are the first 6 rows of the dataset:
head(Orthodont)
## Grouped Data: distance ~ age | Subject
## distance age Subject Sex
## 1 26.0 8 M01 Male
## 2 25.0 10 M01 Male
## 3 29.0 12 M01 Male
## 4 31.0 14 M01 Male
## 5 21.5 8 M02 Male
## 6 22.5 10 M02 Male
This is called “long” format, in which each observation gets its own row. Sometimes it is convenient to have the data in “wide” format, in which each subject’s observations are all in a single row. Wide format can be used with a balanced design (collected at the same set of occasions). Fortunately, it’s not too difficult to switch between these formats in R (at least in simple cases like this one).
Let’s switch to wide format:
dental.wide <- data.frame(Orthodont) %>%
pivot_wider(names_from = age, names_prefix = "Age",
values_from = distance)
head(dental.wide)
## # A tibble: 6 x 6
## Subject Sex Age8 Age10 Age12 Age14
## <ord> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 M01 Male 26 25 29 31
## 2 M02 Male 21.5 22.5 23 26.5
## 3 M03 Male 23 22.5 24 27.5
## 4 M04 Male 25.5 27.5 26.5 27
## 5 M05 Male 20 23.5 22.5 26
## 6 M06 Male 24.5 25.5 27 28.5
And, just for practice, back to long:
dental.long <- dental.wide %>%
pivot_longer(Age8:Age14,
names_to = "age", names_prefix = "Age",
values_to = "distance") %>%
mutate(age = as.numeric(age)) %>%
dplyr::select(distance, age, Subject, Sex)
## make sure we got the same dataset back
all(Orthodont == dental.long)
## [1] TRUE
## here we will define a new variable age.8 = age - 8
dental.long$age.8 = dental.long$age - 8
We could consider several scientific questions for this dataset:
Let’s first check that the data are balanced and complete, as expected. This is very easy to do from the wide-format data:
any(is.na(dental.wide))
## [1] FALSE
From the long-format data, we could check that every subject ID has the same number of observations:
# Look at number of observations for each subject
table(dental.long$Subject)
##
## M16 M05 M02 M11 M07 M08 M03 M12 M13 M14 M09 M15 M06 M04 M01 M10 F10 F09
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## F06 F01 F05 F07 F02 F08 F03 F04 F11
## 4 4 4 4 4 4 4 4 4
# How many unique numbers of observations?
length(unique(table(dental.long$Subject)))
## [1] 1
Now we’ll move on to exploring longitudinal trajectories (individual and aggregate) and the covariance structure.
We can use a spaghetti plot to visualize all individual trajectories.
dental.long %>%
ggplot(aes(x=age, y=distance, group=Subject, col=Sex)) +
geom_line(aes(linetype=Sex)) +
geom_point() +
xlab("Age (Years)") + ylab("Length (mm)") +
ggtitle("Spaghetti Plot (Individuals' Dental Growth)") +
theme_classic(base_size = 14) +
scale_color_manual(values=pal2use2)
A few key takeaways from this plot:
dental.long %>%
mutate(Subject = as.character(Subject)) %>% # more natural ordering
ggplot(aes(x=age, y=distance, col=Sex)) +
geom_line() +
geom_point() +
xlab("Age (Years)") + ylab("Length (mm)") +
theme_classic(base_size = 14) +
facet_wrap(vars(Subject)) +
scale_color_manual(values=pal2use2)
From this plot, it’s clear that there may be some measurement error for male subject 9, since skull size (and dental length) typically do not decrease substantively with increasing age. A couple of other subjects have much smaller decreases in length between time points (e.g., female subject 1, male subject 1).
We can also plot average response profiles to see whether the mean trajectory differs between boys and girls.
## Response profiles
dental.summ <- dental.long %>%
group_by(Sex, age) %>%
summarise(Mean = mean(distance))
dental.summ %>%
ggplot(aes(x=age, y=Mean, group=Sex, col=Sex)) +
geom_line(aes(linetype=Sex)) +
geom_point() +
xlab("Age (Years)") + ylab("Average Length (mm)") +
ggtitle("Mean Response Profiles (Growth By Sex)") +
theme_classic(base_size = 14) +
scale_color_manual(values=pal2use2)
A few key takeaways from this plot:
Let’s turn to correlation structures. The scatterplot matrix below shows that correlation is high between both adjacent and non-adjacent pairs of time points (and the associations seem reasonably linear between each pair of time points).
## Scatterplot matrix (GGally package)
ggpairs(dental.wide[, c("Age8", "Age10", "Age12", "Age14")]) +
ggtitle("Dental Growth Correlation Structure") +
theme_bw(base_size = 14)
If we separate out by sex, though, the correlation is much higher for girls (0.8-0.95) than for boys (0.3-0.65). This will be relevant when we consider correlation structures.
## Correlation heatmaps
cormat.boys <- dental.wide %>%
filter(Sex == "Male") %>%
dplyr::select(Age8:Age14) %>%
cor() %>%
round(., 3)
cormat.boys
## Age8 Age10 Age12 Age14
## Age8 1.000 0.437 0.558 0.315
## Age10 0.437 1.000 0.387 0.631
## Age12 0.558 0.387 1.000 0.586
## Age14 0.315 0.631 0.586 1.000
cormat.girls <- dental.wide %>%
filter(Sex == "Female") %>%
dplyr::select(Age8:Age14) %>%
cor() %>%
round(., 3)
cormat.girls
## Age8 Age10 Age12 Age14
## Age8 1.000 0.830 0.862 0.841
## Age10 0.830 1.000 0.895 0.879
## Age12 0.862 0.895 1.000 0.948
## Age14 0.841 0.879 0.948 1.000
corplot.boys <- cormat.boys %>%
ggcorrplot(type = "lower") +
labs(title="Correlation Matrix for Boys") +
theme(legend.position = 'bottom')
corplot.girls <- cormat.girls %>%
ggcorrplot(type = "lower") +
labs(title="Correlation Matrix for Girls") +
theme(legend.position = 'bottom')
plot_grid(corplot.boys, corplot.girls, nrow=1)
Our first order of business is to choose the structure for the mean.
Arguably the most well-known mean model is a linear model, and we can start with a linear form of association here (despite seeing mild nonlinearity in the mean trajectory for boys). This model may be written: \[\text{E}[\text{Distance}_{ij} | \text{Male}_i, \text{Age}_{ij}] = \beta_0 + \beta_1\times (\text{Age}_{ij}-8) + \beta_2 \times \text{Male}_i + \beta_3 \times(\text{Age}_{ij}-8) \times \text{Male}_i\] where \(\text{Male}_i\) is 1 if the child is a male, 0 otherwise. We can fit this model using GEE with working independence (this is the default for the corstr
argument in the geeglm()
function). Before we go ahead and fit our model, it’s a good idea to make sure the data are ordered by id
and time
(geeglm()
assumes the data are correctly ordered):
dental.long <- dental.long[order(dental.long$Subject, dental.long$age),]
mod1.gee <- geeglm(distance ~ age.8 * Sex,
id = Subject, data = dental.long)
summary(mod1.gee)
##
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 21.20909 0.56043 1432.185 < 2e-16 ***
## age.8 0.47955 0.06313 57.697 3.05e-14 ***
## SexMale 1.40653 0.77380 3.304 0.0691 .
## age.8:SexMale 0.30483 0.11687 6.803 0.0091 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 4.905 1.015
## Number of clusters: 27 Maximum cluster size: 4
## Get predicted values for plot
dental.long$fitted.linear <- mod1.gee$fitted[,1]
## Plot fitted and observed values
ggplot(dental.long) +
geom_point(aes(x=age, y=distance)) +
geom_line(aes(x=age, y=fitted.linear), col = "blue") +
facet_wrap(vars(Sex), nrow=1)
Let’s take a quick look at these results. Here are the geeglm results:
tidy(mod1.gee, conf.int = T)
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 21.2 0.560 1432. 0 20.1 22.3
## 2 age.8 0.480 0.0631 57.7 3.05e-14 0.356 0.603
## 3 SexMale 1.41 0.774 3.30 6.91e- 2 -0.110 2.92
## 4 age.8:SexMale 0.305 0.117 6.80 9.10e- 3 0.0758 0.534
Here is code for fitting the model using different working correlation structures along with a comparison to using lm()
. As a reminder with lm()
, the point estimates will match those from using GEE with working independence, but the standard errors are based off assuming all observations are independent, which we would not expect to be valid in the setting of a longitudinal study (we expect correlation between measurements taken on the same individual):
m_ind <- geeglm(distance ~ age.8*Sex, id = Subject,
data = dental.long, corstr = "independence")
m_exc <- geeglm(distance ~ age.8*Sex, id = Subject,
data = dental.long, corstr = "exchangeable")
m_ar1 <- geeglm(distance ~ age.8*Sex, id = Subject,
data = dental.long, corstr = "ar1")
m_uns <- geeglm(distance ~ age.8*Sex, id = Subject,
data = dental.long, corstr = "unstructured")
m_ols <- lm(distance ~ age.8*Sex, data=dental.long)
We have already seen the output for working independence. The results from working exchangeable:
summary(m_exc)
##
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 21.2091 0.5604 1432.2 < 2e-16 ***
## age.8 0.4795 0.0631 57.7 3.1e-14 ***
## SexMale 1.4065 0.7738 3.3 0.0691 .
## age.8:SexMale 0.3048 0.1169 6.8 0.0091 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 4.91 1.01
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.618 0.131
## Number of clusters: 27 Maximum cluster size: 4
The estimated working correlation matrix is:
\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.62 & 1 & & \\ 0.62 & 0.62 & 1 & \\ 0.62 & 0.62 & 0.62 & 1 \end{bmatrix}\]
The results from working AR1:
summary(m_ar1)
##
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 21.1876 0.5946 1269.84 < 2e-16 ***
## age.8 0.4844 0.0631 58.98 1.6e-14 ***
## SexMale 1.6038 0.8280 3.75 0.053 .
## age.8:SexMale 0.2828 0.1238 5.22 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = ar1
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 4.92 1.01
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.759 0.0959
## Number of clusters: 27 Maximum cluster size: 4
The estimated working correlation matrix is:
\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.76 & 1 & & \\ 0.76^2 & 0.76 & 1 & \\ 0.76^3 & 0.76^2 & 0.76 & 1 \end{bmatrix}\]
The results from working unstructured:
summary(m_uns)
##
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject, corstr = "unstructured")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 21.2220 0.5532 1471.73 < 2e-16 ***
## age.8 0.4781 0.0639 56.02 7.2e-14 ***
## SexMale 1.4065 0.7622 3.41 0.0650 .
## age.8:SexMale 0.3100 0.1172 7.00 0.0082 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = unstructured
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 4.91 1.01
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha.1:2 0.501 0.1332
## alpha.1:3 0.736 0.1381
## alpha.1:4 0.515 0.1920
## alpha.2:3 0.555 0.2255
## alpha.2:4 0.621 0.0905
## alpha.3:4 0.779 0.1630
## Number of clusters: 27 Maximum cluster size: 4
The estimated working correlation matrix is:
\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.50 & 1 & & \\ 0.74 & 0.56 & 1 & \\ 0.52 & 0.62 & 0.78 & 1 \end{bmatrix}\]
Here is code for obtaining 95% confidence intervals for the regression parameters:
tidy(m_ind, conf.int = T)
## # A tibble: 4 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 21.2 0.560 1432. 0 20.1 22.3
## 2 age.8 0.480 0.0631 57.7 3.05e-14 0.356 0.603
## 3 SexMale 1.41 0.774 3.30 6.91e- 2 -0.110 2.92
## 4 age.8:SexMale 0.305 0.117 6.80 9.10e- 3 0.0758 0.534
If we want to obtain the 95% confidence interval for the slope for male children, \(\beta_1 + \beta_3\), we can use the glht()
function from the multcomp
package:
male_slope <- glht(m_ind, "age.8 + age.8:SexMale = 0")
summary(male_slope)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject, corstr = "independence")
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## age.8 + age.8:SexMale == 0 0.7844 0.0983 7.98 1.6e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(male_slope)
##
## Simultaneous Confidence Intervals
##
## Fit: geeglm(formula = distance ~ age.8 * Sex, data = dental.long,
## id = Subject, corstr = "independence")
##
## Quantile = 1.96
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## age.8 + age.8:SexMale == 0 0.784 0.592 0.977
Another way to model the mean response is to simply estimate the conditional mean separately at each time point for each group, called a response profile model. Notationally, that model may be written: \[\text{E}[\text{Distance}_{ij} | \text{Male}_i, \text{Age}_{ij}] = \beta_0 + \beta_1 \times I_{\text{Age}_{ij}=10} + \beta_2 \times I_{\text{Age}_{ij}=12} + \beta_3 I_{\text{Age}_{ij}=14} \] \[+ \beta_4 \times \text{Male}_i + \beta_5\times I_{\text{Age}_{ij}=10} \times \text{Male}_i + \beta_6 \times I_{\text{Age}_{ij}=12} \times \text{Male}_i + \beta_7 \times I_{\text{Age}_{ij}=14} \times \text{Male}_i\]
Functionally, the mean at each sex and age can be written as a sum of \(\beta\) coefficients. For example, for females age 8, the mean distance would be: \[E[\text{Distance}_{ij} | \text{Male}_i = 0, \text{Age}_{ij}=8 ] = \beta_0\] and for females age 14: \[E[\text{Distance}_{ij} |\text{Male}_i = 0, \text{Age}_{ij}=14] = \beta_0 + \beta_3\] and for males age 8: \[E[\text{Distance}_{ij} | \text{Male}_i = 1, \text{Age}_{ij}=8] = \beta_0 + \beta_4\] and for males age 14: \[E[\text{Distance}_{ij} | \text{Male}_i = 1, \text{Age}_{ij}=14] = \beta_0 + \beta_3 + \beta_4 + \beta_7\]
mod2.gee <- geeglm(distance ~ factor(age) * Sex,
id = Subject, data = dental.long)
summary(mod2.gee)
##
## Call:
## geeglm(formula = distance ~ factor(age) * Sex, data = dental.long,
## id = Subject)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 21.182 0.611 1202.78 < 2e-16 ***
## factor(age)10 1.045 0.343 9.30 0.0023 **
## factor(age)12 1.909 0.345 30.61 3.2e-08 ***
## factor(age)14 2.909 0.379 58.82 1.7e-14 ***
## SexMale 1.693 0.852 3.95 0.0468 *
## factor(age)10:SexMale -0.108 0.685 0.02 0.8747
## factor(age)12:SexMale 0.935 0.677 1.91 0.1674
## factor(age)14:SexMale 1.685 0.750 5.05 0.0247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 4.87 1.01
## Number of clusters: 27 Maximum cluster size: 4
Visually, we can compare the estimates for the mean response by age and sex from the response profile (i.e. categorical time; red squares) and the linear time models (blue line):
## Get predicted values for plot
dental.long$fitted.cat <- mod2.gee$fitted[,1]
## Plot fitted and observed values
ggplot(dental.long) +
geom_point(aes(x=age, y=distance), shape = 1) +
geom_line(aes(x=age, y=fitted.linear), col = "blue") +
geom_point(aes(x=age, y=fitted.cat), col = "red", size = 2, shape=15) +
facet_wrap(vars(Sex), nrow=1)
From the figure there doesn’t appear to be much difference between the estimates. It’s also possible to construct a hypothesis test to test whether there is evidence of a departure from a linear trend.
We can fit corresponding mean models with LMM; for brevity, we won’t fit all of them again, but will consider testing hypotheses about fixed effects and about random effects.
Let’s start with a linear mean model with an interaction between age and sex, and random intercepts only. Notationally, this is:
\[\text{Dist}_{ij} = \beta_0 + \beta_1 \times (\text{Age}_{ij}-8) + \beta_2 \times \text{Male}_{i} + \beta_3 \times (\text{Age}_{ij} - 8) \times \text{Male}_{i} + b_i\]
So the sex- and subject-specific models are \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 0] = (\beta_0 + b_i) + \beta_1 \times (\text{Age}_{ij}-8)\] if child \(i\) is a girl, and \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 1] = (\beta_0 + \beta_2 + b_i) + (\beta_1 + \beta_3) \times (\text{Age}_{ij}-8)\] if child \(i\) is a boy.
## Linear mean model
# random intercept
mod1.ri <- lme(fixed = distance ~ age.8 * Sex,
random = ~ 1 | Subject,
data = dental.long)
summary(mod1.ri)
## Linear mixed-effects model fit by REML
## Data: dental.long
## AIC BIC logLik
## 446 462 -217
##
## Random effects:
## Formula: ~1 | Subject
## (Intercept) Residual
## StdDev: 1.82 1.39
##
## Fixed effects: distance ~ age.8 * Sex
## Value Std.Error DF t-value p-value
## (Intercept) 21.21 0.650 79 32.6 0.0000
## age.8 0.48 0.093 79 5.1 0.0000
## SexMale 1.41 0.844 25 1.7 0.1081
## age.8:SexMale 0.30 0.121 79 2.5 0.0141
## Correlation:
## (Intr) age.8 SexMal
## age.8 -0.432
## SexMale -0.770 0.332
## age.8:SexMale 0.332 -0.770 -0.432
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.5980 -0.4546 0.0158 0.5024 3.6862
##
## Number of Observations: 108
## Number of Groups: 27
intervals(mod1.ri)
## Approximate 95% confidence intervals
##
## Fixed effects:
## lower est. upper
## (Intercept) 19.9158 21.209 22.502
## age.8 0.2935 0.480 0.666
## SexMale -0.3318 1.407 3.145
## age.8:SexMale 0.0631 0.305 0.547
## attr(,"label")
## [1] "Fixed effects:"
##
## Random Effects:
## Level: Subject
## lower est. upper
## sd((Intercept)) 1.32 1.82 2.5
##
## Within-group standard error:
## lower est. upper
## 1.19 1.39 1.62
The systematic part of the model is:
\[\widehat{\text{Dist}}_{ij} = 21.21 + 0.48 \times (\text{Age}_{ij} - 8) + 1.41 \times \text{Male}_i + 0.30 \times (\text{Age}_{ij} - 8) \times \text{Male}_i\]
As noted in the lecture, these coefficients may be interpreted marginally or conditionally, since marginal and conditional associations match in linear models. So, both of the following work:
We also see that the estimated variance of the random intercepts is \[\widehat{\text{Var}}(\epsilon) = \hat{\sigma}^2 = 1.39^2 = 1.93\] and the estimated variance of the errors \(\epsilon\) is \[\widehat{\text{Var}}(b_i) = \hat{\tau}^2 = 1.82^2 = 3.31,\] leading to an estimate of \[\widehat{\text{ICC}} = \frac{\hat{\tau}^2}{\hat{\sigma}^2 + \hat{\tau}^2} = \frac{3.31}{1.93+3.31} = 0.632\] for the intra-class correlation coefficient. Remember that we expected this to match the estimated correlation under the exchangeable model from the GEE section – and, up to rounding error, it does!
We can extract different fitted values and residuals from an LMM, depending on the level of grouping of interest. We’ll focus on fitted values for this document; you can explore residual analysis further using the textbook resources suggested in the lectures. “Level 0” refers to the fitted values or residuals using the fixed effects only. The matrix of results below makes this obvious: the fitted values are the same regardless of subject.
# Level 0 fitted values
fixed.pred <- fitted(mod1.ri, level=0)
head(fixed.pred)
## M16 M16 M16 M16 M05 M05
## 22.6 24.2 25.8 27.3 22.6 24.2
head(matrix(fixed.pred, ncol=4, byrow=T))
## [,1] [,2] [,3] [,4]
## [1,] 22.6 24.2 25.8 27.3
## [2,] 22.6 24.2 25.8 27.3
## [3,] 22.6 24.2 25.8 27.3
## [4,] 22.6 24.2 25.8 27.3
## [5,] 22.6 24.2 25.8 27.3
## [6,] 22.6 24.2 25.8 27.3
In contrast, the “level 1” fitted values incorporate the random effects, so they will differ between subjects (by a constant, up to occasional rounding differences, since this is a random intercept model). Notice, for example, that the predictions in row 6 are always 0.1 greater than the predictions in row 5.
# Level 1 fitted values
randint.pred <- fitted(mod1.ri, level=1)
head(randint.pred)
## M16 M16 M16 M16 M05 M05
## 20.9 22.5 24.0 25.6 20.9 22.5
head(matrix(randint.pred, ncol=4, byrow=T))
## [,1] [,2] [,3] [,4]
## [1,] 20.9 22.5 24.0 25.6
## [2,] 20.9 22.5 24.0 25.6
## [3,] 21.2 22.8 24.4 25.9
## [4,] 21.4 23.0 24.6 26.1
## [5,] 21.6 23.1 24.7 26.3
## [6,] 21.7 23.2 24.8 26.4
We can see the level 1 predictions in these plots:
dental.long$fitted.lme <- fitted(mod1.ri)
# All subject-specific trajectories
dental.long %>%
ggplot() +
geom_line(aes(x=age, y=distance, col=Sex)) +
geom_point(aes(x=age, y=distance, col=Sex)) +
geom_line(aes(x=age, y=fitted.lme), col="black", lty = 2, size = 1) +
xlab("Age (Years)") + ylab("Distance (mm)") +
theme_classic(base_size = 14) +
facet_wrap(vars(Subject), nrow=4) +
scale_color_manual(values = pal2use2) +
theme(legend.position = "right")
# Zooming in on two subjects
dental.long %>%
filter(Subject %in% c("M10", "M11")) %>%
ggplot() +
geom_line(aes(x=age, y=distance, col=Subject)) +
geom_point(aes(x=age, y=distance, col=Subject)) +
geom_line(aes(x=age, y=fitted.lme, col=Subject), lty = 2) +
xlab("Age (Years)") + ylab("Distance (mm)") +
ggtitle("Comparison Between Two Male Subjects") +
theme_classic(base_size = 14) +
theme(legend.position = "right") +
scale_color_manual(values = pal2use2)
We could also allow slopes to vary by subject. Notationally, this model is \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i, b_i] = \beta_0 + \beta_1 \times (\text{Age}_{ij} - 8) + \beta_2 \times \text{Male}_{i} + \beta_3 \times (\text{Age}_{ij} - 8) \times \text{Male}_{i} + b_{i0} + b_{i1} \times (\text{Age}_{ij} - 8)\] Now the sex- and subject-specific models are \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 0, b_i] = (\beta_0 + b_{i0}) + (\beta_1 + b_{i1}) \times (\text{Age}_{ij} - 8)\] if child \(i\) is a girl, and
\[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 1, b_i] = (\beta_0 + \beta_2 + b_{i0}) + ( \beta_1 + \beta_3 + b_{i1}) \times (\text{Age}_{ij} - 8)\] if child \(i\) is a boy.
# random intercept, random slope
mod1.rs <- lme(fixed = distance ~ age.8 * Sex,
random = ~ 1 + age | Subject,
data = dental.long)
summary(mod1.rs)
## Linear mixed-effects model fit by REML
## Data: dental.long
## AIC BIC logLik
## 449 470 -216
##
## Random effects:
## Formula: ~1 + age | Subject
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 2.41 (Intr)
## age 0.18 -0.668
## Residual 1.31
##
## Fixed effects: distance ~ age.8 * Sex
## Value Std.Error DF t-value p-value
## (Intercept) 21.21 0.635 79 33.4 0.0000
## age.8 0.48 0.104 79 4.6 0.0000
## SexMale 1.41 0.825 25 1.7 0.1006
## age.8:SexMale 0.30 0.135 79 2.3 0.0264
## Correlation:
## (Intr) age.8 SexMal
## age.8 -0.396
## SexMale -0.770 0.305
## age.8:SexMale 0.305 -0.770 -0.396
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.1681 -0.3859 0.0071 0.4452 3.8495
##
## Number of Observations: 108
## Number of Groups: 27
Again, we can extract level 0 fitted values (these are the same as above):
# Level 0 fitted values
fixed.pred.2 <- fitted(mod1.rs, level=0)
head(matrix(fixed.pred.2, ncol=4, byrow=T))
## [,1] [,2] [,3] [,4]
## [1,] 22.6 24.2 25.8 27.3
## [2,] 22.6 24.2 25.8 27.3
## [3,] 22.6 24.2 25.8 27.3
## [4,] 22.6 24.2 25.8 27.3
## [5,] 22.6 24.2 25.8 27.3
## [6,] 22.6 24.2 25.8 27.3
or the level 1 fitted values, which no longer differ by a constant between subjects:
# Level 1 fitted values
randsl.pred <- fitted(mod1.rs, level=1)
head(randsl.pred)
## M16 M16 M16 M16 M05 M05
## 21.1 22.5 23.9 25.3 20.9 22.5
head(matrix(randsl.pred, ncol=4, byrow=T))
## [,1] [,2] [,3] [,4]
## [1,] 21.1 22.5 23.9 25.3
## [2,] 20.9 22.5 24.0 25.6
## [3,] 21.3 22.8 24.3 25.8
## [4,] 21.8 23.1 24.4 25.7
## [5,] 21.6 23.1 24.7 26.2
## [6,] 22.0 23.3 24.6 26.0
Plotting the two models (random intercept, random intercept and slope), we observe results that are similar but not identical.
dental.long$fitted.lme.rs <- fitted(mod1.rs)
dental.long %>%
ggplot() +
geom_line(aes(x=age, y=distance, col=Sex)) +
geom_point(aes(x=age, y=distance, col=Sex)) +
geom_line(aes(x=age, y=fitted.lme.rs), col="black", lty = 2) +
xlab("Age (Years)") + ylab("Distance (mm)") +
theme_classic(base_size = 14) +
facet_wrap(vars(Subject), nrow=4) +
theme(legend.position = "right") +
scale_color_manual(values=pal2use2)
The black lines in the zoomed-in plot correspond to the random intercepts and slopes model. For subject M10, the random slope is slightly positive (0.05) and the line is slightly steeper; for subject M11, the random slope is slightly negative (-0.14) and the line is less steep.
coef(mod1.rs)[c("M10", "M11"), ]
## (Intercept) age.8 SexMale age.8:SexMale age
## M10 24.7 0.48 1.41 0.305 0.0507
## M11 21.5 0.48 1.41 0.305 -0.1405
dental.long %>%
filter(Subject %in% c("M10", "M11")) %>%
ggplot() +
geom_line(aes(x=age, y=distance, col=Subject)) +
geom_point(aes(x=age, y=distance, col=Subject)) +
geom_line(aes(x=age, y=fitted.lme, col=Subject), lty = 3) +
geom_line(aes(x=age, y=fitted.lme.rs, group=Subject), col="black", size=1.25, lty = 2) +
xlab("Age (Years)") + ylab("Distance (mm)") +
ggtitle("Comparison Between Two Male Subjects") +
theme_classic(base_size = 14) +
theme(legend.position = "right")
To use a likelihood ratio test to formally test whether the model with random slopes is significantly better than the model with only random intercepts (i.e., \(D_{12} = D_{22} = 0\) in the covariance matrix of the random effects), we could do the following.
# Is the correlation structure better with random slopes?
# Doesn't look like it (p=0.56, AIC and BIC are slightly lower for RI)
anova(mod1.rs, mod1.ri)
## Model df AIC BIC logLik Test L.Ratio p-value
## mod1.rs 1 8 449 470 -216
## mod1.ri 2 6 446 462 -217 1 vs 2 1.18 0.556
The slightly lower AIC and BIC for the random intercept model, along with the nonsignificant p-value, suggest that the random intercept model is sufficient (an exchangeable correlation structure is reasonable).
We noticed earlier that the correlation is much higher for girls than boys, contrary to our assumed correlation structure which is equal between sexes. We could use robust standard errors to avoid any potential problems due to misspecification such as this.
Here are the original inference results, which we’ve seen already:
## Original CIs
orig.tab <- cbind(summary(mod1.ri)$tTable, intervals(mod1.ri)$fixed)
round(orig.tab[, c("Value", "p-value", "lower", "upper")], 3)
## Value p-value lower upper
## (Intercept) 21.209 0.000 19.916 22.502
## age.8 0.480 0.000 0.293 0.666
## SexMale 1.407 0.108 -0.332 3.145
## age.8:SexMale 0.305 0.014 0.063 0.547
Using robust SEs, we find the following:
# clubSandwich
## Robust standard errors (random intercept model)
sig <- vcovCR(mod1.ri, type = "CR2")
summ <- coef_test(mod1.ri, sig)
cis <- conf_int(mod1.ri, sig)
robtab <- merge(summ[, c("Coef", "beta", "p_Satt")],
cis[, c("Coef", "CI_L", "CI_U")])
rownames(robtab) <- robtab$Coef
robtab$Coef <- NULL
round(robtab[c(1,2,4,3), ], 3)
## beta p_Satt CI_L CI_U
## (Intercept) 21.209 0.000 19.899 22.519
## age.8 0.480 0.000 0.332 0.627
## SexMale 1.407 0.095 -0.266 3.079
## age.8:SexMale 0.305 0.020 0.053 0.557
Notice that the estimates are the same; the confidence intervals may be narrower (here, age and sex) or wider (here, the interaction term), and p-values may be correspondingly lower or higher, with robust SEs compared to model-based SEs.
Combinations of fixed effects that may be expressed as nested models are possible to test using likelihood ratio tests. If we wanted to formally test whether, e.g., a response profile model fits significantly better than a linear mean model, we’d express them as nested models:
# use maximum likelihood estimation to compare fixed effects
# linear model
modLIN.ri <- lme(fixed = distance ~ age * Sex,
method = "ML",
random = ~ 1 | Subject,
data = dental.long)
# response profile model
# note we add only two of the four ages because any two points can be fit with a line
modCAT.ri <- lme(fixed = distance ~ age * Sex + I(age == 10) * Sex + I(age == 12) * Sex,
method = "ML",
random = ~ 1 | Subject,
data = dental.long)
# LRT
anova(modLIN.ri, modCAT.ri)
## Model df AIC BIC logLik Test L.Ratio p-value
## modLIN.ri 1 6 441 457 -214
## modCAT.ri 2 10 447 473 -213 1 vs 2 2.01 0.735
We find that the response profile model is not significantly better (unsurprising given the plots we saw in the GEE section).