Goal of This Case Study

Our primary objective for this case study is to go through a few full analyses for continuous outcomes. In practice, we would ideally prespecify an analysis plan rather than fitting all possible models.

Packages

library(tidyverse)
library(broom)        # tidy() for model summaries (not attached by library(tidyverse))
library(nlme)
library(geepack)
library(multcomp)     # glht() function for linear combinations of parameters
library(ggcorrplot)   # correlation heat maps 
library(GGally)       # scatterplot matrix 
library(cowplot)      # convenient for grids of plots 
library(clubSandwich) # robust SE for LMM

pal2use <- c("#000000", "#E69F00", "#56B4E9", "#009E73", 
             "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pal2use2 <- c("#E69F00", "#0072B2")

The Dataset

These data are from a study of dental growth, measuring the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure. Measurements were obtained on 11 girls and 16 boys at ages 8, 10, 12, and 14.

(Reference: Potthoff, R. F. and Roy, S. N. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 1964;51:313-326)

data("Orthodont") # in package nlme

The dataset contains measurements on 27 children and includes the following 4 columns:

  1. Subject: subject ID
  2. Sex: child’s sex (recorded as binary, Male or Female)
  3. distance: Distance in mm from the pituitary gland to the pterygomaxillary fissure
  4. age: child’s age (8, 10, 12, or 14 years)

By default, the dataset has males as the reference group. We’ll change the reference group to females:

# relevel Sex so that the reference group is female
Orthodont$Sex <- relevel(Orthodont$Sex, ref = "Female")

Here are the first 6 rows of the dataset:

head(Orthodont)
## Grouped Data: distance ~ age | Subject
##   distance age Subject  Sex
## 1     26.0   8     M01 Male
## 2     25.0  10     M01 Male
## 3     29.0  12     M01 Male
## 4     31.0  14     M01 Male
## 5     21.5   8     M02 Male
## 6     22.5  10     M02 Male

This is called “long” format, in which each observation gets its own row. Sometimes it is convenient to have the data in “wide” format, in which each subject’s observations are all in a single row. Wide format can be used with a balanced design (collected at the same set of occasions). Fortunately, it’s not too difficult to switch between these formats in R (at least in simple cases like this one).

Let’s switch to wide format:

dental.wide <- data.frame(Orthodont) %>% 
  pivot_wider(names_from = age, names_prefix = "Age", 
              values_from = distance)
head(dental.wide)
## # A tibble: 6 x 6
##   Subject Sex    Age8 Age10 Age12 Age14
##   <ord>   <fct> <dbl> <dbl> <dbl> <dbl>
## 1 M01     Male   26    25    29    31  
## 2 M02     Male   21.5  22.5  23    26.5
## 3 M03     Male   23    22.5  24    27.5
## 4 M04     Male   25.5  27.5  26.5  27  
## 5 M05     Male   20    23.5  22.5  26  
## 6 M06     Male   24.5  25.5  27    28.5

And, just for practice, back to long:

dental.long <- dental.wide %>% 
  pivot_longer(Age8:Age14, 
               names_to = "age", names_prefix = "Age", 
               values_to = "distance") %>% 
  mutate(age = as.numeric(age)) %>% 
  dplyr::select(distance, age, Subject, Sex)

## make sure we got the same dataset back 
all(Orthodont == dental.long)
## [1] TRUE
## here we will define a new variable age.8 = age - 8
dental.long$age.8 = dental.long$age - 8

Scientific Question

We could consider several scientific questions for this dataset:

  1. Estimate the average growth curve among all children
  2. Estimate the growth curve for individual children
  3. Characterize the degree of heterogeneity across children
  4. Identify factors that predict growth (e.g., sex)

Exploratory Analysis

Let’s first check that the data are balanced and complete, as expected. This is very easy to do from the wide-format data:

any(is.na(dental.wide))
## [1] FALSE

From the long-format data, we could check that every subject ID has the same number of observations:

# Look at number of observations for each subject 
table(dental.long$Subject) 
## 
## M16 M05 M02 M11 M07 M08 M03 M12 M13 M14 M09 M15 M06 M04 M01 M10 F10 F09 
##   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
## F06 F01 F05 F07 F02 F08 F03 F04 F11 
##   4   4   4   4   4   4   4   4   4
# How many unique numbers of observations? 
length(unique(table(dental.long$Subject)))
## [1] 1

Now we’ll move on to exploring longitudinal trajectories (individual and aggregate) and the covariance structure.

We can use a spaghetti plot to visualize all individual trajectories.

dental.long %>% 
  ggplot(aes(x=age, y=distance, group=Subject, col=Sex)) + 
  geom_line(aes(linetype=Sex)) + 
  geom_point() + 
  xlab("Age (Years)") + ylab("Length (mm)") + 
  ggtitle("Spaghetti Plot (Individuals' Dental Growth)") + 
  theme_classic(base_size = 14) + 
  scale_color_manual(values=pal2use2)

A few key takeaways from this plot:

  1. Most of the boys’ trajectories are above most of the girls’ (although there is some overlap).
  2. The lines don’t cross very often (especially among girls): those who start with longer dental lengths tend to maintain their longer dental lengths (this is called tracking).
  3. Outliers: One or two of the boys have high variability in their dental length. We can examine further by making separate plots for each subject, since this is a very small study:

dental.long %>% 
  mutate(Subject = as.character(Subject)) %>% # more natural ordering 
  ggplot(aes(x=age, y=distance, col=Sex)) + 
  geom_line() + 
  geom_point() + 
  xlab("Age (Years)") + ylab("Length (mm)") + 
  theme_classic(base_size = 14) + 
  facet_wrap(vars(Subject)) + 
  scale_color_manual(values=pal2use2)

From this plot, it’s clear that there may be some measurement error for male subject 9, since skull size (and dental length) typically does not decrease substantially with increasing age. A couple of other subjects show much smaller decreases in length between time points (e.g., female subject 1, male subject 1).

We can also plot average response profiles to see whether the mean trajectory differs between boys and girls.

## Response profiles 
dental.summ <- dental.long %>% 
  group_by(Sex, age) %>% 
  summarise(Mean = mean(distance))

dental.summ %>% 
  ggplot(aes(x=age, y=Mean, group=Sex, col=Sex)) + 
  geom_line(aes(linetype=Sex)) + 
  geom_point() + 
  xlab("Age (Years)") + ylab("Average Length (mm)") + 
  ggtitle("Mean Response Profiles (Growth By Sex)")  + 
  theme_classic(base_size = 14) + 
  scale_color_manual(values=pal2use2)

A few key takeaways from this plot:

  1. Dental length increases over time for both boys and girls.
  2. Boys have a larger average dental length at every age.
  3. The average increase between age 8 and age 14 appears to be larger for boys than for girls.
  4. The trajectory in boys does not appear to be quite linear - the increase from age 8 to 10 is smaller than any of the subsequent increases.

Let’s turn to correlation structures. The scatterplot matrix below shows that correlation is high between both adjacent and non-adjacent pairs of time points (and the associations seem reasonably linear between each pair of time points).

## Scatterplot matrix (GGally package)
ggpairs(dental.wide[, c("Age8", "Age10", "Age12", "Age14")]) + 
  ggtitle("Dental Growth Correlation Structure")  + 
  theme_bw(base_size = 14) 

If we separate out by sex, though, the correlation is much higher for girls (0.8-0.95) than for boys (0.3-0.65). This will be relevant when we consider correlation structures.

## Correlation heatmaps
cormat.boys <- dental.wide %>% 
  filter(Sex == "Male") %>% 
  dplyr::select(Age8:Age14) %>% 
  cor() %>% 
  round(., 3) 
cormat.boys 
##        Age8 Age10 Age12 Age14
## Age8  1.000 0.437 0.558 0.315
## Age10 0.437 1.000 0.387 0.631
## Age12 0.558 0.387 1.000 0.586
## Age14 0.315 0.631 0.586 1.000
cormat.girls <- dental.wide %>% 
  filter(Sex == "Female") %>% 
  dplyr::select(Age8:Age14) %>% 
  cor() %>% 
  round(., 3)
cormat.girls 
##        Age8 Age10 Age12 Age14
## Age8  1.000 0.830 0.862 0.841
## Age10 0.830 1.000 0.895 0.879
## Age12 0.862 0.895 1.000 0.948
## Age14 0.841 0.879 0.948 1.000
corplot.boys <- cormat.boys %>% 
  ggcorrplot(type = "lower") + 
  labs(title="Correlation Matrix for Boys") + 
  theme(legend.position = 'bottom')
corplot.girls <- cormat.girls %>% 
  ggcorrplot(type = "lower") + 
  labs(title="Correlation Matrix for Girls") + 
  theme(legend.position = 'bottom')

plot_grid(corplot.boys, corplot.girls, nrow=1)

General Linear Models

Mean Models

Our first order of business is to choose the structure for the mean.

Linear

Arguably the most well-known mean model is a linear model, and we can start with a linear form of association here (despite seeing mild nonlinearity in the mean trajectory for boys). This model may be written: \[\text{E}[\text{Distance}_{ij} | \text{Male}_i, \text{Age}_{ij}] = \beta_0 + \beta_1\times (\text{Age}_{ij}-8) + \beta_2 \times \text{Male}_i + \beta_3 \times(\text{Age}_{ij}-8) \times \text{Male}_i\] where \(\text{Male}_i\) is 1 if the child is a male, 0 otherwise. We can fit this model using GEE with working independence (this is the default for the corstr argument in the geeglm() function). Before we go ahead and fit our model, it’s a good idea to make sure the data are ordered by id and time (geeglm() assumes the data are correctly ordered):

dental.long <- dental.long[order(dental.long$Subject, dental.long$age),]

mod1.gee <- geeglm(distance ~ age.8 * Sex, 
               id = Subject, data = dental.long)
summary(mod1.gee)
## 
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject)
## 
##  Coefficients:
##               Estimate  Std.err     Wald Pr(>|W|)    
## (Intercept)   21.20909  0.56043 1432.185  < 2e-16 ***
## age.8          0.47955  0.06313   57.697 3.05e-14 ***
## SexMale        1.40653  0.77380    3.304   0.0691 .  
## age.8:SexMale  0.30483  0.11687    6.803   0.0091 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    4.905   1.015
## Number of clusters:   27  Maximum cluster size: 4
## Get predicted values for plot 
dental.long$fitted.linear <- mod1.gee$fitted[,1]

## Plot fitted and observed values 
ggplot(dental.long) + 
  geom_point(aes(x=age, y=distance)) + 
  geom_line(aes(x=age, y=fitted.linear), col = "blue") + 
  facet_wrap(vars(Sex), nrow=1)

Let’s take a quick look at these results. Here are the geeglm results:

tidy(mod1.gee, conf.int = T)
## # A tibble: 4 x 7
##   term          estimate std.error statistic  p.value conf.low conf.high
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)     21.2      0.560    1432.   0         20.1       22.3  
## 2 age.8            0.480    0.0631     57.7  3.05e-14   0.356      0.603
## 3 SexMale          1.41     0.774       3.30 6.91e- 2  -0.110      2.92 
## 4 age.8:SexMale    0.305    0.117       6.80 9.10e- 3   0.0758     0.534

Various Working Correlations

Here is code for fitting the model using several different working correlation structures, along with a comparison to lm(). As a reminder, the point estimates from lm() will match those from GEE with working independence, but the lm() standard errors are based on the assumption that all observations are independent, which we would not expect to hold in a longitudinal study (we expect correlation between measurements taken on the same individual):

m_ind <- geeglm(distance ~ age.8*Sex, id = Subject,
                data = dental.long, corstr = "independence")
m_exc <- geeglm(distance ~ age.8*Sex, id = Subject,
                data = dental.long, corstr = "exchangeable")
m_ar1 <- geeglm(distance ~ age.8*Sex, id = Subject,
                data = dental.long, corstr = "ar1")
m_uns <- geeglm(distance ~ age.8*Sex, id = Subject,
                data = dental.long, corstr = "unstructured")

m_ols <- lm(distance ~ age.8*Sex, data=dental.long)
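
As a quick numerical check of that claim, here is a small sketch that lines up the point estimates and standard errors from the working-independence GEE and the lm() fit (the GEE standard errors are the robust sandwich ones reported by geeglm()):

# point estimates should agree exactly; the SEs differ because lm() assumes
# all 108 observations are independent
cbind(est.gee = coef(m_ind), est.ols = coef(m_ols),
      se.gee  = summary(m_ind)$coefficients[, "Std.err"],
      se.ols  = summary(m_ols)$coefficients[, "Std. Error"])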

We have already seen the output for working independence. The results from working exchangeable:

summary(m_exc)
## 
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject, corstr = "exchangeable")
## 
##  Coefficients:
##               Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)    21.2091  0.5604 1432.2  < 2e-16 ***
## age.8           0.4795  0.0631   57.7  3.1e-14 ***
## SexMale         1.4065  0.7738    3.3   0.0691 .  
## age.8:SexMale   0.3048  0.1169    6.8   0.0091 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     4.91    1.01
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.618   0.131
## Number of clusters:   27  Maximum cluster size: 4

The estimated working correlation matrix is:

\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.62 & 1 & & \\ 0.62 & 0.62 & 1 & \\ 0.62 & 0.62 & 0.62 & 1 \end{bmatrix}\]
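
As a small sketch, this matrix can be reconstructed in R from the single fitted correlation parameter (assuming, as for geepack fits, that the estimate is stored in the geese component of the model object):

# exchangeable: the same correlation for every pair of occasions
alpha.exc <- m_exc$geese$alpha
R.exc <- matrix(alpha.exc, nrow = 4, ncol = 4)
diag(R.exc) <- 1
round(R.exc, 2)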

The results from working AR1:

summary(m_ar1)
## 
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject, corstr = "ar1")
## 
##  Coefficients:
##               Estimate Std.err    Wald Pr(>|W|)    
## (Intercept)    21.1876  0.5946 1269.84  < 2e-16 ***
## age.8           0.4844  0.0631   58.98  1.6e-14 ***
## SexMale         1.6038  0.8280    3.75    0.053 .  
## age.8:SexMale   0.2828  0.1238    5.22    0.022 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     4.92    1.01
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.759  0.0959
## Number of clusters:   27  Maximum cluster size: 4

The estimated working correlation matrix is:

\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.76 & 1 & & \\ 0.76^2 & 0.76 & 1 & \\ 0.76^3 & 0.76^2 & 0.76 & 1 \end{bmatrix}\]
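
Again as a sketch, the AR1 matrix can be rebuilt by raising the single estimated correlation to the power of the lag between occasions:

# AR1: correlation decays geometrically with lag
alpha.ar1 <- m_ar1$geese$alpha
R.ar1 <- alpha.ar1^abs(outer(1:4, 1:4, "-"))
round(R.ar1, 2)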

The results from working unstructured:

summary(m_uns)
## 
## Call:
## geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject, corstr = "unstructured")
## 
##  Coefficients:
##               Estimate Std.err    Wald Pr(>|W|)    
## (Intercept)    21.2220  0.5532 1471.73  < 2e-16 ***
## age.8           0.4781  0.0639   56.02  7.2e-14 ***
## SexMale         1.4065  0.7622    3.41   0.0650 .  
## age.8:SexMale   0.3100  0.1172    7.00   0.0082 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = unstructured 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     4.91    1.01
##   Link = identity 
## 
## Estimated Correlation Parameters:
##           Estimate Std.err
## alpha.1:2    0.501  0.1332
## alpha.1:3    0.736  0.1381
## alpha.1:4    0.515  0.1920
## alpha.2:3    0.555  0.2255
## alpha.2:4    0.621  0.0905
## alpha.3:4    0.779  0.1630
## Number of clusters:   27  Maximum cluster size: 4

The estimated working correlation matrix is:

\[\mathbf{R}_i(\hat{\alpha}) = \begin{bmatrix} 1 & & & \\ 0.50 & 1 & & \\ 0.74 & 0.56 & 1 & \\ 0.52 & 0.62 & 0.78 & 1 \end{bmatrix}\]
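
And a sketch for the unstructured case: the six estimated parameters fill the lower triangle of the 4 x 4 matrix. In the output above they are reported in the order 1:2, 1:3, 1:4, 2:3, 2:4, 3:4, which matches R's column-wise lower-triangle ordering:

# unstructured: a separate correlation for each pair of occasions
R.uns <- diag(4)
R.uns[lower.tri(R.uns)] <- m_uns$geese$alpha
R.uns <- R.uns + t(R.uns) - diag(4)  # symmetrize
round(R.uns, 2)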

Here is code for obtaining 95% confidence intervals for the regression parameters:

tidy(m_ind, conf.int = T)
## # A tibble: 4 x 7
##   term          estimate std.error statistic  p.value conf.low conf.high
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)     21.2      0.560    1432.   0         20.1       22.3  
## 2 age.8            0.480    0.0631     57.7  3.05e-14   0.356      0.603
## 3 SexMale          1.41     0.774       3.30 6.91e- 2  -0.110      2.92 
## 4 age.8:SexMale    0.305    0.117       6.80 9.10e- 3   0.0758     0.534

If we want to obtain the 95% confidence interval for the slope for male children, \(\beta_1 + \beta_3\), we can use the glht() function from the multcomp package:

male_slope <- glht(m_ind, "age.8 + age.8:SexMale = 0")
summary(male_slope)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject, corstr = "independence")
## 
## Linear Hypotheses:
##                            Estimate Std. Error z value Pr(>|z|)    
## age.8 + age.8:SexMale == 0   0.7844     0.0983    7.98  1.6e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(male_slope)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: geeglm(formula = distance ~ age.8 * Sex, data = dental.long, 
##     id = Subject, corstr = "independence")
## 
## Quantile = 1.96
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                            Estimate lwr   upr  
## age.8 + age.8:SexMale == 0 0.784    0.592 0.977

Response Profile

Another way to model the mean response is to simply estimate the conditional mean separately at each time point for each group, called a response profile model. Notationally, that model may be written: \[\text{E}[\text{Distance}_{ij} | \text{Male}_i, \text{Age}_{ij}] = \beta_0 + \beta_1 \times I_{\text{Age}_{ij}=10} + \beta_2 \times I_{\text{Age}_{ij}=12} + \beta_3 I_{\text{Age}_{ij}=14} \] \[+ \beta_4 \times \text{Male}_i + \beta_5\times I_{\text{Age}_{ij}=10} \times \text{Male}_i + \beta_6 \times I_{\text{Age}_{ij}=12} \times \text{Male}_i + \beta_7 \times I_{\text{Age}_{ij}=14} \times \text{Male}_i\]

Functionally, the mean at each sex and age can be written as a sum of \(\beta\) coefficients. For example, for females age 8, the mean distance would be: \[E[\text{Distance}_{ij} | \text{Male}_i = 0, \text{Age}_{ij}=8 ] = \beta_0\] and for females age 14: \[E[\text{Distance}_{ij} |\text{Male}_i = 0, \text{Age}_{ij}=14] = \beta_0 + \beta_3\] and for males age 8: \[E[\text{Distance}_{ij} | \text{Male}_i = 1, \text{Age}_{ij}=8] = \beta_0 + \beta_4\] and for males age 14: \[E[\text{Distance}_{ij} | \text{Male}_i = 1, \text{Age}_{ij}=14] = \beta_0 + \beta_3 + \beta_4 + \beta_7\]

mod2.gee <- geeglm(distance ~ factor(age) * Sex, 
                   id = Subject, data = dental.long)
summary(mod2.gee)
## 
## Call:
## geeglm(formula = distance ~ factor(age) * Sex, data = dental.long, 
##     id = Subject)
## 
##  Coefficients:
##                       Estimate Std.err    Wald Pr(>|W|)    
## (Intercept)             21.182   0.611 1202.78  < 2e-16 ***
## factor(age)10            1.045   0.343    9.30   0.0023 ** 
## factor(age)12            1.909   0.345   30.61  3.2e-08 ***
## factor(age)14            2.909   0.379   58.82  1.7e-14 ***
## SexMale                  1.693   0.852    3.95   0.0468 *  
## factor(age)10:SexMale   -0.108   0.685    0.02   0.8747    
## factor(age)12:SexMale    0.935   0.677    1.91   0.1674    
## factor(age)14:SexMale    1.685   0.750    5.05   0.0247 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)     4.87    1.01
## Number of clusters:   27  Maximum cluster size: 4
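
As a sketch, glht() can again be used to estimate any of these cell means as a linear combination of coefficients. For example, the fitted mean for boys at age 14, \(\beta_0 + \beta_3 + \beta_4 + \beta_7\), can be obtained with a numeric contrast vector (entries in the coefficient order shown in the summary above):

# 1's pick out (Intercept), factor(age)14, SexMale, and factor(age)14:SexMale
K.m14 <- matrix(c(1, 0, 0, 1, 1, 0, 0, 1), nrow = 1)
mean.m14 <- glht(mod2.gee, linfct = K.m14)
summary(mean.m14)
confint(mean.m14)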

Visually, we can compare the estimates of the mean response by age and sex from the response profile model (i.e., categorical time; red squares) and the linear time model (blue line):

## Get predicted values for plot 
dental.long$fitted.cat <- mod2.gee$fitted[,1]

## Plot fitted and observed values 
ggplot(dental.long) + 
  geom_point(aes(x=age, y=distance), shape = 1) + 
  geom_line(aes(x=age, y=fitted.linear), col = "blue") + 
  geom_point(aes(x=age, y=fitted.cat), col = "red", size = 2, shape=15) + 
  facet_wrap(vars(Sex), nrow=1)

From the figure, there doesn’t appear to be much difference between the estimates. It’s also possible to construct a formal hypothesis test of whether there is evidence of a departure from a linear trend; one approach is sketched below.
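
For example, we can augment the linear-in-age model with indicators for two of the four ages and compare the nested fits with anova(), which for geeglm objects carries out a Wald test of the added terms (a sketch):

# joint Wald test of the four added (departure-from-linearity) terms
mod1.aug <- geeglm(distance ~ age.8 * Sex + I(age == 10) * Sex + I(age == 12) * Sex,
                   id = Subject, data = dental.long)
anova(mod1.gee, mod1.aug)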

Linear Mixed Models

We can fit corresponding mean models with LMM; for brevity, we won’t fit all of them again, but will consider testing hypotheses about fixed effects and about random effects.

Random Intercepts Model

Let’s start with a linear mean model with an interaction between age and sex, and random intercepts only. Notationally, this is:

\[\text{Dist}_{ij} = \beta_0 + \beta_1\times (\text{Age}_{ij}-8) + \beta_2 \times \text{Male}_i + \beta_3 \times(\text{Age}_{ij}-8) \times \text{Male}_i + b_i + \epsilon_{ij}\]

So the sex- and subject-specific models are \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 0, b_i] = (\beta_0 + b_i) + \beta_1 \times (\text{Age}_{ij}-8)\] if child \(i\) is a girl, and \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 1, b_i] = (\beta_0 + \beta_2 + b_i) + (\beta_1 + \beta_3) \times (\text{Age}_{ij}-8)\] if child \(i\) is a boy.

## Linear mean model 
# random intercept 
mod1.ri <- lme(fixed = distance ~ age.8 * Sex, 
               random = ~ 1 | Subject, 
               data = dental.long)
summary(mod1.ri)
## Linear mixed-effects model fit by REML
##  Data: dental.long 
##   AIC BIC logLik
##   446 462   -217
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:        1.82     1.39
## 
## Fixed effects: distance ~ age.8 * Sex 
##               Value Std.Error DF t-value p-value
## (Intercept)   21.21     0.650 79    32.6  0.0000
## age.8          0.48     0.093 79     5.1  0.0000
## SexMale        1.41     0.844 25     1.7  0.1081
## age.8:SexMale  0.30     0.121 79     2.5  0.0141
##  Correlation: 
##               (Intr) age.8  SexMal
## age.8         -0.432              
## SexMale       -0.770  0.332       
## age.8:SexMale  0.332 -0.770 -0.432
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -3.5980 -0.4546  0.0158  0.5024  3.6862 
## 
## Number of Observations: 108
## Number of Groups: 27
intervals(mod1.ri)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                 lower   est.  upper
## (Intercept)   19.9158 21.209 22.502
## age.8          0.2935  0.480  0.666
## SexMale       -0.3318  1.407  3.145
## age.8:SexMale  0.0631  0.305  0.547
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: Subject 
##                 lower est. upper
## sd((Intercept))  1.32 1.82   2.5
## 
##  Within-group standard error:
## lower  est. upper 
##  1.19  1.39  1.62

The systematic part of the model is:

\[\widehat{\text{Dist}}_{ij} = 21.21 + 0.48 \times (\text{Age}_{ij} - 8) + 1.41 \times \text{Male}_i + 0.30 \times (\text{Age}_{ij} - 8) \times \text{Male}_i\]

As noted in the lecture, these coefficients may be interpreted marginally or conditionally, since marginal and conditional associations coincide in linear mixed models. So, for example, both of the following interpretations of \(\hat{\beta}_1 = 0.48\) work: across the population of girls, mean dental distance differs by 0.48 mm per one-year difference in age (marginal); and for an individual girl, expected dental distance increases by 0.48 mm per year of age (conditional on her random intercept).

We also see that the estimated variance of the random intercepts is \[\widehat{\text{Var}}(b_i) = \hat{\tau}^2 = 1.82^2 = 3.31\] and the estimated variance of the errors \(\epsilon_{ij}\) is \[\widehat{\text{Var}}(\epsilon_{ij}) = \hat{\sigma}^2 = 1.39^2 = 1.93,\] leading to an estimate of \[\widehat{\text{ICC}} = \frac{\hat{\tau}^2}{\hat{\sigma}^2 + \hat{\tau}^2} = \frac{3.31}{1.93+3.31} = 0.632\] for the intra-class correlation coefficient. Remember that we expected this to be close to the estimated exchangeable correlation from the GEE section (0.618), and indeed it is, although the two estimators are not identical so the values need not match exactly.
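
As a sketch, the same calculation can be done directly from the fitted model by pulling the variance components out of VarCorr() (nlme returns these as a character matrix, so we convert to numeric):

vc <- VarCorr(mod1.ri)
tau2 <- as.numeric(vc["(Intercept)", "Variance"])  # random intercept variance
sig2 <- as.numeric(vc["Residual", "Variance"])     # residual variance
tau2 / (tau2 + sig2)                               # estimated ICC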

We can extract different fitted values and residuals from an LMM, depending on the level of grouping of interest. We’ll focus on fitted values for this document; you can explore residual analysis further using the textbook resources suggested in the lectures. “Level 0” refers to the fitted values or residuals based on the fixed effects only. The matrix of results below makes this clear: the fitted values are identical for every subject of the same sex (the first six subjects shown are all boys).

# Level 0 fitted values 
fixed.pred <- fitted(mod1.ri, level=0)
head(fixed.pred)
##  M16  M16  M16  M16  M05  M05 
## 22.6 24.2 25.8 27.3 22.6 24.2
head(matrix(fixed.pred, ncol=4, byrow=T))
##      [,1] [,2] [,3] [,4]
## [1,] 22.6 24.2 25.8 27.3
## [2,] 22.6 24.2 25.8 27.3
## [3,] 22.6 24.2 25.8 27.3
## [4,] 22.6 24.2 25.8 27.3
## [5,] 22.6 24.2 25.8 27.3
## [6,] 22.6 24.2 25.8 27.3

In contrast, the “level 1” fitted values incorporate the random effects, so they will differ between subjects (by a constant, up to occasional rounding differences, since this is a random intercept model). Notice, for example, that the predictions in row 6 are always 0.1 greater than the predictions in row 5.

# Level 1 fitted values 
randint.pred <- fitted(mod1.ri, level=1)
head(randint.pred)
##  M16  M16  M16  M16  M05  M05 
## 20.9 22.5 24.0 25.6 20.9 22.5
head(matrix(randint.pred, ncol=4, byrow=T))
##      [,1] [,2] [,3] [,4]
## [1,] 20.9 22.5 24.0 25.6
## [2,] 20.9 22.5 24.0 25.6
## [3,] 21.2 22.8 24.4 25.9
## [4,] 21.4 23.0 24.6 26.1
## [5,] 21.6 23.1 24.7 26.3
## [6,] 21.7 23.2 24.8 26.4

We can see the level 1 predictions in these plots:

dental.long$fitted.lme <- fitted(mod1.ri)

# All subject-specific trajectories 
dental.long %>% 
  ggplot() + 
  geom_line(aes(x=age, y=distance, col=Sex)) + 
  geom_point(aes(x=age, y=distance, col=Sex)) + 
  geom_line(aes(x=age, y=fitted.lme), col="black", lty = 2, size = 1) + 
  xlab("Age (Years)") + ylab("Distance (mm)") + 
  theme_classic(base_size = 14) + 
  facet_wrap(vars(Subject), nrow=4) + 
  scale_color_manual(values = pal2use2) + 
  theme(legend.position = "right")

# Zooming in on two subjects 
dental.long %>% 
  filter(Subject %in% c("M10", "M11")) %>% 
  ggplot() + 
  geom_line(aes(x=age, y=distance, col=Subject)) + 
  geom_point(aes(x=age, y=distance, col=Subject)) + 
  geom_line(aes(x=age, y=fitted.lme, col=Subject), lty = 2) + 
  xlab("Age (Years)") + ylab("Distance (mm)") + 
  ggtitle("Comparison Between Two Male Subjects") + 
  theme_classic(base_size = 14) + 
  theme(legend.position = "right") + 
  scale_color_manual(values = pal2use2)

Random Intercepts and Slopes Model

We could also allow slopes to vary by subject. Notationally, this model is \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i, b_i] = \beta_0 + \beta_1 \times (\text{Age}_{ij} - 8) + \beta_2 \times \text{Male}_{i} + \beta_3 \times (\text{Age}_{ij} - 8) \times \text{Male}_{i} + b_{i0} + b_{i1} \times (\text{Age}_{ij} - 8)\] Now the sex- and subject-specific models are \[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 0, b_i] = (\beta_0 + b_{i0}) + (\beta_1 + b_{i1}) \times (\text{Age}_{ij} - 8)\] if child \(i\) is a girl, and
\[E[\text{Distance}_{ij} | \text{Age}_{ij}, \text{Male}_i = 1, b_i] = (\beta_0 + \beta_2 + b_{i0}) + ( \beta_1 + \beta_3 + b_{i1}) \times (\text{Age}_{ij} - 8)\] if child \(i\) is a boy.

# random intercept, random slope 
mod1.rs <- lme(fixed = distance ~ age.8 * Sex, 
               random = ~ 1 + age | Subject, 
               data = dental.long)
summary(mod1.rs)
## Linear mixed-effects model fit by REML
##  Data: dental.long 
##   AIC BIC logLik
##   449 470   -216
## 
## Random effects:
##  Formula: ~1 + age | Subject
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev Corr  
## (Intercept) 2.41   (Intr)
## age         0.18   -0.668
## Residual    1.31         
## 
## Fixed effects: distance ~ age.8 * Sex 
##               Value Std.Error DF t-value p-value
## (Intercept)   21.21     0.635 79    33.4  0.0000
## age.8          0.48     0.104 79     4.6  0.0000
## SexMale        1.41     0.825 25     1.7  0.1006
## age.8:SexMale  0.30     0.135 79     2.3  0.0264
##  Correlation: 
##               (Intr) age.8  SexMal
## age.8         -0.396              
## SexMale       -0.770  0.305       
## age.8:SexMale  0.305 -0.770 -0.396
## 
## Standardized Within-Group Residuals:
##     Min      Q1     Med      Q3     Max 
## -3.1681 -0.3859  0.0071  0.4452  3.8495 
## 
## Number of Observations: 108
## Number of Groups: 27

Again, we can extract level 0 fitted values (these are the same as above):

# Level 0 fitted values 
fixed.pred.2 <- fitted(mod1.rs, level=0)
head(matrix(fixed.pred.2, ncol=4, byrow=T))
##      [,1] [,2] [,3] [,4]
## [1,] 22.6 24.2 25.8 27.3
## [2,] 22.6 24.2 25.8 27.3
## [3,] 22.6 24.2 25.8 27.3
## [4,] 22.6 24.2 25.8 27.3
## [5,] 22.6 24.2 25.8 27.3
## [6,] 22.6 24.2 25.8 27.3

or the level 1 fitted values, which no longer differ by a constant between subjects:

# Level 1 fitted values 
randsl.pred <- fitted(mod1.rs, level=1)
head(randsl.pred)
##  M16  M16  M16  M16  M05  M05 
## 21.1 22.5 23.9 25.3 20.9 22.5
head(matrix(randsl.pred, ncol=4, byrow=T))
##      [,1] [,2] [,3] [,4]
## [1,] 21.1 22.5 23.9 25.3
## [2,] 20.9 22.5 24.0 25.6
## [3,] 21.3 22.8 24.3 25.8
## [4,] 21.8 23.1 24.4 25.7
## [5,] 21.6 23.1 24.7 26.2
## [6,] 22.0 23.3 24.6 26.0

Plotting the two models (random intercept, random intercept and slope), we observe results that are similar but not identical.

dental.long$fitted.lme.rs <- fitted(mod1.rs)
dental.long %>% 
  ggplot() + 
  geom_line(aes(x=age, y=distance, col=Sex)) + 
  geom_point(aes(x=age, y=distance, col=Sex)) + 
  geom_line(aes(x=age, y=fitted.lme.rs), col="black", lty = 2) + 
  xlab("Age (Years)") + ylab("Distance (mm)") + 
  theme_classic(base_size = 14) + 
  facet_wrap(vars(Subject), nrow=4) + 
  theme(legend.position = "right") + 
  scale_color_manual(values=pal2use2)

The black lines in the zoomed-in plot correspond to the random intercepts and slopes model. For subject M10, the random slope is slightly positive (0.05) and the line is slightly steeper; for subject M11, the random slope is slightly negative (-0.14) and the line is less steep.

coef(mod1.rs)[c("M10", "M11"), ]
##     (Intercept) age.8 SexMale age.8:SexMale     age
## M10        24.7  0.48    1.41         0.305  0.0507
## M11        21.5  0.48    1.41         0.305 -0.1405
dental.long %>% 
  filter(Subject %in% c("M10", "M11")) %>% 
  ggplot() + 
  geom_line(aes(x=age, y=distance, col=Subject)) + 
  geom_point(aes(x=age, y=distance, col=Subject)) + 
  geom_line(aes(x=age, y=fitted.lme, col=Subject), lty = 3) + 
  geom_line(aes(x=age, y=fitted.lme.rs, group=Subject), col="black", size=1.25, lty = 2) + 
  xlab("Age (Years)") + ylab("Distance (mm)") + 
  ggtitle("Comparison Between Two Male Subjects") + 
  theme_classic(base_size = 14) + 
  theme(legend.position = "right")

To use a likelihood ratio test to formally test whether the model with random slopes is significantly better than the model with only random intercepts (i.e., \(D_{12} = D_{22} = 0\) in the covariance matrix of the random effects), we could do the following.

# Is the correlation structure better with random slopes? 
# Doesn't look like it (p=0.56, AIC and BIC are slightly lower for RI)
anova(mod1.rs, mod1.ri)
##         Model df AIC BIC logLik   Test L.Ratio p-value
## mod1.rs     1  8 449 470   -216                       
## mod1.ri     2  6 446 462   -217 1 vs 2    1.18   0.556

The slightly lower AIC and BIC for the random intercepts model, along with the nonsignificant p-value, suggest that the random intercepts model is sufficient (i.e., an exchangeable within-subject correlation structure is reasonable here). One caveat: the usual chi-square reference distribution is conservative when testing whether a variance component is zero, since the null value lies on the boundary of the parameter space; that caveat does not change the conclusion here.

Robust SE

We noticed earlier that the correlation is much higher for girls than for boys, contrary to our assumed covariance structure, which is the same for both sexes. We can use robust standard errors to protect our inference against potential misspecification of this kind.

Here are the original inference results, which we’ve seen already:

## Original CIs 
orig.tab <- cbind(summary(mod1.ri)$tTable, intervals(mod1.ri)$fixed)
round(orig.tab[, c("Value", "p-value", "lower", "upper")], 3)
##                Value p-value  lower  upper
## (Intercept)   21.209   0.000 19.916 22.502
## age.8          0.480   0.000  0.293  0.666
## SexMale        1.407   0.108 -0.332  3.145
## age.8:SexMale  0.305   0.014  0.063  0.547

Using robust SEs, we find the following:

# clubSandwich 
## Robust standard errors (random intercept model)
sig <- vcovCR(mod1.ri, type = "CR2")
summ <- coef_test(mod1.ri, sig)
cis <- conf_int(mod1.ri, sig)
robtab <- merge(summ[, c("Coef", "beta", "p_Satt")], 
      cis[, c("Coef", "CI_L", "CI_U")])
rownames(robtab) <- robtab$Coef 
robtab$Coef <- NULL 
round(robtab[c(1,2,4,3), ], 3)
##                 beta p_Satt   CI_L   CI_U
## (Intercept)   21.209  0.000 19.899 22.519
## age.8          0.480  0.000  0.332  0.627
## SexMale        1.407  0.095 -0.266  3.079
## age.8:SexMale  0.305  0.020  0.053  0.557

Notice that the point estimates are unchanged; compared to the model-based intervals, the robust confidence intervals may be narrower (here, for age and sex) or wider (here, for the interaction term), with correspondingly smaller or larger p-values. A quick numerical comparison is sketched below.
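
Here, the comparison can be made explicit by computing the widths of the model-based and robust confidence intervals side by side (a sketch; we index the robust table by coefficient name so the rows line up):

# CI widths; a smaller value in the second column means the robust interval is narrower
cbind(model.based = orig.tab[, "upper"] - orig.tab[, "lower"],
      robust = robtab[rownames(orig.tab), "CI_U"] - robtab[rownames(orig.tab), "CI_L"])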

Testing Fixed Effects

Hypotheses about fixed effects that can be expressed as a comparison of nested models can be tested using likelihood ratio tests; for comparisons of fixed effects, the models should be fit by maximum likelihood (ML) rather than REML, as in the code below. If we wanted to formally test whether, e.g., a response profile model fits significantly better than a linear mean model, we would express them as nested models:

# use maximum likelihood estimation to compare fixed effects 
# linear model 
modLIN.ri <- lme(fixed = distance ~ age * Sex, 
                 method = "ML",
                 random = ~ 1 | Subject,
                 data = dental.long)

# response profile model 
# note we only need indicators for two of the four ages: the linear trend in age
# already accounts for the remaining degrees of freedom of the response profile
modCAT.ri <- lme(fixed = distance ~ age * Sex + I(age == 10) * Sex + I(age == 12) * Sex, 
                 method = "ML",
                 random = ~ 1 | Subject,
                 data = dental.long)

# LRT 
anova(modLIN.ri, modCAT.ri)
##           Model df AIC BIC logLik   Test L.Ratio p-value
## modLIN.ri     1  6 441 457   -214                       
## modCAT.ri     2 10 447 473   -213 1 vs 2    2.01   0.735

We find that the response profile model is not significantly better (unsurprising given the plots we saw in the GEE section).