Goal of This Case Study

Our primary objective with this case study is to demonstrate different mean models (i.e., forms for the systematic component of a GEE or LMM).

Packages

library(tidyverse)
library(lattice)
library(geepack)

The Dataset

This dataset comes from the Six Cities Study of Air Pollution and Health. The data are from a cohort of \(N=300\) school-age female children living in Topeka, Kansas. Most children enrolled in the first or second grade (between the ages of six and seven). They were then measured annually until high school graduation or loss to follow-up.

dat <- read.csv("../Datasets/Topeka.csv", row.names=1)
head(dat)
##   id height     age height0   age0 logFEV1
## 1  1   1.20  9.3415     1.2 9.3415 0.21511
## 2  1   1.28 10.3929     1.2 9.3415 0.37156
## 3  1   1.33 11.4524     1.2 9.3415 0.48858
## 4  1   1.42 12.4600     1.2 9.3415 0.75142
## 5  1   1.48 13.4182     1.2 9.3415 0.83291
## 6  1   1.50 15.4743     1.2 9.3415 0.89200
# age 0 is the age of entry
dat <- dat %>% 
  mutate(time_from_entry = age - age0)

Exploratory Analysis

Histogram of age at entry:

dat %>% 
  filter(age==age0) %>%
  ggplot(aes(x=age0)) +
  geom_histogram(fill="white", color="black", binwidth = 0.5) +
  labs(title="",
       x="Age at entry",
       y = "Count")+
  theme_classic(base_size = 15)

From this figure, we see that most children enrolled between ages of seven and nine.

Scatterplot of log(FEV1) (y-axis) vs age (x-axis):

dat %>% 
  ggplot(aes(x=age, y=logFEV1)) +
  geom_point()+
  labs(title="",
       x="Age",
       y = "Log(FEV1)")+
  theme_classic(base_size = 15)

We observe that log(FEV1) is an increasing function of age, but the rate of increase seems to decrease after ages around 14. The variability in log(FEV1) is close to constant across the observed ages.

Scatterplot of log(FEV1) (y-axis) vs time from entry (x-axis):

dat %>% 
  ggplot(aes(x=time_from_entry, y=logFEV1)) +
  geom_point()+
  labs(title="",
       x="Time from entry",
       y = "Log(FEV1)")+
  theme_classic(base_size = 15)

Note: Although children were measured roughly every year, when we use age rather than chronological time (i.e., time from entry), the data are highly imbalanced. We will use age as our time scale.

Modeling the Mean

Step Function

Notation:

  • \(\text{logFEV1}_{ij}\): log FEV1 for subject \(i\) at measurement occasion \(j\)
  • \(\text{age}_{ij}\): age of subject \(i\) at measurement occasion \(j\)
  • \(\text{AGE}_{ij}(k)\): dummy (indicator) variable for age category \(k\) (e.g., \(\text{AGE}_{ij}(10-12) = 1\) if \(\text{age}_{ij}\) is between 10 and 12)

Mean Model:

\[E[\text{logFEV1}_{ij}|\text{age}_{ij}] = \beta_0 + \beta_1 \text{AGE}_{ij}(8-10) + \beta_2 \text{AGE}_{ij}(10-12) + \cdots + \beta_{5} \text{AGE}_{ij}(16+)\]

In this model, the rate of change is zero (flat) within intervals and then jumps to the next interval.

# Create indicators for each age category 
dat$age_cat <- cut(dat$age, c(0, 8, 10, 12, 14, 16, Inf))

# Fit the step function model 
mod1 <- geeglm(logFEV1 ~ age_cat, id = id, data = dat)
summary(mod1)
## 
## Call:
## geeglm(formula = logFEV1 ~ age_cat, data = dat, id = id)
## 
##  Coefficients:
##                 Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)      0.30050 0.01158  673.7   <2e-16 ***
## age_cat(8,10]    0.17924 0.01218  216.6   <2e-16 ***
## age_cat(10,12]   0.39024 0.01342  846.0   <2e-16 ***
## age_cat(12,14]   0.63170 0.01330 2256.1   <2e-16 ***
## age_cat(14,16]   0.78744 0.01272 3833.1   <2e-16 ***
## age_cat(16,Inf]  0.83472 0.01309 4066.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate  Std.err
## (Intercept)  0.02451 0.001989
## Number of clusters:   300  Maximum cluster size: 12
# Add fitted values to dataset and then plot (with nice colors) 
dat$fitted1 <- mod1$fitted

pal2use <- c("#E69F00", "#0072B2")
xyplot(logFEV1 ~ age , data=dat, 
       panel=function(x,y){
         panel.xyplot(x, y, pch=16, col=pal2use[2])
         panel.xyplot(dat$age, dat$fitted1, pch=16, col=pal2use[1])
       })

Linear Trend

Notation:

  • \(\text{logFEV1}_{ij}\): log FEV1 for subject \(i\) at measurement occasion \(j\)
  • \(\text{age}_{ij}\): age of subject \(i\) at measurement occasion \(j\)

Mean Model:

\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij}\]

In this model, the rate of change is constant (\(\beta_1\))

mod2 <- geeglm(logFEV1 ~ age, id = id, data = dat)
summary(mod2)
## 
## Call:
## geeglm(formula = logFEV1 ~ age, data = dat, id = id)
## 
##  Coefficients:
##             Estimate  Std.err Wald Pr(>|W|)    
## (Intercept) -0.27415  0.01579  301   <2e-16 ***
## age          0.08669  0.00113 5918   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)   0.0263 0.00185
## Number of clusters:   300  Maximum cluster size: 12
dat$fitted2 <- mod2$fitted
xyplot(logFEV1 ~ age , data=dat, 
       panel=function(x,y){
         panel.xyplot(x, y, pch=16, col=pal2use[2])
         panel.xyplot(dat$age, dat$fitted2, pch=16, col=pal2use[1])
       })

Quadratic Trend

Notation:

  • \(\text{logFEV1}_{ij}\): log FEV1 for subject \(i\) at measurement occasion \(j\)
  • \(\text{age}_{ij}\): age of subject \(i\) at measurement occasion \(j\)

Mean Model:

\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij} + \beta_2 \text{age}_{ij}^2\]

In this model, the rate of change is non-constant (\(\beta_1 + 2 \beta_2 \text{age}_{ij}\))

# I() notation allows you to include mathematical operators in a formula 
# Alternatively, you could create an age.2 = age^2 variable in the dataset 
mod3 <- geeglm(logFEV1 ~ age + I(age^2), id = id, data = dat)
summary(mod3)
## 
## Call:
## geeglm(formula = logFEV1 ~ age + I(age^2), data = dat, id = id)
## 
##  Coefficients:
##              Estimate   Std.err Wald Pr(>|W|)    
## (Intercept) -1.172120  0.051319  522   <2e-16 ***
## age          0.240005  0.008977  715   <2e-16 ***
## I(age^2)    -0.006088  0.000356  292   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)   0.0225 0.00189
## Number of clusters:   300  Maximum cluster size: 12
dat$fitted3 <- mod3$fitted
xyplot(logFEV1 ~ age , data=dat, 
       panel=function(x,y){
         panel.xyplot(x, y, pch=16, col=pal2use[2])
         panel.xyplot(dat$age, dat$fitted3, pch=16, col=pal2use[1])
       })

Linear Splines

Notation:

  • \(\text{logFEV1}_{ij}\): log FEV1 for subject \(i\) at measurement occasion \(j\)
  • \(\text{age}_{ij}\): age of subject \(i\) at measurement occasion \(j\)
  • \((\text{age}_{ij}-\text{age}_{k})_+ = \text{max}(\text{age}_{ij}-\text{age}_k,0)\): linear spline based on knot \(\text{age}_k\)

Mean Model:

\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij} + \beta_2 (\text{age}_{ij}-10)_+ + \beta_3 (\text{age}_{ij} - 14)_+\]

In this model, the rate of change is:

  • \(\beta_1\): for \(\text{age}_{ij} < 10\)
  • \(\beta_1 + \beta_2\): for \(10 \leq \text{age}_{ij} < 14\)
  • \(\beta_1 + \beta_2 + \beta_3\): for \(\text{age}_{ij} \geq 14\)
# Create variables for "years since age X" (0 if age < age X)
dat$ageSpline10 <- pmax(dat$age - 10, 0)
dat$ageSpline14 <- pmax(dat$age - 14, 0)

# Model fitting and results 
mod4 <- geeglm(logFEV1 ~ age + ageSpline10 + ageSpline14, 
               id = id, data = dat)
summary(mod4)
## 
## Call:
## geeglm(formula = logFEV1 ~ age + ageSpline10 + ageSpline14, data = dat, 
##     id = id)
## 
##  Coefficients:
##             Estimate  Std.err   Wald Pr(>|W|)    
## (Intercept) -0.47374  0.03975 142.04   <2e-16 ***
## age          0.10530  0.00461 522.30   <2e-16 ***
## ageSpline10  0.01365  0.00627   4.75    0.029 *  
## ageSpline14 -0.09315  0.00492 358.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)   0.0219 0.00189
## Number of clusters:   300  Maximum cluster size: 12
# Add fitted values to dataset and plot 
dat$fitted4 <- mod4$fitted
xyplot(logFEV1 ~ age , data=dat, 
       panel=function(x,y){
         panel.xyplot(x, y, pch=16, col=pal2use[2])
         panel.xyplot(dat$age, dat$fitted4, pch=16, col=pal2use[1])
       })

Summary

Explored various mean models as regression with time, and perhaps additional functions of time

  • Polynomial models:
    • Quadratic: \(\beta_1 t_{ij} + \beta_2 t_{ij}^2\)
    • Others: higher order polynomials
    • Allow for non-linear mean curve
    • Allow for non-constant rate of change
  • Regression splines:
    • Linear splines: \(\beta_1 t_{ij} + \beta_2 (t_{ij} - 14)_+\)
    • Others: cubic splines
    • Allow for non-linear mean curve
    • Allow for non-constant rate of change

Note that, while we used these mean models in the context of GEE, they could be used for LMM just as well (and random effects could allow subject-specific intercepts, slopes, quadratic trends, spline slopes, etc., if they are called for given your scientific question and if your dataset is large enough to support estimation).