Our primary objective with this case study is to demonstrate different mean models (i.e., forms for the systematic component of a GEE or LMM).
library(tidyverse)
library(lattice)
library(geepack)
This dataset comes from the Six Cities Study of Air Pollution and Health. The data are from a cohort of \(N=300\) school-age female children living in Topeka, Kansas. Most children enrolled in the first or second grade (between the ages of six and seven). They were then measured annually until high school graduation or loss to follow-up.
dat <- read.csv("../Datasets/Topeka.csv", row.names=1)
head(dat)
## id height age height0 age0 logFEV1
## 1 1 1.20 9.3415 1.2 9.3415 0.21511
## 2 1 1.28 10.3929 1.2 9.3415 0.37156
## 3 1 1.33 11.4524 1.2 9.3415 0.48858
## 4 1 1.42 12.4600 1.2 9.3415 0.75142
## 5 1 1.48 13.4182 1.2 9.3415 0.83291
## 6 1 1.50 15.4743 1.2 9.3415 0.89200
# age 0 is the age of entry
dat <- dat %>%
mutate(time_from_entry = age - age0)
Histogram of age at entry:
dat %>%
filter(age==age0) %>%
ggplot(aes(x=age0)) +
geom_histogram(fill="white", color="black", binwidth = 0.5) +
labs(title="",
x="Age at entry",
y = "Count")+
theme_classic(base_size = 15)
From this figure, we see that most children enrolled between ages of seven and nine.
Scatterplot of log(FEV1) (y-axis) vs age (x-axis):
dat %>%
ggplot(aes(x=age, y=logFEV1)) +
geom_point()+
labs(title="",
x="Age",
y = "Log(FEV1)")+
theme_classic(base_size = 15)
We observe that log(FEV1) is an increasing function of age, but the rate of increase seems to decrease after ages around 14. The variability in log(FEV1) is close to constant across the observed ages.
Scatterplot of log(FEV1) (y-axis) vs time from entry (x-axis):
dat %>%
ggplot(aes(x=time_from_entry, y=logFEV1)) +
geom_point()+
labs(title="",
x="Time from entry",
y = "Log(FEV1)")+
theme_classic(base_size = 15)
Note: Although children were measured roughly every year, when we use age rather than chronological time (i.e., time from entry), the data are highly imbalanced. We will use age as our time scale.
Notation:
Mean Model:
\[E[\text{logFEV1}_{ij}|\text{age}_{ij}] = \beta_0 + \beta_1 \text{AGE}_{ij}(8-10) + \beta_2 \text{AGE}_{ij}(10-12) + \cdots + \beta_{5} \text{AGE}_{ij}(16+)\]
In this model, the rate of change is zero (flat) within intervals and then jumps to the next interval.
# Create indicators for each age category
dat$age_cat <- cut(dat$age, c(0, 8, 10, 12, 14, 16, Inf))
# Fit the step function model
mod1 <- geeglm(logFEV1 ~ age_cat, id = id, data = dat)
summary(mod1)
##
## Call:
## geeglm(formula = logFEV1 ~ age_cat, data = dat, id = id)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 0.30050 0.01158 673.7 <2e-16 ***
## age_cat(8,10] 0.17924 0.01218 216.6 <2e-16 ***
## age_cat(10,12] 0.39024 0.01342 846.0 <2e-16 ***
## age_cat(12,14] 0.63170 0.01330 2256.1 <2e-16 ***
## age_cat(14,16] 0.78744 0.01272 3833.1 <2e-16 ***
## age_cat(16,Inf] 0.83472 0.01309 4066.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.02451 0.001989
## Number of clusters: 300 Maximum cluster size: 12
# Add fitted values to dataset and then plot (with nice colors)
dat$fitted1 <- mod1$fitted
pal2use <- c("#E69F00", "#0072B2")
xyplot(logFEV1 ~ age , data=dat,
panel=function(x,y){
panel.xyplot(x, y, pch=16, col=pal2use[2])
panel.xyplot(dat$age, dat$fitted1, pch=16, col=pal2use[1])
})
Notation:
Mean Model:
\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij}\]
In this model, the rate of change is constant (\(\beta_1\))
mod2 <- geeglm(logFEV1 ~ age, id = id, data = dat)
summary(mod2)
##
## Call:
## geeglm(formula = logFEV1 ~ age, data = dat, id = id)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.27415 0.01579 301 <2e-16 ***
## age 0.08669 0.00113 5918 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.0263 0.00185
## Number of clusters: 300 Maximum cluster size: 12
dat$fitted2 <- mod2$fitted
xyplot(logFEV1 ~ age , data=dat,
panel=function(x,y){
panel.xyplot(x, y, pch=16, col=pal2use[2])
panel.xyplot(dat$age, dat$fitted2, pch=16, col=pal2use[1])
})
Notation:
Mean Model:
\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij} + \beta_2 \text{age}_{ij}^2\]
In this model, the rate of change is non-constant (\(\beta_1 + 2 \beta_2 \text{age}_{ij}\))
# I() notation allows you to include mathematical operators in a formula
# Alternatively, you could create an age.2 = age^2 variable in the dataset
mod3 <- geeglm(logFEV1 ~ age + I(age^2), id = id, data = dat)
summary(mod3)
##
## Call:
## geeglm(formula = logFEV1 ~ age + I(age^2), data = dat, id = id)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -1.172120 0.051319 522 <2e-16 ***
## age 0.240005 0.008977 715 <2e-16 ***
## I(age^2) -0.006088 0.000356 292 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.0225 0.00189
## Number of clusters: 300 Maximum cluster size: 12
dat$fitted3 <- mod3$fitted
xyplot(logFEV1 ~ age , data=dat,
panel=function(x,y){
panel.xyplot(x, y, pch=16, col=pal2use[2])
panel.xyplot(dat$age, dat$fitted3, pch=16, col=pal2use[1])
})
Notation:
Mean Model:
\[E[\text{logFEV1}_{ij} | \text{age}_{ij}] = \beta_0 + \beta_1 \text{age}_{ij} + \beta_2 (\text{age}_{ij}-10)_+ + \beta_3 (\text{age}_{ij} - 14)_+\]
In this model, the rate of change is:
# Create variables for "years since age X" (0 if age < age X)
dat$ageSpline10 <- pmax(dat$age - 10, 0)
dat$ageSpline14 <- pmax(dat$age - 14, 0)
# Model fitting and results
mod4 <- geeglm(logFEV1 ~ age + ageSpline10 + ageSpline14,
id = id, data = dat)
summary(mod4)
##
## Call:
## geeglm(formula = logFEV1 ~ age + ageSpline10 + ageSpline14, data = dat,
## id = id)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -0.47374 0.03975 142.04 <2e-16 ***
## age 0.10530 0.00461 522.30 <2e-16 ***
## ageSpline10 0.01365 0.00627 4.75 0.029 *
## ageSpline14 -0.09315 0.00492 358.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.0219 0.00189
## Number of clusters: 300 Maximum cluster size: 12
# Add fitted values to dataset and plot
dat$fitted4 <- mod4$fitted
xyplot(logFEV1 ~ age , data=dat,
panel=function(x,y){
panel.xyplot(x, y, pch=16, col=pal2use[2])
panel.xyplot(dat$age, dat$fitted4, pch=16, col=pal2use[1])
})
Explored various mean models as regression with time, and perhaps additional functions of time
Note that, while we used these mean models in the context of GEE, they could be used for LMM just as well (and random effects could allow subject-specific intercepts, slopes, quadratic trends, spline slopes, etc., if they are called for given your scientific question and if your dataset is large enough to support estimation).