This file provides an analysis in R of the data collected from a randomized controlled trial of treatment with D-penicillamine for primary biliary cirrhosis (PBC), now primary biliary cholangitis, a rare autoimmune liver disease.

Disclaimer: because this is a didactic exercise, we will consider a range of potential models. In practice, model fitting should be guided by a priori scientific knowledge and exploratory data analysis, and should be directly relevant to the scientific question of interest.

1 Preliminaries

First, install the geepack, dplyr, ggplot2, and JM extension packages.

install.packages(c('geepack', 'dplyr', 'ggplot2', 'JM'), repos='https://cloud.r-project.org')

Next, load these packages. I also included some options for graphics as a user-specified function. (Some of the figures in this document are meant to be of publication quality, while others are only used for exploring the data and evaluating fitted models.)

library(geepack)
library(dplyr)
library(ggplot2)
library(JM)

theme_set(theme_bw())
mytheme <- function(...){
  theme(panel.grid.minor=element_blank(),
        plot.margin=unit(c(0.5, 0.5, 0.5, 0.5), 'lines'),
        axis.title.x=element_text(size=12, vjust=0),
        axis.title.y=element_text(size=12, vjust=1, angle=90),
        legend.title=element_text(size=12, face='bold'),
        legend.text=element_text(size=12),
        legend.position='top', legend.direction='horizontal', 
        legend.key=element_blank(),
        legend.key.width=unit(1.5,"cm"),
        legend.key.height=unit(0.35,'cm'),
        strip.text.y=element_text(size=12))}

2 PBC data

The PBC data are provided as part of the JM extension package. Note that there is a help file for the dataset that can be accessed using ?pbc2. The dataset is in so-called long format, in which each participant has multiple rows, with each row corresponding to a measurement time. The data are sorted by id and year; software packages for longitudinal data analysis typically assume that the data are properly sorted.

data(pbc2)
head(pbc2, 11)
##    id    years status      drug      age    sex      year ascites hepatomegaly spiders
## 1   1  1.09517   dead D-penicil 58.76684 female 0.0000000     Yes          Yes     Yes
## 2   1  1.09517   dead D-penicil 58.76684 female 0.5256817     Yes          Yes     Yes
## 3   2 14.15234  alive D-penicil 56.44782 female 0.0000000      No          Yes     Yes
## 4   2 14.15234  alive D-penicil 56.44782 female 0.4983025      No          Yes     Yes
## 5   2 14.15234  alive D-penicil 56.44782 female 0.9993429      No          Yes     Yes
## 6   2 14.15234  alive D-penicil 56.44782 female 2.1027270      No          Yes     Yes
## 7   2 14.15234  alive D-penicil 56.44782 female 4.9008871     Yes          Yes     Yes
## 8   2 14.15234  alive D-penicil 56.44782 female 5.8892783     Yes          Yes     Yes
## 9   2 14.15234  alive D-penicil 56.44782 female 6.8858833     Yes          Yes     Yes
## 10  2 14.15234  alive D-penicil 56.44782 female 7.8907020     Yes          Yes     Yes
## 11  2 14.15234  alive D-penicil 56.44782 female 8.8325485     Yes          Yes     Yes
##                      edema serBilir serChol albumin alkaline  SGOT platelets prothrombin histologic status2
## 1  edema despite diuretics     14.5     261    2.60     1718 138.0       190        12.2          4       1
## 2  edema despite diuretics     21.3      NA    2.94     1612   6.2       183        11.2          4       1
## 3                 No edema      1.1     302    4.14     7395 113.5       221        10.6          3       0
## 4                 No edema      0.8      NA    3.60     2107 139.5       188        11.0          3       0
## 5                 No edema      1.0      NA    3.55     1711 144.2       161        11.6          3       0
## 6                 No edema      1.9      NA    3.92     1365 144.2       122        10.6          3       0
## 7       edema no diuretics      2.6     230    3.32     1110 131.8       135        11.3          3       0
## 8  edema despite diuretics      3.6      NA    2.92      996 131.8       100        11.5          3       0
## 9  edema despite diuretics      4.2      NA    2.73      860 145.7       103        11.5          3       0
## 10 edema despite diuretics      3.6     244    2.80      779 119.0       113        11.5          3       0
## 11 edema despite diuretics      4.6     237    2.67      669  88.0       100        11.5          3       0
str(pbc2)
## 'data.frame':    1945 obs. of  20 variables:
##  $ id          : Factor w/ 312 levels "1","2","3","4",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ years       : num  1.1 1.1 14.2 14.2 14.2 ...
##  $ status      : Factor w/ 3 levels "alive","transplanted",..: 3 3 1 1 1 1 1 1 1 1 ...
##  $ drug        : Factor w/ 2 levels "placebo","D-penicil": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age         : num  58.8 58.8 56.4 56.4 56.4 ...
##  $ sex         : Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year        : num  0 0.526 0 0.498 0.999 ...
##  $ ascites     : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 2 2 2 ...
##  $ hepatomegaly: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ spiders     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ edema       : Factor w/ 3 levels "No edema","edema no diuretics",..: 3 3 1 1 1 1 2 3 3 3 ...
##  $ serBilir    : num  14.5 21.3 1.1 0.8 1 1.9 2.6 3.6 4.2 3.6 ...
##  $ serChol     : int  261 NA 302 NA NA NA 230 NA NA 244 ...
##  $ albumin     : num  2.6 2.94 4.14 3.6 3.55 3.92 3.32 2.92 2.73 2.8 ...
##  $ alkaline    : int  1718 1612 7395 2107 1711 1365 1110 996 860 779 ...
##  $ SGOT        : num  138 6.2 113.5 139.5 144.2 ...
##  $ platelets   : int  190 183 221 188 161 122 135 100 103 113 ...
##  $ prothrombin : num  12.2 11.2 10.6 11 11.6 10.6 11.3 11.5 11.5 11.5 ...
##  $ histologic  : int  4 4 3 3 3 3 3 3 3 3 ...
##  $ status2     : num  1 1 0 0 0 0 0 0 0 0 ...

Data are available for 312 study participants, with an unequal number of observations per participant. Calculate various summary statistics for the number of observations per participant.

pbc2 %>%
  group_by(id) %>%
  summarize(m=sum(!is.na(albumin))) %>%
  summarize(min(m), median(m), max(m))
## # A tibble: 1 x 3
##   `min(m)` `median(m)` `max(m)`
##      <int>       <dbl>    <int>
## 1        1           5       16
subset(pbc2, !is.na(albumin)) %>% 
  count(id, name='m') -> num
with(num, table(m))
## m
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
## 27 26 32 44 30 23 19 23 17 23 16 12  6  5  6  3

Using ggplot, plot the observed serum albumin levels over time for the first 16 study participants, along with a linear regression line.

ggplot(subset(pbc2, is.element(id, 1:16)), aes(year, albumin)) +
  facet_wrap( ~ id) +
  geom_point() +
  geom_smooth(se=FALSE, method='lm') + 
  scale_x_continuous(name='Time, years', limits=c(0,15)) +
  scale_y_continuous(name='Serum albumin, g/dl') +
  mytheme()
## `geom_smooth()` using formula 'y ~ x'

Note that there are other options for scatterplot smoothers; see ?geom_smooth for details.

3 Exploratory data analysis

With longitudinal data, exploratory data analysis should focus on both the mean (e.g., how the average outcome changes over time, whether the change over time differs between groups) and covariance (i.e., the variance in the outcome and the correlation between repeated measures). First, explore the mean by plotting the observed individual-level profiles (or, trajectories) in albumin level over time, along with a smoothing line for each treatment group to display the average trend.

pbc2 <- within(pbc2, {
  levels(drug) <- c('Placebo', 'Treatment')})

ggplot(subset(pbc2, albumin<6), aes(year, albumin, group=id)) +
  geom_line(aes(color=drug), alpha=0.4) +
  geom_smooth(aes(year, albumin, group=drug, color=drug), se=FALSE, size=1.5) +
  scale_x_continuous(name='Time, years', limits=c(0,15)) +
  scale_y_continuous(name='Serum albumin, g/dl') + 
  scale_color_manual(name='Intervention', values=c('#00BFC4', '#F8766D')) +
  mytheme()

Next, calculate summary statistics for the variance in albumin levels and correlation between pairs of albumin measurements. Such summary statistics can be obtained by reshaping the dataset into so-called wide format, in which each participant has a single row of data, but multiple columns, with each column corresponding to a measurement time.

pbc2 %>% 
  group_by(id) %>% 
  mutate(obs_num = row_number()) %>%
  ungroup() %>%
  as.data.frame() -> pbc2

pbc2_wide <- reshape(subset(pbc2, select=c(id, obs_num, albumin)),
                     v.names='albumin', idvar='id', timevar='obs_num', direction='wide')
head(pbc2_wide, 2)
##   id albumin.1 albumin.2 albumin.3 albumin.4 albumin.5 albumin.6 albumin.7 albumin.8 albumin.9 albumin.10
## 1  1      2.60      2.94        NA        NA        NA        NA        NA        NA        NA         NA
## 3  2      4.14      3.60      3.55      3.92      3.32      2.92      2.73       2.8      2.67         NA
##   albumin.11 albumin.12 albumin.13 albumin.14 albumin.15 albumin.16
## 1         NA         NA         NA         NA         NA         NA
## 3         NA         NA         NA         NA         NA         NA

The commands for calculating variances and correlations, as well as the number of observations at each pair of measurement times (or ‘pairwise complete’ observations), are somewhat inelegant and not displayed below. Note that for the purposes of illustration, I restricted the calculation to the first 8 pairs of measurements.

##    1    2    3    4    5    6    7    8 
## 0.18 0.27 0.26 0.28 0.19 0.25 0.25 0.19
##      1    2    3    4    5    6    7    8
## 1 1.00 0.45 0.38 0.35 0.41 0.29 0.36 0.24
## 2 0.45 1.00 0.48 0.47 0.49 0.45 0.41 0.29
## 3 0.38 0.48 1.00 0.49 0.52 0.49 0.55 0.38
## 4 0.35 0.47 0.49 1.00 0.65 0.56 0.56 0.33
## 5 0.41 0.49 0.52 0.65 1.00 0.67 0.59 0.51
## 6 0.29 0.45 0.49 0.56 0.67 1.00 0.68 0.58
## 7 0.36 0.41 0.55 0.56 0.59 0.68 1.00 0.68
## 8 0.24 0.29 0.38 0.33 0.51 0.58 0.68 1.00
##     1   2   3   4   5   6   7   8
## 1 312 285 259 227 183 153 130 111
## 2 285 285 259 227 183 153 130 111
## 3 259 259 259 227 183 153 130 111
## 4 227 227 227 227 183 153 130 111
## 5 183 183 183 183 183 153 130 111
## 6 153 153 153 153 153 153 130 111
## 7 130 130 130 130 130 130 130 111
## 8 111 111 111 111 111 111 111 111

When interpreting these summary statistics, remember that the data are unbalanced and not equally spaced. Data collection occurred approximately every 6 months for the first year, and yearly thereafter.

4 Estimating equations

Using geeglm, fit a series of models based on (generalized) estimating equations, considering alternative specifications for the mean model (i.e., main effects of treatment and time, treatment-by-time interaction) and alternative specifications for the correlation model (i.e., independence, exchangeable). In addition to the typical formula and data arguments, geeglm requires the following arguments: family, to specify the error distribution and link function; id, to identify the clustering variable; and corstr, to specify the correlation model. See ?geeglm for details. Note that by default, geeglm includes a scale (or dispersion) parameter for the variance. This parameter can be particularly useful for Poisson (count) regression or logistic (binary) regression models when the variance is not equal to the model-assumed variance (for example, for Poisson outcomes, the variance is assumed to be equal to the mean). In my opinion, this parameter is less useful for continuous outcomes; therefore, I specified scale.fix=TRUE to fix the scale parameter at 1.

# Main effects of treatment and time, independence correlation
gee_m1 <- geeglm(formula=albumin ~ drug + year, family='gaussian', 
                 data=pbc2, id=id, corstr='independence', scale.fix=TRUE)
summary(gee_m1)
## 
## Call:
## geeglm(formula = albumin ~ drug + year, family = "gaussian", 
##     data = pbc2, id = id, corstr = "independence", scale.fix = TRUE)
## 
##  Coefficients:
##                Estimate   Std.err      Wald Pr(>|W|)    
## (Intercept)    3.502991  0.031715 12199.424  < 2e-16 ***
## drugTreatment  0.017907  0.045254     0.157    0.692    
## year          -0.038939  0.005593    48.478 3.34e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Scale is fixed.
## 
## Number of clusters:   312  Maximum cluster size: 16
# Main effects of treatment and time, exchangeable correlation
gee_m2 <- geeglm(formula=albumin ~ drug + year,  family='gaussian', 
                 data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m2)
## 
## Call:
## geeglm(formula = albumin ~ drug + year, family = "gaussian", 
##     data = pbc2, id = id, corstr = "exchangeable", scale.fix = TRUE)
## 
##  Coefficients:
##               Estimate  Std.err     Wald Pr(>|W|)    
## (Intercept)    3.51804  0.03128 12649.65   <2e-16 ***
## drugTreatment  0.00417  0.04547     0.01     0.93    
## year          -0.07745  0.00497   242.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Scale is fixed.
## 
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.704  0.0661
## Number of clusters:   312  Maximum cluster size: 16
# Interaction between treatment and time, exchangeable correlation
gee_m3 <- geeglm(formula=albumin ~ drug*year, family='gaussian', 
                 data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m3)
## 
## Call:
## geeglm(formula = albumin ~ drug * year, family = "gaussian", 
##     data = pbc2, id = id, corstr = "exchangeable", scale.fix = TRUE)
## 
##  Coefficients:
##                    Estimate  Std.err     Wald Pr(>|W|)    
## (Intercept)         3.52180  0.03168 12355.38   <2e-16 ***
## drugTreatment      -0.00317  0.04559     0.00     0.94    
## year               -0.07914  0.00769   105.78   <2e-16 ***
## drugTreatment:year  0.00331  0.00999     0.11     0.74    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Scale is fixed.
## 
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.703  0.0657
## Number of clusters:   312  Maximum cluster size: 16
# Treatment-specific time effects, exchangeable correlation
gee_m4 <- geeglm(formula=albumin ~ drug:year, family='gaussian', 
                 data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m4)
## 
## Call:
## geeglm(formula = albumin ~ drug:year, family = "gaussian", data = pbc2, 
##     id = id, corstr = "exchangeable", scale.fix = TRUE)
## 
##  Coefficients:
##                    Estimate  Std.err  Wald Pr(>|W|)    
## (Intercept)         3.52021  0.02280 23840   <2e-16 ***
## drugPlacebo:year   -0.07910  0.00762   108   <2e-16 ***
## drugTreatment:year -0.07585  0.00632   144   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Scale is fixed.
## 
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.703  0.0656
## Number of clusters:   312  Maximum cluster size: 16
# 95% confidence intervals
broom::tidy(gee_m4, conf.int=TRUE)
## # A tibble: 3 x 7
##   term               estimate std.error statistic p.value conf.low conf.high
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)          3.52     0.0228     23840.       0   3.48      3.56  
## 2 drugPlacebo:year    -0.0791   0.00762      108.       0  -0.0940   -0.0642
## 3 drugTreatment:year  -0.0759   0.00632      144.       0  -0.0882   -0.0635

According to the model with treatment-specific time effects and an exchangeable correlation model (gee_m4), the mean albumin level at baseline (time 0) is 3.52 g/dl. For the placebo group, a one-year increase in time is associated with a 0.079 g/dl decrease in albumin levels. For the treatment group, a one-year increase in time is associated with a 0.076 g/dl decrease in bilirubin levels. There is no evidence that the average trend in albumin levels differs between the placebo and treatment groups.

Residual analysis can be used to evaluate the adequacy of a fitted model.

plot(pbc2$year, residuals(gee_m4), xlab='Time, years', ylab='Residuals')
abline(h=0, col='gray', lwd=2)
lines(lowess(pbc2$year, residuals(gee_m4)), col='red', lwd=2)

Ignoring the two outliers, which in practice we would treat more seriously, assumptions regarding the mean model do not appear to be satisfied (the lowess smoother does not approximate the flat line at 0). Therefore, we will evaluate whether an alternative specification for the mean model provides a better fit. In particular, in the plot of observed individual-level profiles in albumin level over time, the scatterplot smoothers suggested a non-linear association between time and albumin level for both the treatment and placebo groups. Consider adding quadratic terms for time to the ‘preferred’ model.

gee_m5 <- geeglm(formula=albumin ~ drug:year + drug:I(year^2), family='gaussian',
                 data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
anova(gee_m5)
## Analysis of 'Wald statistic' Table
## Model: gaussian, link: identity
## Response: albumin
## Terms added sequentially (first to last)
## 
##                Df  X2 P(>|Chi|)    
## drug:year       2 241    <2e-16 ***
## drug:I(year^2)  2   7     0.031 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pbc2$year, residuals(gee_m5), xlab='Time, years', ylab='Residuals')
abline(h=0, col='gray', lwd=2)
lines(lowess(pbc2$year, residuals(gee_m5)), col='red', lwd=2)

This file was created from:

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] JM_1.4-8        survival_3.2-11 nlme_3.1-152    MASS_7.3-54     ggplot2_3.3.5   dplyr_1.0.7    
## [7] geepack_1.3-2  
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         pillar_1.6.1      compiler_4.1.0    tools_4.1.0       digest_0.6.27    
##  [6] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2      gtable_0.3.0      lattice_0.20-44  
## [11] mgcv_1.8-35       pkgconfig_2.0.3   rlang_0.4.11      Matrix_1.3-3      cli_3.0.0        
## [16] yaml_2.2.1        xfun_0.24         withr_2.4.2       stringr_1.4.0     knitr_1.33       
## [21] generics_0.1.0    vctrs_0.3.8       grid_4.1.0        tidyselect_1.1.1  glue_1.4.2       
## [26] R6_2.5.0          fansi_0.5.0       rmarkdown_2.9     farver_2.1.0      purrr_0.3.4      
## [31] tidyr_1.1.3       magrittr_2.0.1    backports_1.2.1   scales_1.1.1      ellipsis_0.3.2   
## [36] htmltools_0.5.1.1 colorspace_2.0-2  labeling_0.4.2    utf8_1.2.1        stringi_1.6.2    
## [41] munsell_0.5.0     broom_0.7.8       crayon_1.4.1