This file provides an analysis in R of the data collected from a randomized controlled trial of treatment with D-penicillamine for primary biliary cirrhosis (PBC), now primary biliary cholangitis, a rare autoimmune liver disease.
Disclaimer: because this is a didactic exercise, we will consider a range of potential models. In practice, model fitting should be guided by a priori scientific knowledge and exploratory data analysis, and should be directly relevant to the scientific question of interest.
First, install the geepack, dplyr, ggplot2, and JM extension packages.
install.packages(c('geepack', 'dplyr', 'ggplot2', 'JM'), repos='https://cloud.r-project.org')
Next, load these packages. I also included some options for graphics as a user-specified function. (Some of the figures in this document are meant to be of publication quality, while others are only used for exploring the data and evaluating fitted models.)
library(geepack)
library(dplyr)
library(ggplot2)
library(JM)
theme_set(theme_bw())
mytheme <- function(...){
theme(panel.grid.minor=element_blank(),
plot.margin=unit(c(0.5, 0.5, 0.5, 0.5), 'lines'),
axis.title.x=element_text(size=12, vjust=0),
axis.title.y=element_text(size=12, vjust=1, angle=90),
legend.title=element_text(size=12, face='bold'),
legend.text=element_text(size=12),
legend.position='top', legend.direction='horizontal',
legend.key=element_blank(),
legend.key.width=unit(1.5,"cm"),
legend.key.height=unit(0.35,'cm'),
strip.text.y=element_text(size=12))}
The PBC data are provided as part of the JM extension package. Note that there is a help file for the dataset that can be accessed using ?pbc2. The dataset is in so-called long format, in which each participant has multiple rows, with each row corresponding to a measurement time. The data are sorted by id and year; software packages for longitudinal data analysis typically assume that the data are properly sorted.
data(pbc2)
head(pbc2, 11)
## id years status drug age sex year ascites hepatomegaly spiders
## 1 1 1.09517 dead D-penicil 58.76684 female 0.0000000 Yes Yes Yes
## 2 1 1.09517 dead D-penicil 58.76684 female 0.5256817 Yes Yes Yes
## 3 2 14.15234 alive D-penicil 56.44782 female 0.0000000 No Yes Yes
## 4 2 14.15234 alive D-penicil 56.44782 female 0.4983025 No Yes Yes
## 5 2 14.15234 alive D-penicil 56.44782 female 0.9993429 No Yes Yes
## 6 2 14.15234 alive D-penicil 56.44782 female 2.1027270 No Yes Yes
## 7 2 14.15234 alive D-penicil 56.44782 female 4.9008871 Yes Yes Yes
## 8 2 14.15234 alive D-penicil 56.44782 female 5.8892783 Yes Yes Yes
## 9 2 14.15234 alive D-penicil 56.44782 female 6.8858833 Yes Yes Yes
## 10 2 14.15234 alive D-penicil 56.44782 female 7.8907020 Yes Yes Yes
## 11 2 14.15234 alive D-penicil 56.44782 female 8.8325485 Yes Yes Yes
## edema serBilir serChol albumin alkaline SGOT platelets prothrombin histologic status2
## 1 edema despite diuretics 14.5 261 2.60 1718 138.0 190 12.2 4 1
## 2 edema despite diuretics 21.3 NA 2.94 1612 6.2 183 11.2 4 1
## 3 No edema 1.1 302 4.14 7395 113.5 221 10.6 3 0
## 4 No edema 0.8 NA 3.60 2107 139.5 188 11.0 3 0
## 5 No edema 1.0 NA 3.55 1711 144.2 161 11.6 3 0
## 6 No edema 1.9 NA 3.92 1365 144.2 122 10.6 3 0
## 7 edema no diuretics 2.6 230 3.32 1110 131.8 135 11.3 3 0
## 8 edema despite diuretics 3.6 NA 2.92 996 131.8 100 11.5 3 0
## 9 edema despite diuretics 4.2 NA 2.73 860 145.7 103 11.5 3 0
## 10 edema despite diuretics 3.6 244 2.80 779 119.0 113 11.5 3 0
## 11 edema despite diuretics 4.6 237 2.67 669 88.0 100 11.5 3 0
str(pbc2)
## 'data.frame': 1945 obs. of 20 variables:
## $ id : Factor w/ 312 levels "1","2","3","4",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ years : num 1.1 1.1 14.2 14.2 14.2 ...
## $ status : Factor w/ 3 levels "alive","transplanted",..: 3 3 1 1 1 1 1 1 1 1 ...
## $ drug : Factor w/ 2 levels "placebo","D-penicil": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : num 58.8 58.8 56.4 56.4 56.4 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
## $ year : num 0 0.526 0 0.498 0.999 ...
## $ ascites : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 2 2 2 ...
## $ hepatomegaly: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ spiders : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ edema : Factor w/ 3 levels "No edema","edema no diuretics",..: 3 3 1 1 1 1 2 3 3 3 ...
## $ serBilir : num 14.5 21.3 1.1 0.8 1 1.9 2.6 3.6 4.2 3.6 ...
## $ serChol : int 261 NA 302 NA NA NA 230 NA NA 244 ...
## $ albumin : num 2.6 2.94 4.14 3.6 3.55 3.92 3.32 2.92 2.73 2.8 ...
## $ alkaline : int 1718 1612 7395 2107 1711 1365 1110 996 860 779 ...
## $ SGOT : num 138 6.2 113.5 139.5 144.2 ...
## $ platelets : int 190 183 221 188 161 122 135 100 103 113 ...
## $ prothrombin : num 12.2 11.2 10.6 11 11.6 10.6 11.3 11.5 11.5 11.5 ...
## $ histologic : int 4 4 3 3 3 3 3 3 3 3 ...
## $ status2 : num 1 1 0 0 0 0 0 0 0 0 ...
Data are available for 312 study participants, with an unequal number of observations per participant. Calculate various summary statistics for the number of observations per participant.
pbc2 %>%
group_by(id) %>%
summarize(m=sum(!is.na(albumin))) %>%
summarize(min(m), median(m), max(m))
## # A tibble: 1 x 3
## `min(m)` `median(m)` `max(m)`
## <int> <dbl> <int>
## 1 1 5 16
subset(pbc2, !is.na(albumin)) %>%
count(id, name='m') -> num
with(num, table(m))
## m
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 27 26 32 44 30 23 19 23 17 23 16 12 6 5 6 3
Using ggplot, plot the observed serum albumin levels over time for the first 16 study participants, along with a linear regression line.
ggplot(subset(pbc2, is.element(id, 1:16)), aes(year, albumin)) +
facet_wrap( ~ id) +
geom_point() +
geom_smooth(se=FALSE, method='lm') +
scale_x_continuous(name='Time, years', limits=c(0,15)) +
scale_y_continuous(name='Serum albumin, g/dl') +
mytheme()
## `geom_smooth()` using formula 'y ~ x'
Note that there are other options for scatterplot smoothers; see ?geom_smooth for details.
With longitudinal data, exploratory data analysis should focus on both the mean (e.g., how the average outcome changes over time, whether the change over time differs between groups) and covariance (i.e., the variance in the outcome and the correlation between repeated measures). First, explore the mean by plotting the observed individual-level profiles (or, trajectories) in albumin level over time, along with a smoothing line for each treatment group to display the average trend.
pbc2 <- within(pbc2, {
levels(drug) <- c('Placebo', 'Treatment')})
ggplot(subset(pbc2, albumin<6), aes(year, albumin, group=id)) +
geom_line(aes(color=drug), alpha=0.4) +
geom_smooth(aes(year, albumin, group=drug, color=drug), se=FALSE, size=1.5) +
scale_x_continuous(name='Time, years', limits=c(0,15)) +
scale_y_continuous(name='Serum albumin, g/dl') +
scale_color_manual(name='Intervention', values=c('#00BFC4', '#F8766D')) +
mytheme()
Next, calculate summary statistics for the variance in albumin levels and correlation between pairs of albumin measurements. Such summary statistics can be obtained by reshaping the dataset into so-called wide format, in which each participant has a single row of data, but multiple columns, with each column corresponding to a measurement time.
pbc2 %>%
group_by(id) %>%
mutate(obs_num = row_number()) %>%
ungroup() %>%
as.data.frame() -> pbc2
pbc2_wide <- reshape(subset(pbc2, select=c(id, obs_num, albumin)),
v.names='albumin', idvar='id', timevar='obs_num', direction='wide')
head(pbc2_wide, 2)
## id albumin.1 albumin.2 albumin.3 albumin.4 albumin.5 albumin.6 albumin.7 albumin.8 albumin.9 albumin.10
## 1 1 2.60 2.94 NA NA NA NA NA NA NA NA
## 3 2 4.14 3.60 3.55 3.92 3.32 2.92 2.73 2.8 2.67 NA
## albumin.11 albumin.12 albumin.13 albumin.14 albumin.15 albumin.16
## 1 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
The commands for calculating variances and correlations, as well as the number of observations at each pair of measurement times (or ‘pairwise complete’ observations), are somewhat inelegant and not displayed below. Note that for the purposes of illustration, I restricted the calculation to the first 8 pairs of measurements.
## 1 2 3 4 5 6 7 8
## 0.18 0.27 0.26 0.28 0.19 0.25 0.25 0.19
## 1 2 3 4 5 6 7 8
## 1 1.00 0.45 0.38 0.35 0.41 0.29 0.36 0.24
## 2 0.45 1.00 0.48 0.47 0.49 0.45 0.41 0.29
## 3 0.38 0.48 1.00 0.49 0.52 0.49 0.55 0.38
## 4 0.35 0.47 0.49 1.00 0.65 0.56 0.56 0.33
## 5 0.41 0.49 0.52 0.65 1.00 0.67 0.59 0.51
## 6 0.29 0.45 0.49 0.56 0.67 1.00 0.68 0.58
## 7 0.36 0.41 0.55 0.56 0.59 0.68 1.00 0.68
## 8 0.24 0.29 0.38 0.33 0.51 0.58 0.68 1.00
## 1 2 3 4 5 6 7 8
## 1 312 285 259 227 183 153 130 111
## 2 285 285 259 227 183 153 130 111
## 3 259 259 259 227 183 153 130 111
## 4 227 227 227 227 183 153 130 111
## 5 183 183 183 183 183 153 130 111
## 6 153 153 153 153 153 153 130 111
## 7 130 130 130 130 130 130 130 111
## 8 111 111 111 111 111 111 111 111
When interpreting these summary statistics, remember that the data are unbalanced and not equally spaced. Data collection occurred approximately every 6 months for the first year, and yearly thereafter.
Using geeglm, fit a series of models based on (generalized) estimating equations, considering alternative specifications for the mean model (i.e., main effects of treatment and time, treatment-by-time interaction) and alternative specifications for the correlation model (i.e., independence, exchangeable). In addition to the typical formula and data arguments, geeglm requires the following arguments: family, to specify the error distribution and link function; id, to identify the clustering variable; and corstr, to specify the correlation model. See ?geeglm for details. Note that by default, geeglm includes a scale (or dispersion) parameter for the variance. This parameter can be particularly useful for Poisson (count) regression or logistic (binary) regression models when the variance is not equal to the model-assumed variance (for example, for Poisson outcomes, the variance is assumed to be equal to the mean). In my opinion, this parameter is less useful for continuous outcomes; therefore, I specified scale.fix=TRUE to fix the scale parameter at 1.
# Main effects of treatment and time, independence correlation
gee_m1 <- geeglm(formula=albumin ~ drug + year, family='gaussian',
data=pbc2, id=id, corstr='independence', scale.fix=TRUE)
summary(gee_m1)
##
## Call:
## geeglm(formula = albumin ~ drug + year, family = "gaussian",
## data = pbc2, id = id, corstr = "independence", scale.fix = TRUE)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 3.502991 0.031715 12199.424 < 2e-16 ***
## drugTreatment 0.017907 0.045254 0.157 0.692
## year -0.038939 0.005593 48.478 3.34e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Scale is fixed.
##
## Number of clusters: 312 Maximum cluster size: 16
# Main effects of treatment and time, exchangeable correlation
gee_m2 <- geeglm(formula=albumin ~ drug + year, family='gaussian',
data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m2)
##
## Call:
## geeglm(formula = albumin ~ drug + year, family = "gaussian",
## data = pbc2, id = id, corstr = "exchangeable", scale.fix = TRUE)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 3.51804 0.03128 12649.65 <2e-16 ***
## drugTreatment 0.00417 0.04547 0.01 0.93
## year -0.07745 0.00497 242.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Scale is fixed.
##
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.704 0.0661
## Number of clusters: 312 Maximum cluster size: 16
# Interaction between treatment and time, exchangeable correlation
gee_m3 <- geeglm(formula=albumin ~ drug*year, family='gaussian',
data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m3)
##
## Call:
## geeglm(formula = albumin ~ drug * year, family = "gaussian",
## data = pbc2, id = id, corstr = "exchangeable", scale.fix = TRUE)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 3.52180 0.03168 12355.38 <2e-16 ***
## drugTreatment -0.00317 0.04559 0.00 0.94
## year -0.07914 0.00769 105.78 <2e-16 ***
## drugTreatment:year 0.00331 0.00999 0.11 0.74
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Scale is fixed.
##
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.703 0.0657
## Number of clusters: 312 Maximum cluster size: 16
# Treatment-specific time effects, exchangeable correlation
gee_m4 <- geeglm(formula=albumin ~ drug:year, family='gaussian',
data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
summary(gee_m4)
##
## Call:
## geeglm(formula = albumin ~ drug:year, family = "gaussian", data = pbc2,
## id = id, corstr = "exchangeable", scale.fix = TRUE)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 3.52021 0.02280 23840 <2e-16 ***
## drugPlacebo:year -0.07910 0.00762 108 <2e-16 ***
## drugTreatment:year -0.07585 0.00632 144 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Scale is fixed.
##
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.703 0.0656
## Number of clusters: 312 Maximum cluster size: 16
# 95% confidence intervals
broom::tidy(gee_m4, conf.int=TRUE)
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.52 0.0228 23840. 0 3.48 3.56
## 2 drugPlacebo:year -0.0791 0.00762 108. 0 -0.0940 -0.0642
## 3 drugTreatment:year -0.0759 0.00632 144. 0 -0.0882 -0.0635
According to the model with treatment-specific time effects and an exchangeable correlation model (gee_m4), the mean albumin level at baseline (time 0) is 3.52 g/dl. For the placebo group, a one-year increase in time is associated with a 0.079 g/dl decrease in albumin levels. For the treatment group, a one-year increase in time is associated with a 0.076 g/dl decrease in bilirubin levels. There is no evidence that the average trend in albumin levels differs between the placebo and treatment groups.
Residual analysis can be used to evaluate the adequacy of a fitted model.
plot(pbc2$year, residuals(gee_m4), xlab='Time, years', ylab='Residuals')
abline(h=0, col='gray', lwd=2)
lines(lowess(pbc2$year, residuals(gee_m4)), col='red', lwd=2)
Ignoring the two outliers, which in practice we would treat more seriously, assumptions regarding the mean model do not appear to be satisfied (the lowess smoother does not approximate the flat line at 0). Therefore, we will evaluate whether an alternative specification for the mean model provides a better fit. In particular, in the plot of observed individual-level profiles in albumin level over time, the scatterplot smoothers suggested a non-linear association between time and albumin level for both the treatment and placebo groups. Consider adding quadratic terms for time to the ‘preferred’ model.
gee_m5 <- geeglm(formula=albumin ~ drug:year + drug:I(year^2), family='gaussian',
data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
anova(gee_m5)
## Analysis of 'Wald statistic' Table
## Model: gaussian, link: identity
## Response: albumin
## Terms added sequentially (first to last)
##
## Df X2 P(>|Chi|)
## drug:year 2 241 <2e-16 ***
## drug:I(year^2) 2 7 0.031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pbc2$year, residuals(gee_m5), xlab='Time, years', ylab='Residuals')
abline(h=0, col='gray', lwd=2)
lines(lowess(pbc2$year, residuals(gee_m5)), col='red', lwd=2)
This file was created from:
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] JM_1.4-8 survival_3.2-11 nlme_3.1-152 MASS_7.3-54 ggplot2_3.3.5 dplyr_1.0.7
## [7] geepack_1.3-2
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 pillar_1.6.1 compiler_4.1.0 tools_4.1.0 digest_0.6.27
## [6] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.2 gtable_0.3.0 lattice_0.20-44
## [11] mgcv_1.8-35 pkgconfig_2.0.3 rlang_0.4.11 Matrix_1.3-3 cli_3.0.0
## [16] yaml_2.2.1 xfun_0.24 withr_2.4.2 stringr_1.4.0 knitr_1.33
## [21] generics_0.1.0 vctrs_0.3.8 grid_4.1.0 tidyselect_1.1.1 glue_1.4.2
## [26] R6_2.5.0 fansi_0.5.0 rmarkdown_2.9 farver_2.1.0 purrr_0.3.4
## [31] tidyr_1.1.3 magrittr_2.0.1 backports_1.2.1 scales_1.1.1 ellipsis_0.3.2
## [36] htmltools_0.5.1.1 colorspace_2.0-2 labeling_0.4.2 utf8_1.2.1 stringi_1.6.2
## [41] munsell_0.5.0 broom_0.7.8 crayon_1.4.1