This file provides an analysis in R of the data collected from a randomized controlled trial of treatment with D-penicillamine for primary biliary cirrhosis (PBC), now primary biliary cholangitis, a rare autoimmune liver disease.
Disclaimer: because this is a didactic exercise, we will consider a range of potential models. In practice, model fitting should be guided by a priori scientific knowledge and exploratory data analysis, and should be directly relevant to the scientific question of interest.
First, install the lme4, dplyr, ggplot2, and JM extension packages.
install.packages(c('lme4', 'dplyr', 'ggplot2', 'JM'), repos='https://cloud.r-project.org')
Next, load these packages. I also included some options for graphics as a user-specified function. (Some of the figures in this document are meant to be of publication quality, while others are only used for exploring the data and evaluating fitted models.)
library(lme4)
library(dplyr)
library(ggplot2)
library(JM)
theme_set(theme_bw())
mytheme <- function(...){
theme(panel.grid.minor=element_blank(),
plot.margin=unit(c(0.5, 0.5, 0.5, 0.5), 'lines'),
axis.title.x=element_text(size=12, vjust=0),
axis.title.y=element_text(size=12, vjust=1, angle=90),
legend.title=element_text(size=12, face='bold'),
legend.text=element_text(size=12),
legend.position='top', legend.direction='horizontal',
legend.key=element_blank(),
legend.key.width=unit(1.5,"cm"),
legend.key.height=unit(0.35,'cm'),
strip.text.y=element_text(size=12))}
The PBC data are provided as part of the JM extension package. Note that there is a help file for the dataset that can be accessed using ?pbc2. The dataset is in so-called long format, in which each participant has multiple rows, with each row corresponding to a measurement time. The data are sorted by id and year; software packages for longitudinal data analysis typically assume that the data are properly sorted.
data(pbc2)
head(pbc2, 11)
## id years status drug age sex year ascites hepatomegaly spiders
## 1 1 1.09517 dead D-penicil 58.76684 female 0.0000000 Yes Yes Yes
## 2 1 1.09517 dead D-penicil 58.76684 female 0.5256817 Yes Yes Yes
## 3 2 14.15234 alive D-penicil 56.44782 female 0.0000000 No Yes Yes
## 4 2 14.15234 alive D-penicil 56.44782 female 0.4983025 No Yes Yes
## 5 2 14.15234 alive D-penicil 56.44782 female 0.9993429 No Yes Yes
## 6 2 14.15234 alive D-penicil 56.44782 female 2.1027270 No Yes Yes
## 7 2 14.15234 alive D-penicil 56.44782 female 4.9008871 Yes Yes Yes
## 8 2 14.15234 alive D-penicil 56.44782 female 5.8892783 Yes Yes Yes
## 9 2 14.15234 alive D-penicil 56.44782 female 6.8858833 Yes Yes Yes
## 10 2 14.15234 alive D-penicil 56.44782 female 7.8907020 Yes Yes Yes
## 11 2 14.15234 alive D-penicil 56.44782 female 8.8325485 Yes Yes Yes
## edema serBilir serChol albumin alkaline SGOT platelets prothrombin histologic status2
## 1 edema despite diuretics 14.5 261 2.60 1718 138.0 190 12.2 4 1
## 2 edema despite diuretics 21.3 NA 2.94 1612 6.2 183 11.2 4 1
## 3 No edema 1.1 302 4.14 7395 113.5 221 10.6 3 0
## 4 No edema 0.8 NA 3.60 2107 139.5 188 11.0 3 0
## 5 No edema 1.0 NA 3.55 1711 144.2 161 11.6 3 0
## 6 No edema 1.9 NA 3.92 1365 144.2 122 10.6 3 0
## 7 edema no diuretics 2.6 230 3.32 1110 131.8 135 11.3 3 0
## 8 edema despite diuretics 3.6 NA 2.92 996 131.8 100 11.5 3 0
## 9 edema despite diuretics 4.2 NA 2.73 860 145.7 103 11.5 3 0
## 10 edema despite diuretics 3.6 244 2.80 779 119.0 113 11.5 3 0
## 11 edema despite diuretics 4.6 237 2.67 669 88.0 100 11.5 3 0
str(pbc2)
## 'data.frame': 1945 obs. of 20 variables:
## $ id : Factor w/ 312 levels "1","2","3","4",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ years : num 1.1 1.1 14.2 14.2 14.2 ...
## $ status : Factor w/ 3 levels "alive","transplanted",..: 3 3 1 1 1 1 1 1 1 1 ...
## $ drug : Factor w/ 2 levels "placebo","D-penicil": 2 2 2 2 2 2 2 2 2 2 ...
## $ age : num 58.8 58.8 56.4 56.4 56.4 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 2 2 2 2 2 2 2 ...
## $ year : num 0 0.526 0 0.498 0.999 ...
## $ ascites : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 2 2 2 2 ...
## $ hepatomegaly: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ spiders : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ edema : Factor w/ 3 levels "No edema","edema no diuretics",..: 3 3 1 1 1 1 2 3 3 3 ...
## $ serBilir : num 14.5 21.3 1.1 0.8 1 1.9 2.6 3.6 4.2 3.6 ...
## $ serChol : int 261 NA 302 NA NA NA 230 NA NA 244 ...
## $ albumin : num 2.6 2.94 4.14 3.6 3.55 3.92 3.32 2.92 2.73 2.8 ...
## $ alkaline : int 1718 1612 7395 2107 1711 1365 1110 996 860 779 ...
## $ SGOT : num 138 6.2 113.5 139.5 144.2 ...
## $ platelets : int 190 183 221 188 161 122 135 100 103 113 ...
## $ prothrombin : num 12.2 11.2 10.6 11 11.6 10.6 11.3 11.5 11.5 11.5 ...
## $ histologic : int 4 4 3 3 3 3 3 3 3 3 ...
## $ status2 : num 1 1 0 0 0 0 0 0 0 0 ...
Data are available for 312 study participants, with an unequal number of observations per participant. Calculate various summary statistics for the number of observations per participant.
pbc2 %>%
group_by(id) %>%
summarize(m=sum(!is.na(albumin))) %>%
summarize(min(m), median(m), max(m))
## # A tibble: 1 x 3
## `min(m)` `median(m)` `max(m)`
## <int> <dbl> <int>
## 1 1 5 16
subset(pbc2, !is.na(albumin)) %>%
count(id, name='m') -> num
with(num, table(m))
## m
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 27 26 32 44 30 23 19 23 17 23 16 12 6 5 6 3
Using ggplot, plot the observed serum albumin levels over time for the first 16 study participants, along with a linear regression line and an indicator of when the participant died or was censored. (Liver transplantation is included as a censoring event, which we could debate.)
ggplot(subset(pbc2, is.element(id, 1:16)), aes(year, albumin)) +
facet_wrap( ~ id) +
geom_vline(aes(xintercept=years, color=factor(status!='dead')), lwd=1) +
geom_point() +
geom_smooth(se=FALSE, method='lm') +
scale_x_continuous(name='Time, years', limits=c(0,15)) +
scale_y_continuous(name='Serum albumin, g/dl') +
scale_colour_manual(name='Status', breaks=c(FALSE, TRUE), values=c('red', 'darkgray'),
labels=c('Death', 'Censored')) +
mytheme()
## `geom_smooth()` using formula 'y ~ x'
Note that there are other options for scatterplot smoothers; see ?geom_smooth for details.
In our analysis, we will assume that albumin levels follow (approximately) a Normal distribution. A histogram of albumin levels at baseline (year 0) can help to verify this assumption.
pbc2 <- within(pbc2, {
levels(drug) <- c('Placebo', 'Treatment')})
ggplot(subset(pbc2, year==0), aes(x=albumin)) +
geom_histogram(color='black', fill='white') +
facet_wrap(~ drug) +
scale_x_continuous(name='Serum albumin, g/dl') +
scale_y_continuous(name='Count') +
mytheme()
With longitudinal data, exploratory data analysis should focus on both the mean (e.g., how the average outcome changes over time, whether the change over time differs between groups) and covariance (i.e., the variance in the outcome and the correlation between repeated measures). First, explore the mean by plotting the observed individual-level profiles (or, trajectories) in albumin level over time, along with a smoothing line for each treatment group to display the average trend.
ggplot(subset(pbc2, albumin<6), aes(year, albumin, group=id)) +
geom_line(aes(color=drug), alpha=0.4) +
geom_smooth(aes(year, albumin, group=drug, color=drug), se=FALSE, size=1.5) +
scale_x_continuous(name='Time, years', limits=c(0,15)) +
scale_y_continuous(name='Serum albumin, g/dl') +
scale_color_manual(name='Intervention', values=c('#00BFC4', '#F8766D')) +
mytheme()
Next, calculate summary statistics for the variance in albumin levels and correlation between pairs of albumin measurements. Such summary statistics can be obtained by reshaping the dataset into so-called wide format, in which each participant has a single row of data, but multiple columns, with each column corresponding to a measurement time.
pbc2 %>%
group_by(id) %>%
mutate(obs_num = row_number()) %>%
ungroup() %>%
as.data.frame() -> pbc2
pbc2_wide <- reshape(subset(pbc2, select=c(id, obs_num, albumin)),
v.names='albumin', idvar='id', timevar='obs_num', direction='wide')
head(pbc2_wide, 2)
## id albumin.1 albumin.2 albumin.3 albumin.4 albumin.5 albumin.6 albumin.7 albumin.8 albumin.9 albumin.10
## 1 1 2.60 2.94 NA NA NA NA NA NA NA NA
## 3 2 4.14 3.60 3.55 3.92 3.32 2.92 2.73 2.8 2.67 NA
## albumin.11 albumin.12 albumin.13 albumin.14 albumin.15 albumin.16
## 1 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
The commands for calculating variances and correlations, as well as the number of observations at each pair of measurement times (or ‘pairwise complete’ observations), are somewhat inelegant and not displayed below. Note that for the purposes of illustration, I restricted the calculation to the first 8 pairs of measurements.
## 1 2 3 4 5 6 7 8
## 0.18 0.27 0.26 0.28 0.19 0.25 0.25 0.19
## 1 2 3 4 5 6 7 8
## 1 1.00 0.45 0.38 0.35 0.41 0.29 0.36 0.24
## 2 0.45 1.00 0.48 0.47 0.49 0.45 0.41 0.29
## 3 0.38 0.48 1.00 0.49 0.52 0.49 0.55 0.38
## 4 0.35 0.47 0.49 1.00 0.65 0.56 0.56 0.33
## 5 0.41 0.49 0.52 0.65 1.00 0.67 0.59 0.51
## 6 0.29 0.45 0.49 0.56 0.67 1.00 0.68 0.58
## 7 0.36 0.41 0.55 0.56 0.59 0.68 1.00 0.68
## 8 0.24 0.29 0.38 0.33 0.51 0.58 0.68 1.00
## 1 2 3 4 5 6 7 8
## 1 312 285 259 227 183 153 130 111
## 2 285 285 259 227 183 153 130 111
## 3 259 259 259 227 183 153 130 111
## 4 227 227 227 227 183 153 130 111
## 5 183 183 183 183 183 153 130 111
## 6 153 153 153 153 153 153 130 111
## 7 130 130 130 130 130 130 130 111
## 8 111 111 111 111 111 111 111 111
When interpreting these summary statistics, remember that the data are unbalanced and not equally spaced. Data collection occurred approximately every 6 months for the first year, and yearly thereafter.
Using lmer, fit a series a linear mixed-effects models, considering alternative specifications for the mean model (i.e., main effects of treatment and time, treatment-by-time interaction) and alternative specifications for the random effects (i.e., random intercepts only, or random intercepts and random slopes). Note that random-effects terms are included in the formula argument and are distinguished by a vertical bar that separates expressions for covariates (1 for random intercepts only) from grouping factors; see ?lmer for details. (To fit generalized linear mixed-effects models for count or binary outcomes, use glmer; see ?glmer for details.)
# Main effects of treatment and time, random intercepts
lme_m1 <- lmer(formula=albumin ~ (1 | id) + drug + year, data=pbc2)
summary(lme_m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: albumin ~ (1 | id) + drug + year
## Data: pbc2
##
## REML criterion at convergence: 2047.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.3506 -0.5243 0.0726 0.5420 10.7479
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.1281 0.3579
## Residual 0.1234 0.3514
## Number of obs: 1945, groups: id, 312
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.520719 0.032582 108.058
## drugTreatment 0.005631 0.044905 0.125
## year -0.074009 0.003030 -24.429
##
## Correlation of Fixed Effects:
## (Intr) drgTrt
## drugTretmnt -0.692
## year -0.212 -0.005
# Interaction between treatment and time, random intercepts
lme_m2 <- lmer(formula=albumin ~ (1 | id) + drug*year, data=pbc2)
summary(lme_m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: albumin ~ (1 | id) + drug * year
## Data: pbc2
##
## REML criterion at convergence: 2055.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.3440 -0.5248 0.0767 0.5411 10.7322
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.1280 0.3578
## Residual 0.1235 0.3514
## Number of obs: 1945, groups: id, 312
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.524882 0.033350 105.694
## drugTreatment -0.002564 0.047047 -0.055
## year -0.075826 0.004350 -17.431
## drugTreatment:year 0.003537 0.006063 0.583
##
## Correlation of Fixed Effects:
## (Intr) drgTrt year
## drugTretmnt -0.709
## year -0.298 0.211
## drgTrtmnt:y 0.214 -0.299 -0.717
# Treatment-specific time effects, random intercepts
lme_m3 <- lmer(formula=albumin ~ (1 | id) + drug:year, data=pbc2)
summary(lme_m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: albumin ~ (1 | id) + drug:year
## Data: pbc2
##
## REML criterion at convergence: 2051.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.3450 -0.5257 0.0770 0.5407 10.7340
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 0.1275 0.3571
## Residual 0.1235 0.3514
## Number of obs: 1945, groups: id, 312
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.523609 0.023490 150.00
## drugPlacebo:year -0.075756 0.004251 -17.82
## drugTreatment:year -0.072316 0.004128 -17.52
##
## Correlation of Fixed Effects:
## (Intr) drgPl:
## drugPlcb:yr -0.215
## drgTrtmnt:y -0.217 0.047
# Treatment-specific time effects, random intercepts + slopes
lme_m4 <- lmer(formula=albumin ~ (year | id) + drug:year, data=pbc2)
summary(lme_m4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: albumin ~ (year | id) + drug:year
## Data: pbc2
##
## REML criterion at convergence: 1939.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6469 -0.5083 0.0583 0.5318 11.0328
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 0.120600 0.34728
## year 0.003055 0.05527 -0.01
## Residual 0.104488 0.32325
## Number of obs: 1945, groups: id, 312
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.540634 0.022798 155.30
## drugPlacebo:year -0.090361 0.007665 -11.79
## drugTreatment:year -0.087260 0.007511 -11.62
##
## Correlation of Fixed Effects:
## (Intr) drgPl:
## drugPlcb:yr -0.166
## drgTrtmnt:y -0.169 0.028
# 95% confidence intervals
confint(lme_m4, parm='beta_')
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## (Intercept) 3.4952518 3.58531708
## drugPlacebo:year -0.1066624 -0.07430536
## drugTreatment:year -0.1031688 -0.07154347
# Likelihood ratio test for inclusion of random slopes
anova(lme_m4, lme_m3)
## refitting model(s) with ML (instead of REML)
## Data: pbc2
## Models:
## lme_m3: albumin ~ (1 | id) + drug:year
## lme_m4: albumin ~ (year | id) + drug:year
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lme_m3 5 2037.2 2065.0 -1013.6 2027.2
## lme_m4 7 1931.6 1970.6 -958.8 1917.6 109.54 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I prefer the model with treatment-specific time effects and random intercepts and slopes (lme_m4). At baseline, the mean albumin level is 3.54 g/dl, which is constrained to be equal between the treatment groups. There is modest heterogeneity in albumin levels at baseline; compare \(\hat{\beta}_0=3.54\) to \(\hat{D}_{11}=0.35^2\). Note that \(3.54 \pm 1.96\times 0.35 = (2.86, 4.22)\) g/dl quantifies variability across participants in albumin levels at baseline. NB: this is not a confidence interval for \(\beta_0\).
For the placebo group, a one-year increase in time is associated with a 0.09 g/dl decrease in albumin levels. For the treatment group, a one-year increase in time is associated with a 0.087 g/dl decrease in bilirubin levels. There is no evidence that the average trend in albumin levels differs between the placebo and treatment groups.
There is substantial heterogeneity in albumin trends over time; compare \(\hat{\beta}_1=-0.09\) and \(\hat{\beta}_2=-0.087\) to \(\hat{D}_{22}=0.055^2\). Note that \(-0.09 \pm 1.96\times 0.055 = ( -0.199, 0.018)\) and \(-0.087 \pm 1.96\times 0.055 = (-0.196, 0.021)\) quantify the variability across participants in the per-year decrease in albumin levels among the placebo and treatment groups, respectively.
R programming note: everything in R is an object. Objects have classes. A call to lmer (and glmer) create an object of class merMod. There are functions to extract components from objects of class merMod, such as fixef and ranef to extract the fixed-effects and random-effects estimates, respectively. See methods(class=“merMod”) for details.
Residual analysis can be used to evaluate the adequacy of a fitted model. See ?residuals.merMod and ?fitted.merMod for details on extracting the residuals and fitted values, respectively, from a merMod object.
plotResid <- function(x, y, col.lowess='red', ...){
plot(x, y, ...)
abline(h=0, col='gray', lwd=2)
lines(lowess(x, y), col=col.lowess, lwd=2)}
plotQQ <- function(x, ...){
qqnorm(x, ...)
qqline(x, col='gray', lwd=2)}
par(mfrow=c(2,2), mar=c(5,5,2,1))
plotResid(fitted(lme_m4), residuals(lme_m4), xlab='Fitted values', ylab='Residuals')
plotQQ(residuals(lme_m4), main='Residuals')
plotQQ(ranef(lme_m4)$id[,1], main='Random intercepts')
plotQQ(ranef(lme_m4)$id[,2], main='Random slopes')
Ignoring the two outliers, which in practice we would treat more seriously, assumptions regarding the mean model (the red lowess smoother approximates the flat line at 0) and Normality appear to be satisfied.
In the plot of observed individual-level profiles in albumin level over time, the scatterplot smoothers suggested a non-linear association between time and albumin level for both the treatment and placebo groups. Consider adding quadratic terms for time to the ‘preferred’ models.
library(geepack)
gee_m5 <- geeglm(formula=albumin ~ drug:year + drug:I(year^2), family='gaussian',
data=pbc2, id=id, corstr='exchangeable', scale.fix=TRUE)
anova(gee_m5)
## Analysis of 'Wald statistic' Table
## Model: gaussian, link: identity
## Response: albumin
## Terms added sequentially (first to last)
##
## Df X2 P(>|Chi|)
## drug:year 2 240.698 < 2e-16 ***
## drug:I(year^2) 2 6.952 0.03092 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lme_m5 <- lmer(formula=albumin ~ (year | id) + drug:year + drug:I(year^2), data=pbc2)
anova(lme_m5, lme_m4)
## refitting model(s) with ML (instead of REML)
## Data: pbc2
## Models:
## lme_m4: albumin ~ (year | id) + drug:year
## lme_m5: albumin ~ (year | id) + drug:year + drug:I(year^2)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lme_m4 7 1931.6 1970.6 -958.8 1917.6
## lme_m5 9 1935.6 1985.8 -958.8 1917.6 0.0092 2 0.9954
What could explain the disparity in these results? Hint: consider why participants are lost to follow-up.
This file was created from:
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] geepack_1.3-2 JM_1.4-8 survival_3.2-11 nlme_3.1-152 MASS_7.3-54 ggplot2_3.3.5
## [7] dplyr_1.0.7 lme4_1.1-27.1 Matrix_1.3-3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 highr_0.9 compiler_4.1.0 pillar_1.6.1 nloptr_1.2.2.2
## [6] tools_4.1.0 boot_1.3-28 digest_0.6.27 gtable_0.3.0 evaluate_0.14
## [11] lifecycle_1.0.0 tibble_3.1.2 lattice_0.20-44 mgcv_1.8-35 pkgconfig_2.0.3
## [16] rlang_0.4.11 cli_3.0.0 yaml_2.2.1 xfun_0.24 withr_2.4.2
## [21] stringr_1.4.0 knitr_1.33 generics_0.1.0 vctrs_0.3.8 grid_4.1.0
## [26] tidyselect_1.1.1 glue_1.4.2 R6_2.5.0 fansi_0.5.0 rmarkdown_2.9
## [31] minqa_1.2.4 tidyr_1.1.3 farver_2.1.0 purrr_0.3.4 magrittr_2.0.1
## [36] backports_1.2.1 scales_1.1.1 htmltools_0.5.1.1 ellipsis_0.3.2 colorspace_2.0-2
## [41] labeling_0.4.2 utf8_1.2.1 stringi_1.6.2 munsell_0.5.0 broom_0.7.8
## [46] crayon_1.4.1