Goal of This Case Study

Our primary objective for this case study is to go through a full analyses and missing data approaches with a binary outcome variable (amenorrhea).

Packages

library(tidyverse)
library(mice)
library(knitr)
library(kableExtra)
library(geepack)
library(multcomp) # for glht()
library(wgeesel) # for IPW-GEE
library(gridExtra)
library(nlme) # lme() 
library(lme4) # glmer() 

pal2use <- c("#000000", "#E69F00", "#56B4E9", "#009E73", 
             "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pal2use2 <- c("#E69F00", "#0072B2")

The Dataset

These data are from a longitudinal clinical trial of contracepting women. In this trial women received an injection of either 100 mg or 150 mg of depot-medroxyprogesterone acetate (DMPA) on the day of randomization and three additional injections at 90-day intervals. There was a final follow-up visit 90 days after the fourth injection, i.e., one year after the first injection. Throughout the study each woman completed a menstrual diary that recorded any vaginal bleeding pattern disturbances. The diary data were used to determine whether a women experienced amenorrhea, the absence of menstrual bleeding for a specified number of days.

A total of 1151 women completed the menstrual diaries and the diary data were used to generate a binary sequence for each woman according to whether or not she had experienced amenorrhea in the four successive three month intervals. In clinical trials of modern hormonal contraceptives, pregnancy is exceedingly rare (and would be regarded as a failure of the contraceptive method), and is not the main outcome of interest in this study. Instead, the outcome of interest is a binary response indicating whether a woman experienced amenorrhea in the four successive three month intervals.

A feature of this clinical trial is that there was substantial dropout. More than one third of the women dropped out before the completion of the trial.

Machin D, Farley T, Busca B, Campbell M and d’Arcangues C. (1988). Assessing changes in vaginal bleeding patterns in contracepting women. Contraception, 38, 165-179.

# Clinical trial of contracepting women (Machin 1988, ALA p415)
ctcw <- read.csv("../Datasets/amenorrhea.csv", 
                 na.strings = ".") %>% 
  mutate(Dosage = ifelse(dose == 1, "150mg", "100mg"))
head(ctcw)
##   id dose visit amenorrhea Dosage
## 1  1    0     1          0  100mg
## 2  1    0     2         NA  100mg
## 3  1    0     3         NA  100mg
## 4  1    0     4         NA  100mg
## 5  2    0     1          0  100mg
## 6  2    0     2         NA  100mg

The dataset contains measurements on 1151 women and contains the following columns:

  1. id: subject ID
  2. dose: 150 mg (1) or 100 mg (0) of DMPA
  3. visit: 1 through 4, indicating the period following each of the four shots
  4. amenorrhea: experiencing amenorrhea (1) or not experiencing it (0), or missing (NA)
  5. (Dosage: human-readable version of dose)

Exploratory Data Analysis

For this dataset, let’s begin by looking at the missing data pattern, since we have a hint above that it’s important. From the plot below, it is clear that missing data are due to dropout: there are no non-monotone missing data patterns in the plot. That is, if Visit 2 is missing, then visits 3 and 4 are also missing. The figure shows that for 198/1151 women (17.2%), visits 2-4 are all missing; for 155/1151 women (13.5%), visits 3-4 are missing; and 84/1151 (7.3%) are missing only visit 4. That leaves 714/1151 (62% of women) with complete data.

ctcw_wide <- ctcw %>% 
  pivot_wider(names_from = visit, values_from = amenorrhea)

# Missing data patterns (package mice) 
mdpatt <- md.pattern(ctcw_wide)

Such a high level of dropout is quite concerning! Stay tuned for the Special Topics portion of Day 2 for more about this. In the meantime, we’re going to forge ahead.

The outcome of amenorrhea is binary, so a spaghetti plot (or any visualization of individual trajectories) won’t get us very far, but we can look at prevalence by time:

# prevalence by visit number and dosage
prev_by_time <- ctcw %>% 
  group_by(visit, Dosage) %>% 
  summarize(n = n(), prevalence = mean(amenorrhea, na.rm = TRUE))

ggplot(prev_by_time, aes(x=visit, y=prevalence, 
                        group=factor(Dosage), 
                        color=factor(Dosage))) + 
  geom_line() + 
  geom_point() + 
  theme_classic(base_size = 13) + 
  labs(color = "Dose") + 
  scale_color_manual(values = pal2use2) + 
  xlab("Time Point (quarterly visits)") + ylab("Prevalence of Amenorrhea") + 
  ggtitle("Prevalence of Amenorrhea by Dosage and Time")

It looks like amenorrhea is more likely at the higher dose across all time points, and prevalence also increases as subjects are on the treatment for a longer period of time. The increase in risk may be slightly higher at the higher dose.

Generalized Estimating Equations (GEE)

We will use the following model:

\[\log\left(\frac{\mu_{ij}}{1-\mu_{ij}} \right) = \beta_0 + \beta_1\text t_{ij} + \beta_2 t_{ij}^2 + \beta_3\text{Dose}_i\times t_{ij} + \beta_4\text{Dose}_i\times t_{ij}^2\] where \(\mu_{ij} = \text{Pr}(Y_{ij}=1)\)

First, we’ll create a new column in the dataset for \(t_{ij}^2\):

ctcw$visit2 <- ctcw$visit^2

Before we go ahead and fit our model, it’s a good idea to make sure the data are ordered by id and time (geeglm() assumes the data are grouped by id):

ctcw <- ctcw[order(ctcw$id, ctcw$visit),]

We’ll experiment with various working correlation structures and compare to using glm(). As a reminder with glm(), the point estimates will match those from using GEE with working independence, but the standard errors are based off assuming all observations are independent, which we would not expect to be valid in the setting of a longitudinal study (we expect dependence between measurements taken on the same individual):

m_ind <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
                id = id, data = ctcw, family = binomial, waves = visit,
                corstr = "independence")
m_exc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
                id = id, data = ctcw, family = binomial, waves = visit,
                corstr = "exchangeable")
m_ar1 <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
                id = id, data = ctcw, family = binomial, waves = visit,
                corstr = "ar1")
m_uns <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
                id = id, data = ctcw, family = binomial, waves = visit,
                corstr = "unstructured")

m_glm <- glm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
             data = ctcw, family = binomial)

In the above, you’ll notice the use of the waves argument. This is important when using working AR1 and unstructured if you have intermittent missingness where the rows may not correspond to sequential visits starting from the first one. For example, if you have someone attend visit 1, 3, and 4 (i.e., misses visit 2) without the waves = visit argument, geeglm() will treat the observations as corresponding to the first, second, and third visit when estimating the working correlation structure. This ends up not being necessary in this example– we do not have intermittent missingness (we only have dropout) and we ordered by time before fitting the models.

Results from working independence:

summary(m_ind)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "independence")
## 
##  Coefficients:
##             Estimate  Std.err    Wald Pr(>|W|)    
## (Intercept) -2.19554  0.17845 151.374  < 2e-16 ***
## visit        0.66983  0.16225  17.045 3.65e-05 ***
## visit2      -0.03031  0.03276   0.856  0.35490    
## visit:dose   0.29730  0.11345   6.867  0.00878 ** 
## dose:visit2 -0.06243  0.02963   4.439  0.03512 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = independence 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    1.001 0.02894
## Number of clusters:   1151  Maximum cluster size: 4

Results from working exchangeable:

summary(m_exc)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.2370  0.1765 160.64  < 2e-16 ***
## visit         0.6967  0.1586  19.31  1.1e-05 ***
## visit2       -0.0328  0.0320   1.05   0.3055    
## visit:dose    0.3284  0.1100   8.91   0.0028 ** 
## dose:visit2  -0.0637  0.0286   4.97   0.0259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0287
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.363  0.0243
## Number of clusters:   1151  Maximum cluster size: 4

Results from working AR1:

summary(m_ar1)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "ar1")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.2485  0.1759 163.33  < 2e-16 ***
## visit         0.7069  0.1578  20.07  7.5e-06 ***
## visit2       -0.0330  0.0318   1.08   0.2990    
## visit:dose    0.3633  0.1107  10.76   0.0010 ** 
## dose:visit2  -0.0790  0.0288   7.50   0.0062 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = ar1 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)    0.997  0.0284
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.493  0.0252
## Number of clusters:   1151  Maximum cluster size: 4

Results from working unstructured:

summary(m_uns)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "unstructured")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.2408  0.1765 161.19   <2e-16 ***
## visit         0.6979  0.1581  19.49    1e-05 ***
## visit2       -0.0314  0.0318   0.98   0.3233    
## visit:dose    0.3380  0.1097   9.50   0.0021 ** 
## dose:visit2  -0.0690  0.0284   5.90   0.0151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = unstructured 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0285
##   Link = identity 
## 
## Estimated Correlation Parameters:
##           Estimate Std.err
## alpha.1:2    0.344  0.0367
## alpha.1:3    0.259  0.0359
## alpha.1:4    0.273  0.0341
## alpha.2:3    0.426  0.0358
## alpha.2:4    0.386  0.0345
## alpha.3:4    0.499  0.0361
## Number of clusters:   1151  Maximum cluster size: 4

Suppose we want to estimate the population odds ratio at 12 months (i.e. \(\text{visit} = 4\) and \(\text{visit}^2 = 16\)). We can use the following code:

pop_odds12 <- glht(m_exc, "4*visit:dose + 16*dose:visit2 = 0")
summary(pop_odds12)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
## 
## Linear Hypotheses:
##                                        Estimate Std. Error z value Pr(>|z|)  
## 4 * visit:dose + 16 * dose:visit2 == 0    0.295      0.144    2.05    0.041 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(pop_odds12)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
## 
## Quantile = 1.96
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                                        Estimate lwr    upr   
## 4 * visit:dose + 16 * dose:visit2 == 0 0.2949   0.0125 0.5773

Generalized Linear Mixed Models

We will address two conditional questions using GLMM:

  1. How do subject-specific risks of amenorrhea change over the course of the study?
  2. What is the influence of dosage on amenorrhea risk? (That is, for each woman in the study, how do we estimate her risk would differ if she was in the other dosage group?)

Model Fitting

We’ll use the same fixed effects/systematic component of the model as we used above for the GEE, but add random intercepts. (Note that we are not going to fit a random intercepts and random slopes model in this case, for reasons discussed in the lecture: there simply isn’t enough information in such a short sequence of observations to estimate subject-specific changes over time in the probability of amenorrhea.) Here’s our model: \[\log\left(\frac{\mu_{ij}}{1-\mu_{ij}} \right) = \beta_0 + \beta_1\text t_{ij} + \beta_2 t_{ij}^2 + \beta_3\text{Dose}_i\times t_{ij} + \beta_4\text{Dose}_i\times t_{ij}^2 + b_i\] where \(\mu_{ij} = \text{Pr}(Y_{ij}=1)\). The nAGQ argument (only for random intercept models) may be increased from the default of 1 to improve the accuracy of the estimates, at the cost of some extra computation time. The documentation recommends increasing to up to 25, although at some point the estimates no longer change and increasing nAGQ further does not provide additional improvement.

glmm.ri <- glmer(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 | id), 
                 data = ctcw, nAGQ = 10, family = "binomial")
summary(glmm.ri)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
##   Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
##  Family: binomial  ( logit )
## Formula: amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 |      id)
##    Data: ctcw
## 
##      AIC      BIC   logLik deviance df.resid 
##     3881     3918    -1934     3869     3610 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.917 -0.430 -0.227  0.471  4.008 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 5.05     2.25    
## Number of obs: 3616, groups:  id, 1151
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.8035     0.3047  -12.48  < 2e-16 ***
## visit         1.1326     0.2681    4.22  2.4e-05 ***
## visit2       -0.0419     0.0548   -0.76   0.4448    
## visit:dose    0.5642     0.1921    2.94   0.0033 ** 
## dose:visit2  -0.1095     0.0496   -2.21   0.0272 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) visit  visit2 vst:ds
## visit       -0.832                     
## visit2       0.710 -0.967              
## visit:dose  -0.052 -0.345  0.434       
## dose:visit2  0.041  0.333 -0.456 -0.953

The fitted model is: \[\text{logit}(P(\text{Amenorrhea}_i | \text{Dose}_i, \text{Visit}_{ij}, b_i)) = -3.80 + 1.13 \times \text{Visit}_{ij}\] \[ -0.04 \times \text{Visit}_{ij}^2 + 0.56 \times \text{Dose}_i \times \text{Visit}_{ij} - 0.11 \times \text{Dose}_{i} \times \text{Visit}_{ij}^2 + b_i\]

Let’s plot the estimated probability by subject and the average of all subject-specific trajectories (the average conditional mean):

## Plot point estimates 
point_est_df <- data.frame(dose = rep(c(0,1), each = 4), 
                           Dosage = rep(c("100mg", "150mg"), each=4),  
                           visit = 1:4, visit2 = (1:4)^2)
point_est_df$glmm_est <- predict(glmm.ri, 
                                 newdata = point_est_df, 
                                 type = "response",  # probabilities 
                                 re.form = NA) # no random effects
point_est_df$glmm_logodds <- predict(glmm.ri, 
                                 newdata = point_est_df, 
                                 type = "link",  # log odds 
                                 re.form = NA) # no random effects
ctcw$pred_prob <- predict(glmm.ri, 
                          newdata = ctcw, 
                          type = "response",  # probabilities
                          re.form = NULL)  # all random effects 
ctcw$pred_logodds <- predict(glmm.ri, 
                          newdata = ctcw, 
                          type = "link",  # log odds
                          re.form = NULL)  # all random effects 

prob.plot <- ggplot(data = point_est_df, 
       aes(x=visit, y=glmm_est, group=Dosage, col=Dosage)) +
  geom_line(size = 1.5) + 
  geom_point(size = 2.5) +
  geom_line(data = ctcw, 
            aes(x=visit, y=pred_prob, group=id, col=Dosage), 
            size=0.25, linetype = 2) + 
  theme_classic(base_size = 16) + 
  ylab("Probability of Amenorrhea") + xlab("Visit Number")  + 
  theme(legend.position="bottom") + 
  scale_color_manual(values = pal2use2)
logodds.plot <- ggplot(data = point_est_df, 
       aes(x=visit, y=glmm_logodds, group=Dosage, col=Dosage)) +
  geom_line(size = 1.5) + 
  geom_point(size = 2.5) +
  geom_line(data = ctcw, 
            aes(x=visit, y=pred_logodds, group=id, col=Dosage), 
            size=0.25, linetype = 2) + 
  theme_classic(base_size = 16) + 
  ylab("Log Odds of Amenorrhea") + xlab("Visit Number")  + 
  theme(legend.position="bottom") + 
  scale_color_manual(values = pal2use2)
grid.arrange(logodds.plot, prob.plot, nrow=1)

Notice that since the only subject-specific element is a random intercept, the lines are parallel on a log-odds scale (within each dose group; they’re still not straight because of the quadratic time term) but not on a probability scale.

Model Results

Since more than one term corresponds to time, inference on the time coefficients (while statistically reasonable) is difficult to interpret. For a more interpretable test, we’ll examine differences between doses at visit 4:

## Estimated ratio at visit 4 
avg_odds12 <- glht(glmm.ri, "4*visit:dose + 16*dose:visit2 = 0")
summary(avg_odds12)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glmer(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + 
##     (1 | id), data = ctcw, family = "binomial", nAGQ = 10)
## 
## Linear Hypotheses:
##                                        Estimate Std. Error z value Pr(>|z|)  
## 4 * visit:dose + 16 * dose:visit2 == 0    0.504      0.241    2.09    0.036 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
exp(0.504)   # exponentiate to get odds ratio 
## [1] 1.66
confint(avg_odds12)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: glmer(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + 
##     (1 | id), data = ctcw, family = "binomial", nAGQ = 10)
## 
## Quantile = 1.96
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                                        Estimate lwr    upr   
## 4 * visit:dose + 16 * dose:visit2 == 0 0.5043   0.0322 0.9764
paste(round(exp(0.032), 3), 
      round(exp(0.976), 3), 
      sep=", ")  # exponentiate to get odds ratios 
## [1] "1.033, 2.654"
## Wald test 
# Full model: visit, visit^2, and full interactions with dose 
glmm.ri.1 <- glmer(amenorrhea ~ visit + visit:dose + visit2 + 
                     visit2:dose + (1 | id), 
                 data = ctcw, nAGQ = 10, family = "binomial")
# Reduced model: visit, visit^2, no dose 
glmm.ri.0 <- glmer(amenorrhea ~ visit + visit2 + (1 | id), 
                 data = ctcw, nAGQ = 10, family = "binomial")
# Likelihood ratio test: 
anova(glmm.ri.1, glmm.ri.0) 
## Data: ctcw
## Models:
## glmm.ri.0: amenorrhea ~ visit + visit2 + (1 | id)
## glmm.ri.1: amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 | id)
##           npar  AIC  BIC logLik deviance Chisq Df Pr(>Chisq)   
## glmm.ri.0    4 3890 3914  -1941     3882                       
## glmm.ri.1    6 3881 3918  -1935     3869  12.7  2     0.0018 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On average, the odds of amenorrhea at 12 months (Visit 4) for a woman who took the high dose is 1.66 times higher than a woman with the same baseline risk who took the low dose (95% CI: 1.03-2.65). Conducting a likelihood ratio test (\(H_0: \beta_3 = \beta_4 = 0\)), we find a statistically significant effect of dose (p=0.002).

GLMM vs. GEE for Amenorrhea (slide 50)

Comparing GEE to GLMM trajectories, we can see that the marginal and conditional associations differ (as expected given the binary outcome). Also as expected, the marginal (GEE) estimates are less steep – attenuated – compared to the conditional (GLMM) estimates.

m_exc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
                id = id, data = ctcw, family = binomial, waves = visit,
                corstr = "exchangeable")
point_est_df$gee_est <- predict(m_exc, newdata = point_est_df, 
                                 type = "response")

point_est_df %>% 
  pivot_longer(c(glmm_est, gee_est), 
               names_to = "Model", 
               values_to = "Estimate") %>% 
  mutate(Model = ifelse(Model == "glmm_est", "GLMM (RI)", "GEE (Exch)")) %>% 
  ggplot(aes(x=visit, y=Estimate, group=interaction(Model, Dosage), 
             col=Dosage, linetype=Model)) +
  geom_line() + 
  geom_point()  +
  theme_classic(base_size = 16) + 
  ylab("Predicted P(Amenorrhea)") + xlab("Visit Number")  + 
  theme(legend.position="bottom") + 
  scale_color_manual(values = pal2use2) +
  guides(col=guide_legend(nrow=2,byrow=TRUE), 
         linetype=guide_legend(nrow=2, byrow=TRUE))

  • GEE: marginal estimates (population-averaged odds ratios)
  • GLMM: conditional estimates (within-subject odds ratios)
  • Marginal is relative to conditional
  • Note: unlike LMM, random intercept in GLMM does not induce exchangeable correlation
    • So these models do not assume the same correlation structure

Missing Data Methods

1. Complete Case Analysis

We’ll first run a complete case analysis, where we only include subjects who attended every visit:

# IDS where at least one visit is missing
ids_miss <- ctcw$id[is.na(ctcw$amenorrhea)]

# Create dataset for "complete case" analysis
ctcw_complete <- ctcw[!ctcw$id %in% ids_miss, ]

m_cc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
               id = id, data = ctcw, family = binomial, 
               waves = visit, corstr = "exchangeable")

summary(m_cc)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.2370  0.1765 160.64  < 2e-16 ***
## visit         0.6967  0.1586  19.31  1.1e-05 ***
## visit2       -0.0328  0.0320   1.05   0.3055    
## visit:dose    0.3284  0.1100   8.91   0.0028 ** 
## dose:visit2  -0.0637  0.0286   4.97   0.0259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0287
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.363  0.0243
## Number of clusters:   1151  Maximum cluster size: 4

2. Available Data Analysis

Next, we’ll run an available data analysis, where we include all available observations on all subjects (for example: if a person misses the 4th visit only they would be excluded from a complete case analysis, and their responses from the first 3 visits would be included in an available data analysis). This is the default of geeglm(), so we have already seen these results earlier:

ctcw_available <- ctcw[!is.na(ctcw$amenorrhea), ]
m_ad <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
               id = id, data = ctcw_available, family = binomial, 
               waves = visit, corstr = "exchangeable")
summary(m_ad)
## 
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     family = binomial, data = ctcw_available, id = id, waves = visit, 
##     corstr = "exchangeable")
## 
##  Coefficients:
##             Estimate Std.err   Wald Pr(>|W|)    
## (Intercept)  -2.2370  0.1765 160.64  < 2e-16 ***
## visit         0.6967  0.1586  19.31  1.1e-05 ***
## visit2       -0.0328  0.0320   1.05   0.3055    
## visit:dose    0.3284  0.1100   8.91   0.0028 ** 
## dose:visit2  -0.0637  0.0286   4.97   0.0259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation structure = exchangeable 
## Estimated Scale Parameters:
## 
##             Estimate Std.err
## (Intercept)        1  0.0287
##   Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.363  0.0243
## Number of clusters:   1151  Maximum cluster size: 4

3. WGEE

Finally, we’ll use IPW-GEE where our model for dropout is: \[\log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \theta_0 + \theta_1 \text{Dose}_{i} + \theta_2 t_{ij} + \theta_3 Y_{i,j-1} + \theta_4 \text{Dose}_i \times Y_{i,j-1}\] where \(\pi_{ij} = \text{Pr}(R_{ij} = 1|, R_{i,j-1} = 1, \mathbf{X}_i, Y_{i1},\dots, Y_{i,j-1})\)

This is specified through the mismodel argument in the wgee() function.

library(wgeesel)
# Create indicator for dropout. R = 1 if observed; R = 0 if missing
ctcw$R <- 1 
ctcw$R[is.na(ctcw$amenorrhea)] <- 0
# Our dropout model includes the response at the previous time point as potentially being informative of dropout
ctcw$lag1y <- ylag(ctcw$id, ctcw$amenorrhea, 1)
head(ctcw)
##   id dose visit amenorrhea Dosage visit2 pred_prob pred_logodds R lag1y
## 1  1    0     1          0  100mg      1    0.0492       -2.961 1    NA
## 2  1    0     2         NA  100mg      4    0.1241       -1.954 0     0
## 3  1    0     3         NA  100mg      9    0.2628       -1.031 0    NA
## 4  1    0     4         NA  100mg     16    0.4522       -0.192 0    NA
## 5  2    0     1          0  100mg      1    0.0492       -2.961 1    NA
## 6  2    0     2         NA  100mg      4    0.1241       -1.954 0     0
m_ipw <- wgee(model = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
              data = ctcw, id = ctcw$id,
              family="binomial", corstr ="exchangeable",
              scale = NULL, mismodel = R ~ dose + visit + lag1y*dose)

summary(m_ipw)
## Call:
## wgee(model = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose, 
##     data = ctcw, id = ctcw$id, family = "binomial", corstr = "exchangeable", 
##     scale = NULL, mismodel = R ~ dose + visit + lag1y * dose)
## 
##             Estimates Robust SE z value Pr(>|z|)    
## (Intercept)   -2.2562    0.1776  -12.71   <2e-16 ***
## visit          0.7067    0.1602    4.41    1e-05 ***
## visit2        -0.0325    0.0322   -1.01   0.3136    
## visit:dose     0.3394    0.1124    3.02   0.0025 ** 
## dose:visit2   -0.0688    0.0289   -2.38   0.0174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Estimated Scale Parameter:  1 
## 
##  Estimated Correlation:  0.382

We can take a look at the results for the dropout model:

summary(m_ipw$mis_fit)
## 
## Call:
## glm(formula = mismodel, family = binomial(), data = data[adjusted_idx, 
##     ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.209   0.441   0.544   0.603   0.793  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9460     0.2014    4.70  2.7e-06 ***
## dose          0.0681     0.1313    0.52   0.6040    
## visit         0.3338     0.0682    4.90  9.7e-07 ***
## lag1y        -0.4453     0.1618   -2.75   0.0059 ** 
## dose:lag1y   -0.2421     0.2193   -1.10   0.2696    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2459.3  on 2901  degrees of freedom
## Residual deviance: 2417.1  on 2897  degrees of freedom
## AIC: 2427
## 
## Number of Fisher Scoring iterations: 4

It’s a good idea to check the distribution of estimated weights for presence of large weights (these could then have large influence on the results):

# Histogram of estimated weights where response is observed
hist(m_ipw$weight[ctcw$R==1], 
     xlab="Estimated IPW Weights", 
     main="Estimated Weights where Response is Observed")

The estimated weights range from 1 to 2.06, so there isn’t a concern in this analysis. If the dropout model is correctly specified, WGEE would be valid for data that is MAR.