Our primary objective for this case study is to go through a full analyses and missing data approaches with a binary outcome variable (amenorrhea).
library(tidyverse)
library(mice)
library(knitr)
library(kableExtra)
library(geepack)
library(multcomp) # for glht()
library(wgeesel) # for IPW-GEE
library(gridExtra)
library(nlme) # lme()
library(lme4) # glmer()
pal2use <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")
pal2use2 <- c("#E69F00", "#0072B2")
These data are from a longitudinal clinical trial of contracepting women. In this trial women received an injection of either 100 mg or 150 mg of depot-medroxyprogesterone acetate (DMPA) on the day of randomization and three additional injections at 90-day intervals. There was a final follow-up visit 90 days after the fourth injection, i.e., one year after the first injection. Throughout the study each woman completed a menstrual diary that recorded any vaginal bleeding pattern disturbances. The diary data were used to determine whether a women experienced amenorrhea, the absence of menstrual bleeding for a specified number of days.
A total of 1151 women completed the menstrual diaries and the diary data were used to generate a binary sequence for each woman according to whether or not she had experienced amenorrhea in the four successive three month intervals. In clinical trials of modern hormonal contraceptives, pregnancy is exceedingly rare (and would be regarded as a failure of the contraceptive method), and is not the main outcome of interest in this study. Instead, the outcome of interest is a binary response indicating whether a woman experienced amenorrhea in the four successive three month intervals.
A feature of this clinical trial is that there was substantial dropout. More than one third of the women dropped out before the completion of the trial.
Machin D, Farley T, Busca B, Campbell M and d’Arcangues C. (1988). Assessing changes in vaginal bleeding patterns in contracepting women. Contraception, 38, 165-179.
# Clinical trial of contracepting women (Machin 1988, ALA p415)
ctcw <- read.csv("../Datasets/amenorrhea.csv",
na.strings = ".") %>%
mutate(Dosage = ifelse(dose == 1, "150mg", "100mg"))
head(ctcw)
## id dose visit amenorrhea Dosage
## 1 1 0 1 0 100mg
## 2 1 0 2 NA 100mg
## 3 1 0 3 NA 100mg
## 4 1 0 4 NA 100mg
## 5 2 0 1 0 100mg
## 6 2 0 2 NA 100mg
The dataset contains measurements on 1151 women and contains the following columns:
For this dataset, let’s begin by looking at the missing data pattern, since we have a hint above that it’s important. From the plot below, it is clear that missing data are due to dropout: there are no non-monotone missing data patterns in the plot. That is, if Visit 2 is missing, then visits 3 and 4 are also missing. The figure shows that for 198/1151 women (17.2%), visits 2-4 are all missing; for 155/1151 women (13.5%), visits 3-4 are missing; and 84/1151 (7.3%) are missing only visit 4. That leaves 714/1151 (62% of women) with complete data.
ctcw_wide <- ctcw %>%
pivot_wider(names_from = visit, values_from = amenorrhea)
# Missing data patterns (package mice)
mdpatt <- md.pattern(ctcw_wide)
Such a high level of dropout is quite concerning! Stay tuned for the Special Topics portion of Day 2 for more about this. In the meantime, we’re going to forge ahead.
The outcome of amenorrhea is binary, so a spaghetti plot (or any visualization of individual trajectories) won’t get us very far, but we can look at prevalence by time:
# prevalence by visit number and dosage
prev_by_time <- ctcw %>%
group_by(visit, Dosage) %>%
summarize(n = n(), prevalence = mean(amenorrhea, na.rm = TRUE))
ggplot(prev_by_time, aes(x=visit, y=prevalence,
group=factor(Dosage),
color=factor(Dosage))) +
geom_line() +
geom_point() +
theme_classic(base_size = 13) +
labs(color = "Dose") +
scale_color_manual(values = pal2use2) +
xlab("Time Point (quarterly visits)") + ylab("Prevalence of Amenorrhea") +
ggtitle("Prevalence of Amenorrhea by Dosage and Time")
It looks like amenorrhea is more likely at the higher dose across all time points, and prevalence also increases as subjects are on the treatment for a longer period of time. The increase in risk may be slightly higher at the higher dose.
We will use the following model:
\[\log\left(\frac{\mu_{ij}}{1-\mu_{ij}} \right) = \beta_0 + \beta_1\text t_{ij} + \beta_2 t_{ij}^2 + \beta_3\text{Dose}_i\times t_{ij} + \beta_4\text{Dose}_i\times t_{ij}^2\] where \(\mu_{ij} = \text{Pr}(Y_{ij}=1)\)
First, we’ll create a new column in the dataset for \(t_{ij}^2\):
ctcw$visit2 <- ctcw$visit^2
Before we go ahead and fit our model, it’s a good idea to make sure
the data are ordered by id
and time
(geeglm()
assumes the data are grouped by id):
ctcw <- ctcw[order(ctcw$id, ctcw$visit),]
We’ll experiment with various working correlation structures and
compare to using glm()
. As a reminder with
glm()
, the point estimates will match those from using GEE
with working independence, but the standard errors are based off
assuming all observations are independent, which we would not expect to
be valid in the setting of a longitudinal study (we expect dependence
between measurements taken on the same individual):
m_ind <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial, waves = visit,
corstr = "independence")
m_exc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial, waves = visit,
corstr = "exchangeable")
m_ar1 <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial, waves = visit,
corstr = "ar1")
m_uns <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial, waves = visit,
corstr = "unstructured")
m_glm <- glm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
data = ctcw, family = binomial)
In the above, you’ll notice the use of the waves
argument. This is important when using working AR1 and unstructured if
you have intermittent missingness where the rows may not correspond to
sequential visits starting from the first one. For example, if you have
someone attend visit 1, 3, and 4 (i.e., misses visit 2) without the
waves = visit
argument, geeglm()
will treat
the observations as corresponding to the first, second, and third visit
when estimating the working correlation structure. This ends up not
being necessary in this example– we do not have intermittent missingness
(we only have dropout) and we ordered by time before fitting the
models.
Results from working independence:
summary(m_ind)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "independence")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.19554 0.17845 151.374 < 2e-16 ***
## visit 0.66983 0.16225 17.045 3.65e-05 ***
## visit2 -0.03031 0.03276 0.856 0.35490
## visit:dose 0.29730 0.11345 6.867 0.00878 **
## dose:visit2 -0.06243 0.02963 4.439 0.03512 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = independence
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1.001 0.02894
## Number of clusters: 1151 Maximum cluster size: 4
Results from working exchangeable:
summary(m_exc)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.2370 0.1765 160.64 < 2e-16 ***
## visit 0.6967 0.1586 19.31 1.1e-05 ***
## visit2 -0.0328 0.0320 1.05 0.3055
## visit:dose 0.3284 0.1100 8.91 0.0028 **
## dose:visit2 -0.0637 0.0286 4.97 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.0287
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.363 0.0243
## Number of clusters: 1151 Maximum cluster size: 4
Results from working AR1:
summary(m_ar1)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "ar1")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.2485 0.1759 163.33 < 2e-16 ***
## visit 0.7069 0.1578 20.07 7.5e-06 ***
## visit2 -0.0330 0.0318 1.08 0.2990
## visit:dose 0.3633 0.1107 10.76 0.0010 **
## dose:visit2 -0.0790 0.0288 7.50 0.0062 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = ar1
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 0.997 0.0284
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.493 0.0252
## Number of clusters: 1151 Maximum cluster size: 4
Results from working unstructured:
summary(m_uns)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "unstructured")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.2408 0.1765 161.19 <2e-16 ***
## visit 0.6979 0.1581 19.49 1e-05 ***
## visit2 -0.0314 0.0318 0.98 0.3233
## visit:dose 0.3380 0.1097 9.50 0.0021 **
## dose:visit2 -0.0690 0.0284 5.90 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = unstructured
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.0285
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha.1:2 0.344 0.0367
## alpha.1:3 0.259 0.0359
## alpha.1:4 0.273 0.0341
## alpha.2:3 0.426 0.0358
## alpha.2:4 0.386 0.0345
## alpha.3:4 0.499 0.0361
## Number of clusters: 1151 Maximum cluster size: 4
Suppose we want to estimate the population odds ratio at 12 months (i.e. \(\text{visit} = 4\) and \(\text{visit}^2 = 16\)). We can use the following code:
pop_odds12 <- glht(m_exc, "4*visit:dose + 16*dose:visit2 = 0")
summary(pop_odds12)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## 4 * visit:dose + 16 * dose:visit2 == 0 0.295 0.144 2.05 0.041 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(pop_odds12)
##
## Simultaneous Confidence Intervals
##
## Fit: geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
##
## Quantile = 1.96
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 4 * visit:dose + 16 * dose:visit2 == 0 0.2949 0.0125 0.5773
We will address two conditional questions using GLMM:
We’ll use the same fixed effects/systematic component of the model as we used above for the GEE, but add random intercepts. (Note that we are not going to fit a random intercepts and random slopes model in this case, for reasons discussed in the lecture: there simply isn’t enough information in such a short sequence of observations to estimate subject-specific changes over time in the probability of amenorrhea.) Here’s our model: \[\log\left(\frac{\mu_{ij}}{1-\mu_{ij}} \right) = \beta_0 + \beta_1\text t_{ij} + \beta_2 t_{ij}^2 + \beta_3\text{Dose}_i\times t_{ij} + \beta_4\text{Dose}_i\times t_{ij}^2 + b_i\] where \(\mu_{ij} = \text{Pr}(Y_{ij}=1)\). The nAGQ argument (only for random intercept models) may be increased from the default of 1 to improve the accuracy of the estimates, at the cost of some extra computation time. The documentation recommends increasing to up to 25, although at some point the estimates no longer change and increasing nAGQ further does not provide additional improvement.
glmm.ri <- glmer(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 | id),
data = ctcw, nAGQ = 10, family = "binomial")
summary(glmm.ri)
## Generalized linear mixed model fit by maximum likelihood (Adaptive
## Gauss-Hermite Quadrature, nAGQ = 10) [glmerMod]
## Family: binomial ( logit )
## Formula: amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 | id)
## Data: ctcw
##
## AIC BIC logLik deviance df.resid
## 3881 3918 -1934 3869 3610
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.917 -0.430 -0.227 0.471 4.008
##
## Random effects:
## Groups Name Variance Std.Dev.
## id (Intercept) 5.05 2.25
## Number of obs: 3616, groups: id, 1151
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8035 0.3047 -12.48 < 2e-16 ***
## visit 1.1326 0.2681 4.22 2.4e-05 ***
## visit2 -0.0419 0.0548 -0.76 0.4448
## visit:dose 0.5642 0.1921 2.94 0.0033 **
## dose:visit2 -0.1095 0.0496 -2.21 0.0272 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) visit visit2 vst:ds
## visit -0.832
## visit2 0.710 -0.967
## visit:dose -0.052 -0.345 0.434
## dose:visit2 0.041 0.333 -0.456 -0.953
The fitted model is: \[\text{logit}(P(\text{Amenorrhea}_i | \text{Dose}_i, \text{Visit}_{ij}, b_i)) = -3.80 + 1.13 \times \text{Visit}_{ij}\] \[ -0.04 \times \text{Visit}_{ij}^2 + 0.56 \times \text{Dose}_i \times \text{Visit}_{ij} - 0.11 \times \text{Dose}_{i} \times \text{Visit}_{ij}^2 + b_i\]
Let’s plot the estimated probability by subject and the average of all subject-specific trajectories (the average conditional mean):
## Plot point estimates
point_est_df <- data.frame(dose = rep(c(0,1), each = 4),
Dosage = rep(c("100mg", "150mg"), each=4),
visit = 1:4, visit2 = (1:4)^2)
point_est_df$glmm_est <- predict(glmm.ri,
newdata = point_est_df,
type = "response", # probabilities
re.form = NA) # no random effects
point_est_df$glmm_logodds <- predict(glmm.ri,
newdata = point_est_df,
type = "link", # log odds
re.form = NA) # no random effects
ctcw$pred_prob <- predict(glmm.ri,
newdata = ctcw,
type = "response", # probabilities
re.form = NULL) # all random effects
ctcw$pred_logodds <- predict(glmm.ri,
newdata = ctcw,
type = "link", # log odds
re.form = NULL) # all random effects
prob.plot <- ggplot(data = point_est_df,
aes(x=visit, y=glmm_est, group=Dosage, col=Dosage)) +
geom_line(size = 1.5) +
geom_point(size = 2.5) +
geom_line(data = ctcw,
aes(x=visit, y=pred_prob, group=id, col=Dosage),
size=0.25, linetype = 2) +
theme_classic(base_size = 16) +
ylab("Probability of Amenorrhea") + xlab("Visit Number") +
theme(legend.position="bottom") +
scale_color_manual(values = pal2use2)
logodds.plot <- ggplot(data = point_est_df,
aes(x=visit, y=glmm_logodds, group=Dosage, col=Dosage)) +
geom_line(size = 1.5) +
geom_point(size = 2.5) +
geom_line(data = ctcw,
aes(x=visit, y=pred_logodds, group=id, col=Dosage),
size=0.25, linetype = 2) +
theme_classic(base_size = 16) +
ylab("Log Odds of Amenorrhea") + xlab("Visit Number") +
theme(legend.position="bottom") +
scale_color_manual(values = pal2use2)
grid.arrange(logodds.plot, prob.plot, nrow=1)
Notice that since the only subject-specific element is a random intercept, the lines are parallel on a log-odds scale (within each dose group; they’re still not straight because of the quadratic time term) but not on a probability scale.
Since more than one term corresponds to time, inference on the time coefficients (while statistically reasonable) is difficult to interpret. For a more interpretable test, we’ll examine differences between doses at visit 4:
## Estimated ratio at visit 4
avg_odds12 <- glht(glmm.ri, "4*visit:dose + 16*dose:visit2 = 0")
summary(avg_odds12)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: glmer(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose +
## (1 | id), data = ctcw, family = "binomial", nAGQ = 10)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## 4 * visit:dose + 16 * dose:visit2 == 0 0.504 0.241 2.09 0.036 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
exp(0.504) # exponentiate to get odds ratio
## [1] 1.66
confint(avg_odds12)
##
## Simultaneous Confidence Intervals
##
## Fit: glmer(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose +
## (1 | id), data = ctcw, family = "binomial", nAGQ = 10)
##
## Quantile = 1.96
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 4 * visit:dose + 16 * dose:visit2 == 0 0.5043 0.0322 0.9764
paste(round(exp(0.032), 3),
round(exp(0.976), 3),
sep=", ") # exponentiate to get odds ratios
## [1] "1.033, 2.654"
## Wald test
# Full model: visit, visit^2, and full interactions with dose
glmm.ri.1 <- glmer(amenorrhea ~ visit + visit:dose + visit2 +
visit2:dose + (1 | id),
data = ctcw, nAGQ = 10, family = "binomial")
# Reduced model: visit, visit^2, no dose
glmm.ri.0 <- glmer(amenorrhea ~ visit + visit2 + (1 | id),
data = ctcw, nAGQ = 10, family = "binomial")
# Likelihood ratio test:
anova(glmm.ri.1, glmm.ri.0)
## Data: ctcw
## Models:
## glmm.ri.0: amenorrhea ~ visit + visit2 + (1 | id)
## glmm.ri.1: amenorrhea ~ visit + visit:dose + visit2 + visit2:dose + (1 | id)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## glmm.ri.0 4 3890 3914 -1941 3882
## glmm.ri.1 6 3881 3918 -1935 3869 12.7 2 0.0018 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
On average, the odds of amenorrhea at 12 months (Visit 4) for a woman who took the high dose is 1.66 times higher than a woman with the same baseline risk who took the low dose (95% CI: 1.03-2.65). Conducting a likelihood ratio test (\(H_0: \beta_3 = \beta_4 = 0\)), we find a statistically significant effect of dose (p=0.002).
GLMM vs. GEE for Amenorrhea (slide 50)
Comparing GEE to GLMM trajectories, we can see that the marginal and conditional associations differ (as expected given the binary outcome). Also as expected, the marginal (GEE) estimates are less steep – attenuated – compared to the conditional (GLMM) estimates.
m_exc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial, waves = visit,
corstr = "exchangeable")
point_est_df$gee_est <- predict(m_exc, newdata = point_est_df,
type = "response")
point_est_df %>%
pivot_longer(c(glmm_est, gee_est),
names_to = "Model",
values_to = "Estimate") %>%
mutate(Model = ifelse(Model == "glmm_est", "GLMM (RI)", "GEE (Exch)")) %>%
ggplot(aes(x=visit, y=Estimate, group=interaction(Model, Dosage),
col=Dosage, linetype=Model)) +
geom_line() +
geom_point() +
theme_classic(base_size = 16) +
ylab("Predicted P(Amenorrhea)") + xlab("Visit Number") +
theme(legend.position="bottom") +
scale_color_manual(values = pal2use2) +
guides(col=guide_legend(nrow=2,byrow=TRUE),
linetype=guide_legend(nrow=2, byrow=TRUE))
We’ll first run a complete case analysis, where we only include subjects who attended every visit:
# IDS where at least one visit is missing
ids_miss <- ctcw$id[is.na(ctcw$amenorrhea)]
# Create dataset for "complete case" analysis
ctcw_complete <- ctcw[!ctcw$id %in% ids_miss, ]
m_cc <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw, family = binomial,
waves = visit, corstr = "exchangeable")
summary(m_cc)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw, id = id, waves = visit, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.2370 0.1765 160.64 < 2e-16 ***
## visit 0.6967 0.1586 19.31 1.1e-05 ***
## visit2 -0.0328 0.0320 1.05 0.3055
## visit:dose 0.3284 0.1100 8.91 0.0028 **
## dose:visit2 -0.0637 0.0286 4.97 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.0287
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.363 0.0243
## Number of clusters: 1151 Maximum cluster size: 4
Next, we’ll run an available data analysis, where we include all
available observations on all subjects (for example: if a person misses
the 4th visit only they would be excluded from a complete case analysis,
and their responses from the first 3 visits would be included in an
available data analysis). This is the default of geeglm()
,
so we have already seen these results earlier:
ctcw_available <- ctcw[!is.na(ctcw$amenorrhea), ]
m_ad <- geeglm(amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
id = id, data = ctcw_available, family = binomial,
waves = visit, corstr = "exchangeable")
summary(m_ad)
##
## Call:
## geeglm(formula = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## family = binomial, data = ctcw_available, id = id, waves = visit,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -2.2370 0.1765 160.64 < 2e-16 ***
## visit 0.6967 0.1586 19.31 1.1e-05 ***
## visit2 -0.0328 0.0320 1.05 0.3055
## visit:dose 0.3284 0.1100 8.91 0.0028 **
## dose:visit2 -0.0637 0.0286 4.97 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation structure = exchangeable
## Estimated Scale Parameters:
##
## Estimate Std.err
## (Intercept) 1 0.0287
## Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.363 0.0243
## Number of clusters: 1151 Maximum cluster size: 4
Finally, we’ll use IPW-GEE where our model for dropout is: \[\log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \theta_0 + \theta_1 \text{Dose}_{i} + \theta_2 t_{ij} + \theta_3 Y_{i,j-1} + \theta_4 \text{Dose}_i \times Y_{i,j-1}\] where \(\pi_{ij} = \text{Pr}(R_{ij} = 1|, R_{i,j-1} = 1, \mathbf{X}_i, Y_{i1},\dots, Y_{i,j-1})\)
This is specified through the mismodel
argument in the
wgee()
function.
library(wgeesel)
# Create indicator for dropout. R = 1 if observed; R = 0 if missing
ctcw$R <- 1
ctcw$R[is.na(ctcw$amenorrhea)] <- 0
# Our dropout model includes the response at the previous time point as potentially being informative of dropout
ctcw$lag1y <- ylag(ctcw$id, ctcw$amenorrhea, 1)
head(ctcw)
## id dose visit amenorrhea Dosage visit2 pred_prob pred_logodds R lag1y
## 1 1 0 1 0 100mg 1 0.0492 -2.961 1 NA
## 2 1 0 2 NA 100mg 4 0.1241 -1.954 0 0
## 3 1 0 3 NA 100mg 9 0.2628 -1.031 0 NA
## 4 1 0 4 NA 100mg 16 0.4522 -0.192 0 NA
## 5 2 0 1 0 100mg 1 0.0492 -2.961 1 NA
## 6 2 0 2 NA 100mg 4 0.1241 -1.954 0 0
m_ipw <- wgee(model = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
data = ctcw, id = ctcw$id,
family="binomial", corstr ="exchangeable",
scale = NULL, mismodel = R ~ dose + visit + lag1y*dose)
summary(m_ipw)
## Call:
## wgee(model = amenorrhea ~ visit + visit:dose + visit2 + visit2:dose,
## data = ctcw, id = ctcw$id, family = "binomial", corstr = "exchangeable",
## scale = NULL, mismodel = R ~ dose + visit + lag1y * dose)
##
## Estimates Robust SE z value Pr(>|z|)
## (Intercept) -2.2562 0.1776 -12.71 <2e-16 ***
## visit 0.7067 0.1602 4.41 1e-05 ***
## visit2 -0.0325 0.0322 -1.01 0.3136
## visit:dose 0.3394 0.1124 3.02 0.0025 **
## dose:visit2 -0.0688 0.0289 -2.38 0.0174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameter: 1
##
## Estimated Correlation: 0.382
We can take a look at the results for the dropout model:
summary(m_ipw$mis_fit)
##
## Call:
## glm(formula = mismodel, family = binomial(), data = data[adjusted_idx,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.209 0.441 0.544 0.603 0.793
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9460 0.2014 4.70 2.7e-06 ***
## dose 0.0681 0.1313 0.52 0.6040
## visit 0.3338 0.0682 4.90 9.7e-07 ***
## lag1y -0.4453 0.1618 -2.75 0.0059 **
## dose:lag1y -0.2421 0.2193 -1.10 0.2696
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2459.3 on 2901 degrees of freedom
## Residual deviance: 2417.1 on 2897 degrees of freedom
## AIC: 2427
##
## Number of Fisher Scoring iterations: 4
It’s a good idea to check the distribution of estimated weights for presence of large weights (these could then have large influence on the results):
# Histogram of estimated weights where response is observed
hist(m_ipw$weight[ctcw$R==1],
xlab="Estimated IPW Weights",
main="Estimated Weights where Response is Observed")
The estimated weights range from 1 to 2.06, so there isn’t a concern in this analysis. If the dropout model is correctly specified, WGEE would be valid for data that is MAR.