Bayesian Repeated Measures Logistic Regression

I am having trouble running what I thought was a simple Bayesian repeated measures logistic regression. I can get the model to run, but the diagnostics are terrible. I was hoping I could get some tips on how to improve the model.

Here is the data. There are two groups, not trained and trained, each measured twice, at baseline (pre-intervention) and follow-up (post-intervention). The outcome is a binary variable, incorrect (=0) vs correct (=1). I’m interested in whether prior training results in greater odds of being correct at follow-up.

# Here is the data
d <- data.frame(id = 1:94,
                training = c(0,1,1,0,0,0,0,1,0,1,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,1,0,0,0,0,1,0,1),
                baseline = c(1,1,1,1,1,0,1,1,1,1,0,1,0,1,1,1,1,1,0,1,1,1,0,1,0,1,0,0,0,1,1,1,1,0,0,1,1,1,1,0,1,1,0,1,0,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,1,1,1,1,1,1,1,1,0),
                followup = c(1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

Now let’s look at the proportions correct in each cell - baseline/not trained, baseline/trained, followup/not trained, followup/trained.

library(dplyr)
d %>% group_by(training) %>%
      summarise(countBase = sum(baseline),
                countFU = sum(followup),
                tot = n()) %>%
      mutate(percBase = round(100*countBase/tot,2), percFU = round(100*countFU/tot,2)) %>%
      transform(training = ifelse(training == 0, "not trained", "trained"))

# output
#      training countBase countFU tot percBase percFU
# 1 not trained        40      62  65    61.54  95.38
# 2     trained        26      28  29    89.66  96.55

Looks like training is associated with greater odds of being correct at baseline, before the intervention, but that this difference all but disappears at follow-up (unsurprising, since the proportion correct at baseline in the trained group was already so high that they had almost nowhere to go at follow-up).
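To put rough numbers on that, here is a quick sketch of the cell odds and the trained vs not-trained odds ratio at each time point (same data, just rearranged; the values in the comment are approximate):

# odds of a correct response in each cell
oddsTab <- d %>% group_by(training) %>%
                 summarise(oddsBase = sum(baseline == 1) / sum(baseline == 0),
                           oddsFU   = sum(followup == 1) / sum(followup == 0))
oddsTab

# trained vs not-trained odds ratios: roughly 5.4 at baseline, but only about 1.4 at follow-up
oddsTab[2, -1] / oddsTab[1, -1]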

Let’s reshape to a long data frame for the repeated measures logistic regression.

library(tidyr)

longDF <- d %>% gather(key = time, value = correct, baseline, followup) %>%
                mutate(id = factor(id),
                       training = factor(ifelse(training == 0, "not trained", "trained"), levels = c("not trained", "trained")),
                       correct = factor(ifelse(correct == 0, "incorrect", "correct"), levels = c("incorrect", "correct")),
                       time = factor(time))
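
As a quick sanity check (just a sketch), the long data frame should have 2 x 94 = 188 rows, and a cross-tabulation should reproduce the counts in the summary above:

nrow(longDF)
xtabs(~ training + time + correct, data = longDF)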

Now for the model. I am using a Bernoulli logit model which estimates a grand mean (parameter ‘a’) plus main effects of training and time, the training x time interaction, and an effect for each subject. I have specified hyper-priors on the standard deviations of the main-effect, interaction, and subject-level deflections from the grand mean. For these standard deviations I chose normal distributions truncated at 0 with sd = 2 (because this is on a log-odds scale; no idea if this was the right thing to do or not).

library(rstan)

# put the data into a list
dList <- list(N = nrow(longDF),
              nID = nlevels(longDF$id),
              nTrain = nlevels(longDF$training),
              nTime = nlevels(longDF$time),
              sIndex = as.integer(longDF$id),
              trainingIndex = as.integer(longDF$training),
              timeIndex = as.integer(longDF$time),
              correct = as.integer(longDF$correct)-1)

# now for the model string
mString <- "
  data{
    int<lower=1> N;                             
    int<lower=1> nID;
    int<lower=1> nTrain;
    int<lower=1> nTime;
    int<lower=1,upper=nID> sIndex [N];
    int<lower=1,upper=nTrain> trainingIndex [N];
    int<lower=1,upper=nTime> timeIndex [N];
    int<lower=0,upper=1> correct [N];
  }
  
  parameters{
    real a;
    vector[nTrain] bTraining;     // main effect of training
    vector[nTime] bTime;          // main effect of time
    vector[nTime] bTxT[nTrain];   // array of vectors: one interaction deflection per training x time cell
    vector[nID] bSubj;            // subject-level effects
    real<lower=0> sigma_train;    // sd of training main-effect deflections
    real<lower=0> sigma_time;     // sd of time main-effect deflections
    real<lower=0> sigma_TxT;      // sd of interaction deflections
    real<lower=0> sigma_s;        // sd of subject deflections
  }
  
  model{
    vector[N] p;
    
    //hyper-priors
    sigma_train ~ normal(0,2);
    sigma_time ~ normal(0,2);
    sigma_TxT ~ normal(0,2);
    sigma_s ~ normal(0,2);
    
    // priors
    a ~ normal(0,2);
    bSubj ~ normal(0, sigma_s);
    bTraining ~ normal(0, sigma_train);
    bTime ~ normal(0, sigma_time);
    for (j in 1:nTrain) { bTxT[j] ~ normal(0, sigma_TxT); }
    
    //likelihood
    for (i in 1:N) {
      correct ~ bernoulli_logit(a + bTraining[trainingIndex[i]] + bTime[timeIndex[i]] + bTxT[trainingIndex[i]][timeIndex[i]] + bSubj[sIndex[i]]);
    }
}
"

# run the model
twoByTwoLogit <- stan(model_code = mString,
                      data = dList,
                      iter = 2e3,
                      warmup = 1e3,
                      cores = 3,
                      chains = 3)

Here is the printout of the model coefficients as well as the chain diagnostics:

print(twoByTwoLogit, pars = c("a", "bTraining", "bTime", "bTxT"), probs = c(0.025, 0.975))

# output

Inference for Stan model: 4d5f505643ec1a63680be017d6d7bcf4.
3 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=3000.

              mean se_mean   sd  2.5% 97.5% n_eff Rhat
a             1.43    0.07 0.79 -0.65  2.80   133 1.02
bTraining[1]  0.15    0.06 0.69 -0.84  2.47   141 1.01
bTraining[2]  0.14    0.04 0.62 -0.85  1.95   201 1.01
bTime[1]      0.03    0.06 0.52 -1.13  1.27    79 1.07
bTime[2]     -0.01    0.07 0.51 -1.20  1.22    56 1.08
bTxT[1,1]    -0.03    0.02 0.20 -0.61  0.32   102 1.03
bTxT[1,2]     0.01    0.02 0.21 -0.49  0.38    90 1.03
bTxT[2,1]    -0.01    0.01 0.18 -0.47  0.42   187 1.01
bTxT[2,2]     0.03    0.02 0.19 -0.32  0.45   133 1.00

Samples were drawn using NUTS(diag_e) at Tue Feb 25 10:48:35 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

As you can see, the effective sample sizes are really low and the Rhats are terrible.
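
For reference, a couple of other standard rstan checks on the fit (a sketch; both helpers ship with rstan):

# check for divergences, max-treedepth hits, and low E-BFMI
check_hmc_diagnostics(twoByTwoLogit)

# eyeball how well the grand mean and training effects mix across the 3 chains
traceplot(twoByTwoLogit, pars = c("a", "bTraining"))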

Can anybody tell me what I’m doing wrong and what I could do better?


At a quick glance, the issue could be this line:

    for (i in 1:N) {
      correct ~ bernoulli_logit(a + bTraining[trainingIndex[i]] + bTime[timeIndex[i]] + bTxT[trainingIndex[i]][timeIndex[i]] + bSubj[sIndex[i]]);
    }

You don’t have an index on correct, so you’re essentially specifying N different posterior distributions for the entire correct array, once per loop iteration. Inside the loop it should be correct[i] ~ bernoulli_logit(...).


Doh! Elementary mistake! Thank you @andrjohns, the model ran in about 1/100th the time and the output now reads:

Inference for Stan model: 8443b782742f7ca8b2db5bdef22fa73b.
3 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=3000.

              mean se_mean   sd  2.5% 97.5% n_eff Rhat
a             1.48    0.06 1.48 -1.62  4.20   644 1.01
bTraining[1]  0.03    0.04 1.15 -2.21  2.67   800 1.00
bTraining[2]  0.79    0.05 1.31 -1.22  4.11   746 1.00
bTime[1]     -0.08    0.05 1.41 -2.85  3.20   700 1.01
bTime[2]      1.34    0.08 1.60 -1.28  5.09   400 1.01
bTxT[1,1]    -0.75    0.06 1.20 -3.23  1.58   450 1.01
bTxT[1,2]     0.80    0.06 1.32 -1.41  3.99   445 1.00
bTxT[2,1]     0.42    0.05 1.22 -1.92  3.15   590 1.01
bTxT[2,2]     0.63    0.07 1.42 -1.68  4.22   386 1.00

One day I hope that these stupid mistakes will be mitigated by instinct and experience, but I may have to wait a while. Thank you again.


Haha don’t worry, it’s happened to me more than once!


Just chiming in with a couple suggestions:

You should do some prior predictive checks to help decide the scale of these priors if you don’t have a good intuition off the bat. For example, I’ve seen even normal(0,1) criticized as too broad and implying rather unrealistic data in some circumstances.
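
For instance, here is a rough sketch of what such a check might look like for the priors above (variable names are just for illustration, and the subject-level term is ignored for simplicity):

# draw parameters from the priors and look at the implied probability of a correct response
set.seed(1)
nSims <- 1e4
a           <- rnorm(nSims, 0, 2)
sigma_train <- abs(rnorm(nSims, 0, 2))   # half-normal(0, 2) hyper-priors
sigma_time  <- abs(rnorm(nSims, 0, 2))
sigma_TxT   <- abs(rnorm(nSims, 0, 2))
bTrain      <- rnorm(nSims, 0, sigma_train)
bTime       <- rnorm(nSims, 0, sigma_time)
bTxT        <- rnorm(nSims, 0, sigma_TxT)
pCorrect    <- plogis(a + bTrain + bTime + bTxT)
hist(pCorrect)   # most of the mass piling up near 0 and 1 would suggest the priors are too broad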

Note also that your model presently does not attempt inference on potential correlations in the manifestation of effects across subjects. If you want to see what it would look like to add inference on the correlations too, a lecture is here that walks through a dataset like yours and models it hierarchically. Don’t worry that you only have one observation per participant per within-subject condition; for binomial outcomes, it’s fine to model things hierarchically in that circumstance.
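
To make that concrete, here is a minimal sketch of one way to do it, assuming the brms package (which is not used anywhere in this thread): by-subject intercepts and time effects with their correlation estimated, and the same normal(0, 2) scale on the population-level effects:

library(brms)

longDF$correct01 <- as.integer(longDF$correct) - 1   # make the 0/1 outcome coding explicit

# (1 + time | id) gives each subject an intercept and a time effect,
# and estimates the correlation between them
corrFit <- brm(correct01 ~ training * time + (1 + time | id),
               family = bernoulli(),
               prior = prior(normal(0, 2), class = "b"),
               data = longDF)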
