Marginalization of Autocorrelation Model using Posterior Predictive Distributions

I am dealing with a 1000-patient clinical trial with daily measurements of an ordinal outcome on days 1, 2, …, 28 after randomization where the outcome has between 4 and 100 levels. I’ve spent a lot of time trying to understand how to model an autocorrelation structure on the linear predictor scale using regular time spacing AR(1) and general continuous time with an Ornstein-Uhlenbeck process. That would allow me to directly obtain a marginally interpretable treatment effect (e.g., odds ratio in a proportional odds model). The data exhibits extremely strong autocorrelation that is difficult to capture. This has led me to consider direct modeling using a first-order Markov process:

\Pr(Y_t \geq y | X, Y_{t-1}) = \text{expit}(\alpha_y + X \beta + g(Y_{t-1}))

where g(Y) is a general function and may have multiple parameters, e.g., to allow state transition probabilities to be customized for every level of Y_{t-1} by having indicators for all but the first possible level and having regression coefficients for each of those. One of the baseline covariates X is treatment, and I need to get marginal treatment effects. For example I would want to compute the posterior distribution of the absolute risk difference for \Pr(Y_{28} \geq y | X) for treatment B vs. treatment A. for given covariate settings.

Because of the large number of distinct Y values along the way from t=1, 2, \dots, 27 the computation of the unconditional distribution is messy. It is probably easier to get it through a posterior predictive distribution where a very large (say 40,000) patients are simulated after say 4000 posterior draws are available for the autoregressive model parameters. I’d appreciate feedback on the proposed workflow:

  1. Take one of the posterior draws
  2. Conditional on X and the initial state Y_0 simulate from this posterior draw a 1000 \times 28 matrix of Y values (see below for details) under treatment A
  3. For the same posterior draw and X (except for treatment) and Y_0 simulate a matrix for treatment B
  4. Compute any derived quantity from these two matrices, e.g., absolute reduction in \Pr(Y_{28} > 4) or the difference in number of days t it takes to satisfy Y(t) \leq 6 or some other cutoff.
  5. Repeat this same process using the same posterior draw 4 more times to get 5 \times 2 = 10 (2 treatments) datasets
  6. Move on to the next posterior draw

To simulate each dataset, take a single initial state Y_0 and covariate settings X, compute the cell probabilities for all possible values of Y and simulate a single Y(1) from the multinomial distribution. Then for this Y(1) as the new state get the conditional probability distribution of Y(2) and simulate a single Y(2). Use this to simulate a single value of Y(3), etc.

Is this a valid procedure for computing unconditional (on previous Y) quantities? Is there a more efficient way? Thanks for any thoughts.

Addendum: I figured out how to get exact unconditional probabilities fairly easily when computing them for a single covariate combination, using simple matrix operations sequentially through time. I have R code for that if anyone needs it. But the simulation approach is still a good exercise and will allow one to make inference about complex quantities.


That is IMHO reasonable. It is definitely almost exactly the thing I did when evaluating the effects in Markov model in my recent work and I couldn’t figure out anything better. Few changes I would suggest:

  • I think it is more useful to not actually compute Y, but only the probabilities P(Y) of observing each Y value at each time - any quantity of interest can be computed easily from this distribution and you avoid some issues (like impossibility to convert to log-odds if some event has low probability and thus is never observed in your particular Y). Maybe this is what you discuss in the adendum? (I don’t think I undestood the addendum completely)
  • I don’t see how step 5 is useful - if you have enough posterior draws than IMHO computing just one simulation per draw is enough. (the consideration would be IMHO different if draws were particularly expensive and you would have much smaller number of posterior draws).

The obvious problem is that with this algorithm you get the effect estimate for exactly the distribution of patient covariates you have observed - which might (and might not) be desirable. If I remember correctly, I think @stablemarkets advised in his StanCon talk to also do a “Bayesian bootstrap”, i.e. reweight the patient-level differences in each sample by a sample drawn from Dirichlet(1,1,1,...,1). This way you generalize beyond the set of patients you have actually observed to the hypothethical population your patients were sampled from (hope I am not misrepresenting the argument). This should IMHO be quite similar to resampling the patients with replacement in each sample, but I am not 100% sure about this. I would however be surprised if this bootstrap made a notable difference when you have 1000 patients.

The problem actually seems similar to the model I used for some Covid data (described at: Fitting HMMs with time-varying transition matrices using brms: A prototype)

Best of luck with your model!


Thanks for the comments. Concerning step 5, I’m not seeing how sufficient precision can be obtained without it. But your other comment of stopping with the posterior probabilities and computing derived quantities from them is key… I have R code that does this to compute highest posterior density intervals of differences (between treatments) in state occupancy probabilities as a function of time. No simulation needed, other than the original posterior draws.

I’m starting to read your post that you linked to. Looks extremely valuable to me.

I presume you already know this, but just my attempt to clear up my thoughts on the topic. I am unfortunately unable to give a rigorous mathematical argument, so a rough intuition instead. Repeatedly simulating observed data can IMHO be used in two ways:

  1. All the simulated datasets from the single posterior sample are pooled to create a single effect estimate. This (partially) removes the sampling uncertainty and can be seen as approximating working with the marginal state probabilities directly. I think that as long as you are concerned with quantities that can be calculated from the marginal probabilities of states, it is better to just use that. The only case where this IMHO would be useful is for quantities that can’t be expressed that way (e.g. the proportion of time steps where Y changes from its previous value; average change between t = 14 and t =28; empirical between-patient correlations)

  2. Each of the simulated datasets is treated separately and produces a separate sample of the effect estimate. This produces a larger set of posterior samples, but with some additional internal correlations. It can IMHO be seen as guarding against some “important” or “outlier” sample exerting too big/too small influence on the resulting estimate. This should IMHO be unnecessary if you have enough independent posterior samples from a well behaved model - for any potential “important” or “outlier” sample, there should be multiple other samples with similar properties that would make the observation noise average out reasonably well with just one simulation per sample. I think this supersampling can be useful if you have very few posterior samples or if the observation model and/or the expectation you are trying to compute is somehow weirdly behaved, but it still feels more like a hacky trick to avoid solving the actual problem (getting more samples or figuring out an analytic solution instead of simulating from the observational model).

Or am I misunderstanding something and your use is actually different?

1 Like

Thanks for the insights. That is very helpful Martin. Would you mind checking my understanding? Suppose that we wanted to compute the probability of a compound event for a fixed set of covariate values except for one: we vary treatment from A to B. The compound event might be for example the probability that a patient will be alive on day 28 given that she was on a ventilator on day 14.

  • Simulation of data from a single posterior draw can be for any “new” sample size so we might as well simulate a very large sample size. For treatment A and other covariates X we simulate 50,000 patients’ full vector of states at all times, and likewise for treatment B. We compute the B-A difference or ratio of the quantity of interest, e.g. a difference in proportions. Save this single number.
  • Repeat for each of the other posterior draws.
  • With 4000 original posterior draws we have 4000 realizations of the quantity of interest, e.g. B-A difference of P(alive on day 28 | on ventilator at day 14)
  • The posterior distribution of the quantity of interest is estimated from these 4000 points.

Is this a good strategy?

I think that if you can’t directly analytically compute the probability from a single sample (which might be difficult for a composite event of the type you mention) than simulating a large population and computing the probability in this large population is in my view a very sensible approach. But I admit I am not an expert on this and would ask @bbbales2 for a second opinion.

1 Like