Bayesian modelling for pseudo-observations in clustered data

Hello,

I am working on a project where I need to evaluate the effect of covariates on the restricted mean survival time (RMST) difference in a clustered dataset.

I have been reading the work by Andersen et al. (e.g., Andersen, Hansen, & Klein, 2004; Andersen & Pohar Perme, 2010) on using pseudo-observations. The frequentist approach involves calculating a jackknife-based pseudo-observation for the RMST for each individual (\hat{\mu}_{\tau i}) and then using this value as the outcome variable in a GLM with GEE to account for correlation.
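In code, that frequentist pipeline looks roughly like this (just a sketch; I am using the pseudo and geepack packages, and the data frame and column names below are placeholders, not my actual script):

library(pseudo)   # pseudomean() computes jackknife pseudo-observations for the RMST
library(geepack)  # geeglm() fits the GLM with GEE / sandwich standard errors

tau <- 365  # restriction time (placeholder)
dat$pseudo_obs <- pseudomean(time = dat$time, event = dat$status, tmax = tau)

fit_gee <- geeglm(pseudo_obs ~ treatment + other_covariates,
                  id = cluster_id, data = dat,
                  family = gaussian(link = "identity"),
                  corstr = "exchangeable")
summary(fit_gee)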

My goal is to perform this analysis in a Bayesian framework, preferably using brms or Stan. My initial plan was to compute the RMST pseudo-observations and fit a multilevel Bayesian GLM, something like:

brm(pseudo_obs ~ 1 + treatment + other_covariates + (1 | cluster_id), data = dat, family = gaussian())

However, this has led me to a few critical questions, and I am unsure if this is a valid approach.

  1. The Andersen papers often describe an identity link for the RMST, which suggests a Gaussian likelihood. However, as those papers note, pseudo-observations for the RMST are not strictly bounded to [0, τ] (e.g., they can be negative or greater than τ). Is a Gaussian likelihood a reasonable assumption here? If not, what alternatives would be appropriate?

  2. My main concern is that pseudo-observations are not independent, as they are derived from a jackknife procedure on the full dataset. The frequentist method addresses this with GEE and a sandwich estimator. My proposed brms model only accounts for the cluster-level correlation within the dataset, not the correlation induced by the jackknife calculation itself. For this analysis, I plan to model the pseudo-observation at a single time point. Given this, is it statistically valid to disregard the non-independence from the jackknife procedure and fit a standard multilevel model? Or does this non-independence make the pseudo-observation approach fundamentally unsuited for a standard Bayesian GLM?

  3. Is there a more direct Bayesian alternative, e.g. modelling the survival times themselves and deriving the RMST difference from the posterior?

Has anyone had success fitting a model with pseudo-observations as the outcome in Stan?
If so, how did you address the correlation issue?

Thank you for any guidance or insights you can provide!

Hi, and welcome to the Stan Discourse.

I don’t understand what the issue is with the pseudo-observations not being bounded: the Gaussian likelihood has support over all negative and positive values. Whether that likelihood is a reasonable assumption depends more on the actual data, e.g. heteroscedasticity or anything else that departs significantly from a Gaussian distribution.
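If you do fit the Gaussian model you sketched, a posterior predictive check will tell you more than any general argument; something like this (assuming fit is your fitted brm object):

pp_check(fit, ndraws = 100)                # simulated vs. observed pseudo-observations
pp_check(fit, type = "stat", stat = "sd")  # does the model reproduce the spread of the data?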

If you can write down the full model/likelihood of a frequentist problem, it is always possible to reproduce it exactly as a Bayesian problem by keeping flat priors, or as a mildly different (and usually better-behaved) one with vaguely informative priors. After inference, you can use model selection or other Bayesian criteria instead of frequentist significance approaches.
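For instance, brms puts flat priors on the regression coefficients by default, so the default fit is essentially the frequentist likelihood; adding vaguely informative priors is a couple of lines (the scales below are arbitrary, just to illustrate):

priors <- c(set_prior("normal(0, 5)", class = "b"),
            set_prior("student_t(3, 0, 25)", class = "sigma"))
fit <- brm(pseudo_obs ~ 1 + treatment + other_covariates + (1 | cluster_id),
           data = dat, family = gaussian(), prior = priors)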

If instead you decide to formulate an entirely or somewhat different model from the frequentist version, then the sky is the limit: you can account (or not) for any potential correlations you want to include, or assume they are not that relevant. More specifically, and I may be off here because I haven’t used pseudo-observations in real-world projects, my general impression is that once you accept pseudo-observations from resampling methods, you treat them as independent samples without further need to account for their correlation with the original sample.

It’s always possible to try to model every possible effect from the system itself, the data collection process, sample treatment, etc., but at some point we have to accept that no model will capture every possible known (much less unknown) nuance of reality.

So, in summary:

  1. You can simply reproduce the frequentist model and use Bayesian alternatives to frequentist tests post-inference;
  2. You can change the model and try to account for more than the original model, it will probably take more work and may or may not be overkill;
  3. In principle there shouldn’t have to be, but you can try to be creative, e.g. using priors to put some constraints on the expected effects of resampling, like assuming some mean value won’t be uniformly distributed but rather closer to the mean of the original data, or something like this (see the sketch below).
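For point 3, a minimal sketch of what I mean (the numbers are placeholders, not a recommendation):

# Centre the intercept prior near the observed mean of the pseudo-observations
# instead of leaving it flat; here 20 stands in for mean(dat$pseudo_obs).
priors_shrink <- c(set_prior("normal(20, 5)", class = "Intercept"),
                   set_prior("normal(0, 5)", class = "b"))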

Hi Caetano-sensei,

Thank you for the helpful discussion.
It’s clarified my thinking, and I now have a clear path forward with the direct Bayesian survival model.
I appreciate you taking the time to respond.

You’re very welcome. If you have additional questions, please follow up with them. Sometimes it takes a little while until someone with experience with the model (problem, issue, Stan interface, etc.) gets to look at the post and give you the answer you need.