Use posterior predictive distribution in another model

Hello, I have individual-level predictions produced by a non-parametric Bayesian model. Specifically, I have, for each observation, a posterior distribution of the individual relative risk associated with a simulated change in a predictor (relative individual treatment effect, rITE).

My goal is to find subgroups in the rITE (I use a tree on the median posterior values) and then to evaluate the posterior distribution of the rITE in each subgroup. My problem is that these groups can be small, and I'd like to regularize the estimates by modeling them as random effects. That is, in a brms-like formula: rITE ~ (1|group) with a lognormal model.

My question is: how do I model this, considering that I have a distribution for each observation? I'd like to use the information in the whole distribution without reducing it to, for example, a mean and standard deviation.

Thanks


You have at least three options:

  1. Re-fit the original model and the new model jointly. If this is feasible, it is the cleanest solution.
  2. Obtain a parametric approximation to the posterior from the first model, and fit the second model as a hierarchical model wherein the true values are sampled from this parametric approximation. Note that it is important that the parametric approximation adequately capture the joint posterior, and independent approximations of the marginal posteriors for each quantity of interest may or may not be an adequate joint approximation.
  3. Iteratively fit the second model using as data many independent draws from the posterior of the first model. Combine the posterior chains from these many fits for inference (see the sketch just below this list). Unlike the previous two options, this option does not allow the second model to inform the posterior expectations from the first model, which can be a feature or a bug depending on your exact use case. One challenge of this method is that different parameterizations of the second model may be necessary to yield accurate computation over different posterior iterations of the first model. Thus, if you're unlucky you might need to use multiple different parameterizations of the second model (ensuring that they encode exactly equivalent models). And if you're really unlucky, you might find that some realizations from the first model's posterior are difficult or impossible to reparametrize in such a way as to yield correct computation. If that happens, you can't safely proceed.
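To make option 3 concrete, here is a minimal R sketch of the workflow, assuming rstan; the file names first_model.stan and second_model.stan, the data lists, the parameter name z, and the number of propagated draws S are all placeholders, not anything specific to this thread:

library(rstan)

# Fit the first model once and extract posterior draws of the quantity
# that will be treated as data by the second model.
fit1 <- stan(file = "first_model.stan", data = data1)
z_draws <- extract(fit1, pars = "z")$z  # iterations x n matrix of draws

S <- 200                         # number of draws to propagate (placeholder)
idx <- sample(nrow(z_draws), S)  # pick S roughly independent draws
draws_list <- vector("list", S)

for (s in 1:S) {
  # Treat a single posterior draw from model 1 as fixed data for model 2.
  data2 <- list(n = ncol(z_draws), z = z_draws[idx[s], ])
  fit2 <- stan(file = "second_model.stan", data = data2,
               chains = 1, refresh = 0)
  draws_list[[s]] <- as.data.frame(fit2)
}

# Concatenate the S posteriors into a single set of draws for inference.
combined <- do.call(rbind, draws_list)

Each inner fit is an ordinary Stan run, so the usual diagnostics apply per fit; as noted above, some draws may need a different (but equivalent) parameterization of the second model to sample well.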

I realize this is a slightly older post, but I have a question regarding option #3, and this discussion in general. I have a similar problem: a factor analysis model that produces 'z' latent factor scores, and I would like to use these factor scores as the outcome variable in a separate linear model where I can contrast the factor scores between two groups. You will see in the following code that the 'second linear model' is commented out:

Stan code for factor analysis with 'second' linear model commented out.

//
data {
  int<lower = 0> m; // dimension of observed data 
  int<lower = 0> k; // number of latent factors
  int<lower = 0> n; // number of participants
  int<lower = 0> n_obs; // number of observations
  //int<lower = 1, upper = 2> group[n]; //group variable
  vector[m] y[n];
  
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;

  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
  
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L; // hyperparameter for loading matrix elements
  //vector[2] beta;
 // real<lower = 0> sigma_beta;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  matrix[n,k] z; // factor scores
  
  
  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m){
      for (j in (i + 1):k){
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k){
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m){
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }
  
  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
  
  //z scores
  for (i in 1:n) {
    for (j in 1:k) {
      z[i, j] = (1 / sigma_L) * L[, j]' * inverse(Sigma) * y[i];
    }
  }
}

model {
  // priors
  //vector[n] z1;
  //beta ~ normal(0,1);
  //sigma_beta ~ exponential(1);
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 1);
  sigma_y ~ normal(0, 1);
  
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }
  
  // likelihood for factor model
  y ~ multi_normal_cholesky(zeros,L_Sigma);
  
  // second linear model
  //for (i in 1:n) {
   // z1[i] = z[i,1];
  //}
  
  //z1 ~ normal(beta[group], sigma_beta);

}
  

When I try to run the model jointly, the parameter estimates across the board are altered and don't seem to recover my simulated data very well. The model also runs very slowly and typically has convergence problems. I'm wondering if this is because the addition of the linear model infuses too much variance into the factor analysis parameters (?). However, the third option stated in the above comment is (applied to my case) to pull the z latent factor scores out and model them as outcome data in a simple linear model. I just can't figure out how to code this while preserving the uncertainty by keeping the posterior draws. Any help here would be great. Thank you in advance!

In general I'd caution that option 3 usually isn't a great idea unless you really know what you're doing. I'll elaborate on that a bit more at the end, but first I'll answer your question:

You would select one posterior draw from your factor analysis and run your downstream model, saving those draws as draws1. Then you'd select a different posterior draw from your factor analysis and run your downstream model, yielding draws2. Do this a few hundred times, and then concatenate draws1, draws2, …, drawsN into a single posterior for inference. The intuition behind this is roughly:

  • We know that to get inference on any derived quantity (i.e. a function of parameters), we just have to compute that quantity iteration-wise over the posterior draws.
  • So imagine the function of interest is the sample correlation of the parameter vector with some external data vector, or the slope of the OLS fit. We can just run OLS over the posterior, and get the posterior distribution of that correlation or slope.
  • So now imagine that the function is more complicated. It's a mapping from a parameter vector to a function, and that function is the PDF corresponding to the posterior distribution in a Bayesian regression. Again, we can get the posterior distribution for this function by computing its output iteration-wise over the posterior. And if we want, we could choose to summarize this posterior distribution over functions to a distribution over \mathbb{R}^n by taking the average value of the output everywhere in the functions' domain.
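In symbols, writing z^{(s)} for the s-th posterior draw from the first model and \theta for the second model's parameters, concatenating the S downstream fits approximates the mixture

p(\theta \mid y) \approx \frac{1}{S} \sum_{s=1}^{S} p\!\left(\theta \mid z^{(s)}, y\right),

i.e. the second model's posterior averaged over the first model's uncertainty, with no feedback from the second model to the first.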

The subtlety is that this posterior distribution is not the same as the posterior that you get from fitting the joint model, and so deciding to use it requires a good reason for deviating from the logically consistent, fully Bayesian joint model. If you think that the joint model is well specified, then you want samples from the joint posterior. If you don't feel like the joint model is well specified, you might consider trying to figure out how to write down a well-specified joint model. It's awkward to admit that the joint model is misspecified but then use an inexact approximation to that joint model anyway.

I feel like the main(?) reason people have for doing this is when they are quite confident in the specification of the first model but not confident in the specification of the second model. In this setting, fitting jointly could introduce too much flexibility for the second model to inform the parameter estimates from the first, thus yielding a joint model in which the functional form of the relationship encoded by the second model spuriously appears to be a really good match to data, particularly when you "zoom in" and just look at the apparent goodness of fit of the second model to the latent factors. This situation is most severe when the first model is well specified but yields a highly uncertain posterior, and the second model contains a shot-in-the-dark guess about the functional form.

In your case, I think there's another potential reason for running the model this way, which is that latent factor models are notoriously finicky and it's easy to imagine that the regression structure creates minor modes and other pitfalls that are prohibitively challenging for sampling. In this case, one option that would be more consistent with generative Bayesian modeling but that should avoid these issues is #2 above: get a parametric approximation to the posterior from the factor model, and use that in the joint model. For example, suppose the latent factor scores from just the factor model have a posterior that is well approximated by a multivariate normal. Then you take the joint model, and replace the entire latent factor part with this multivariate normal, which shouldn't have the same problems with nasty minor modes cropping up in the joint model.
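For what that might look like here, below is a minimal Stan sketch of that plug-in approach. It assumes you have already fit the factor-only model and computed, per participant, the posterior mean vector (z_hat) and covariance matrix (Sigma_z) of their factor scores; those names, and the choice of independent per-participant normal approximations, are my assumptions, not something established above:

data {
  int<lower = 0> n;                   // number of participants
  int<lower = 1, upper = 2> group[n]; // group variable
  vector[2] z_hat[n];                 // posterior means of the factor scores (factor-only fit)
  cov_matrix[2] Sigma_z[n];           // posterior covariances of the factor scores
}

parameters {
  vector[2] z_true[n];             // latent "true" factor scores
  matrix[2, 2] beta;               // group-specific means for the two factors
  vector<lower = 0>[2] sigma_beta;
  corr_matrix[2] omega;
}

model {
  to_vector(beta) ~ normal(0, 1);
  sigma_beta ~ exponential(1);
  omega ~ lkj_corr(4);

  for (i in 1:n) {
    // measurement model: the normal approximation to the factor-score
    // posterior stands in for the whole factor-analysis block
    z_hat[i] ~ multi_normal(z_true[i], Sigma_z[i]);
    // structural model: group contrast on the latent scores
    z_true[i] ~ multi_normal(beta[group[i]]', quad_form_diag(omega, sigma_beta));
  }
}

Whether independent per-participant normal approximations adequately capture the factor-score posterior is exactly the caveat raised under option 2 above, so it's worth checking them against the factor-only fit.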


Thanks, this is a great response with some really useful info in it. I definitely have a better idea of how I could iterate over the posterior in a second model (this process is similar to how Stan imputes missing data, no?). I was wary of deviating from the full joint model for the reasons you specified, and it looks like my intuitions were correct; I take your point about potential model misspecification. I think the ugly truth here is that in my field (immunology, and specifically modelling cytokine biomarkers in the blood), it would be a miracle if anything we are modelling is actually linear. The toy data simulation used on the current model, though, was recovered incredibly well until I introduced the secondary model, which I guess points to an indexed-intercept linear model as a poor choice (the first latent factor is not well predicted by knowledge of group membership, and it was actually like that by design). I was just surprised at how much it upset the factor model parameters; your answer about this is very helpful. The real-world problem I'm trying to solve pertains to seeing if 1) a latent construct of inflammation, via measurement of several biomarkers (because it is a complex, coordinated process), is more pronounced in a group of people with brain injury, and/or 2) the state of the immune system itself differs in nature between these groups. For the latter, I need to create latent variables for both groups separately and then compare the LV structure; but for the former, where I expect the structure to be similar, just 'stronger' in one group, I can't seem to get away from a simple model where I compare the factor scores of the 1st LV between groups. Would this justify the risky choice of pulling the second linear model out of the joint distribution and modelling it separately? Once again, thank you so much for your time, it has been really insightful.

Sorry for the double reply, but I've continued to work on the suggestions given above and have tried to fit a better joint model. I took out the second simple linear model and tried to fit a multivariate normal model to predict both latent variables (z) given group membership (beta[group]). The output is still strange, but strange in a (potentially) interesting way: the covariance structure of the predictors, their relative loadings on the LVs compared to the raw simulated data, and the correlation between the raw simulated latent factor scores and the estimated scores all check out. However, the estimates are super inflated for the factor model, and super deflated (z) for the second linear model. I've played with the priors, but can't get it to change. Maybe as some of the estimates shrink, others inflate to balance the outcome estimates in the posterior (?); I'm just not sure why, and whether it's fixable. Other than this, I don't see how I can replace the LF model with a multivariate normal (I tried a bit, but it was hopeless; nothing really made sense or approximated the LF posterior). As for iterating the draws, am I looping this in R (like for (i in 1:n_draws) { run stan model and replace the data with another draw })? Or should I be loading a 3D array into Stan and looping over the iterations in Stan? Sorry, this is probably a dumb question; I'm just relatively new to raw Stan coding and am still working my way around the syntax.

Here's the code I'm referring to:

//
data {
  int<lower = 0> m; // dimension of observed data (# of biomarkers)
  int<lower = 0> k; // number of latent factors
  int<lower = 0> n; // number of participants
  int<lower = 0> n_obs; // number of observations
  int<lower = 1, upper = 2> group[n]; //group variable
  vector[m] y[n];
  
}

transformed data {
  int<lower = 1> p; // number of nonzero lower triangular factor loadings
  vector[m] zeros;

  zeros = rep_vector(0, m);
  p = k * (m - k) + k * (k - 1) / 2;
  
}

parameters {
  vector<lower = 0>[k] beta_diag;
  vector[p] beta_lower_tri;
  vector<lower = 0>[m] sigma_y; // residual error
  real<lower = 0> sigma_L; // hyperparameter for loading matrix elements
  matrix[2,2] beta;
  vector<lower = 0>[2] sigma_beta;
  corr_matrix[k] omega;
}

transformed parameters {
  matrix[m, k] L;
  cov_matrix[m] Sigma;
  matrix[m, m] L_Sigma;
  matrix[n,k] z; // factor scores
  
  
  { // temporary scope to define loading matrix L
    int idx = 0;
    for (i in 1:m){
      for (j in (i + 1):k){
        L[i, j] = 0; // constrain upper tri elements to zero
      }
    }
    for (j in 1:k){
      L[j, j] = beta_diag[j];
      for (i in (j + 1):m){
        idx = idx + 1;
        L[i, j] = beta_lower_tri[idx];
      }
    }
  }
  
  Sigma = multiply_lower_tri_self_transpose(L);
  for (i in 1:m) Sigma[i, i] = Sigma[i, i] + sigma_y[i];
  L_Sigma = cholesky_decompose(Sigma);
  
  //z scores
  for (i in 1:n) {
    for (j in 1:k) {
      z[i, j] = (1 / sigma_L) * L[, j]' * inverse(Sigma) * y[i];
    }
  }
}

model {
  // priors for model 2
  to_vector(beta) ~ normal(mean(z),1);
  sigma_beta ~ exponential(1);
  omega ~ lkj_corr(4);
  
  //priors for model 1
  beta_lower_tri ~ normal(0, sigma_L);
  sigma_L ~ normal(0, 1);
  sigma_y ~ normal(0, 1);
  
  // priors for diagonal entries (Leung and Drton 2016)
  for (i in 1:k) {
    target += (k - i) * log(beta_diag[i]) - .5 * beta_diag[i] ^ 2 / sigma_L;
  }
  
  // likelihood for factor model
  y ~ multi_normal_cholesky(zeros,L_Sigma);
  
  // second linear model
  for (i in 1:n) {
    z[i, 1:2] ~ multi_normal(beta[group[i], ], quad_form_diag(omega,sigma_beta));
  }
}