Posterior predictions in rstan - unintended computations and artifacts

I am trying to generate posterior predictions with new data using a separate Stan file containing only data and generated quantities blocks. I am using the UCB gender bias example from McElreath's Lecture 9 (2022) video. (Versions of this example also appear in the text, section 11.1.4.) I generated the posterior logit probabilities: 8000 samples for each of the 12 combinations of 2 genders and 6 departments. The model is

Y_{gid,\, dept\_id} \sim \text{Bernoulli}\left(\text{logit}^{-1}\left(a_{gid,\, dept\_id}\right)\right)

with the matrix a containing the posterior samples of the logits, one column per gender/department combination.
I am generating probabilities of acceptance for each of 4526 "applicants". So this is merely taking each applicant's gender and department and computing the inv_logit of the 8000 samples in the column of a corresponding to that department/gender combination. After compiling the code below, I call sampling() with 1 chain and 1 iteration, so there should be no MCMC and no "sampling" per se. But it takes several minutes rather than seconds and generates a 9.5 GB file, and my machine bogs down when I try to access the samples.

data {
  int gid[4526];       // gender id, 0/1, for each applicant
  int dept_id[4526];   // department id, 1,2,...,6, for each applicant
  matrix[8000, 12] a;  // matrix of posterior logit samples for each dept/gender
}
parameters {
}
model {
}
generated quantities {
  matrix[8000, 4526] y_pred;
  for (n in 1:4526) {
    for (i in 1:8000) {
      y_pred[i, n] = inv_logit(a[i, 2 * dept_id[n] - gid[n]]);  // "male" for department j is column 2j - 1
    }
  }
}

I sampled in R using

sampling(..., chains = 1,
         iter = 1,
         algorithm = 'Fixed_param')
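
For reference, the full call is along these lines, where sm stands for my compiled model object and stan_data for the data list above:

fit  <- sampling(sm, data = stan_data, chains = 1, iter = 1,
                 algorithm = 'Fixed_param')
post <- rstan::extract(fit)   # post$y_pred: a 1 x 8000 x 4526 array (~36 million values)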

What I am trying to do should not be this computationally or data storage intensive. Is there something I am missing regarding the computations going on here?
Thanks

Mike

In fact, this is what the link() function in the rethinking R package does. link() takes about a second to do this, so either there is an inefficiency in my code or it is doing something unintended.
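
For comparison, the same computation can be done in one vectorized step in base R; a rough sketch, assuming a, gid, and dept_id are the same objects passed to Stan above:

col_idx <- 2 * dept_id - gid      # "male" (gid = 1) for department j is column 2j - 1
y_pred  <- plogis(a[, col_idx])   # plogis() is base R's inverse logit; result is 8000 x 4526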

You say that it shouldn’t be data storage intensive, but you’re iterating and storing 8000 x 4526 = 36,208,000 estimates, which is far from trivial. Are these dimensions correct/necessary?
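For a rough sense of scale: 36,208,000 doubles is about 290 MB of raw numbers alone, and rstan stores every element of y_pred as a separate named parameter (y_pred[1,1], y_pred[2,1], ...), so the per-parameter names and bookkeeping add substantial overhead on top of each 8-byte value. That would go a long way toward explaining a 9.5 GB file.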

For posterior predictive checks for this model, check out this example with brms: 11 God Spiked the Integers | Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition.

Thanks for the reference. That is the example in 11.1.4, though I did not see the counterfactual analysis in the book, so it may not be there; it was in the video. Essentially I am trying to reverse-engineer what the link() function does, in R or in Stan. The Kurz reference is essentially that for brms. The idea was to use the posterior samples from the model fit to the actual data to generate stratified/counterfactual predictions: in this case, pretending all 4526 applicants were male, with the number of applicants to each department equal to the total number of applicants to that department in the actual data. Hence the 4526 predictions. I'll certainly check the Kurz reference for guidance.
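
Concretely, the counterfactual inputs would be something like this, where dept_obs stands for the observed department of each of the 4526 applicants:

gid     <- rep(1L, 4526)   # pretend every applicant is male
dept_id <- dept_obs        # each department keeps its observed number of applicants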

For your actual example, the link() function is faster than your implementation because it's not operating on all of the posterior samples the way you are. The link() function generates predictions at each iteration for the given model specification.

This means that you only need to iterate over the posterior distributions of the coefficients, generating your model predictions by specifying the covariate value of interest (i.e., male). It's also in this same step that you would apply the inverse-logit function.
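
In your example, for instance, the inverse logit only needs to be applied once to the 12 distinct dept/gender columns; the per-applicant step is then just a column lookup. A minimal sketch in R, assuming the 8000 x 12 matrix a from above:

p      <- plogis(a)              # one inverse logit per draw and dept/gender combo: 8000 x 12
y_pred <- p[, 2 * dept_id - 1]   # counterfactual "all male": column 2j - 1 for department j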

Thanks @andrjohns for the insights! I am a little unclear, though, on how iterating "over the posterior distributions for the coefficients" would not involve iterating over each of the posterior samples. Maybe I just need to reread it carefully a few times?

Ah, I see what you mean; I missed the counterfactual aspect! This section has a counterfactual example that might be helpful: 11 God Spiked the Integers | Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition

Thanks again, will check that out!
