I am trying to generate posterior predictions with new data using a separate Stan file containing just data and generated quantities blocks. I am using the UCB gender bias example from McElreath's Lecture 9 (2022) video. (Versions of this example also appear in the text, section 11.1.4.) I generated the posterior logit probabilities: 8000 samples for each of the 12 combinations of 2 genders and 6 departments. The model is
Y_{gid,\,dept\_id} \sim \text{Bernoulli}(p_{gid,\,dept\_id}), \qquad \text{logit}(p_{gid,\,dept\_id}) = a(gid, dept\_id)
with the matrix a(i, j) containing the posterior samples of the logit probabilities.
I am generating probabilities of acceptance for each of 4526 "applicants". So this is merely taking each applicant's gender and department and computing the inv_logit of the 8000 samples in the column of a corresponding to that dept/gender combination. After compiling the code below, I call sampling() with 1 chain and 1 iteration, so there should be no MCMC and no "sampling" per se. But it takes several minutes rather than seconds and generates a 9.5 GB file. My machine bogs down when I try to access the samples.
data {
  int<lower=0, upper=1> gid[4526];     // gender id, 0/1, for each applicant
  int<lower=1, upper=6> dept_id[4526]; // department id, 1..6, for each applicant
  matrix[8000, 12] a;                  // posterior logit samples, one column per dept/gender pair
}
parameters {
}
model {
}
generated quantities {
  matrix[8000, 4526] y_pred;
  for (n in 1:4526) {
    for (i in 1:8000) {
      // "male" for department j is column 2j - 1, so gid = 1 selects the male column
      y_pred[i, n] = inv_logit(a[i, 2 * dept_id[n] - gid[n]]);
    }
  }
}
I sampled in R using
sampling(...,
         chains = 1,
         iter = 1,
         algorithm = 'Fixed_param')
What I am trying to do should not be this computationally or storage intensive. Is there something I am missing about the computations going on here?
Thanks
Mike
In fact, this is what the link() function in the rethinking R package does. It takes about a second to do this, so there must be an inefficiency in my code, or it is doing something I did not intend.
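For reference, here is a rough sketch of the equivalent computation done directly in R (my guess at what link() is doing internally; it assumes a, gid, and dept_id are already in the session with the dimensions above):

# Sketch: vectorized version of the generated quantities block above.
# Assumes `a` (8000 x 12), `gid` (0/1), and `dept_id` (1..6) as described.
cols   <- 2 * dept_id - gid   # "male" for department j is column 2j - 1
y_pred <- plogis(a[, cols])   # 8000 x 4526 matrix of acceptance probabilities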
You say that it shouldn't be data storage intensive, but you're iterating over and storing 8000 x 4526 = 36,208,000 estimates, which is far from trivial. Are these dimensions correct/necessary?
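To put a rough number on that (a back-of-the-envelope estimate only, not a measurement of your run):

8000 * 4526            # 36,208,000 generated quantities per iteration
8000 * 4526 * 8 / 1e9  # ~0.29 GB as raw doubles; Stan writes its output as
                       # CSV text, which inflates this several-fold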
Thanks for the reference. That is the example in 11.1.4, though I did not see the counterfactual analysis in the book, so it may not be there; it was in the video. Essentially I am trying to reverse-engineer what the link() function does, in R or Stan. This reference by Kurz is essentially that for brms. The idea was to use the model posterior samples arising from the actual data to generate stratified/counterfactual predictions, in this case pretending all 4526 applicants were male, with the number of applicants to each department equal to the total applicants to that department in the actual data. Hence the 4526 predictions. I'll certainly check the Kurz reference for guidance.
For your actual example, the link() function is faster than your implementation because it's not operating on all of the posterior samples in the way that you are. The link() function generates predictions at each iteration for the given model specification.
This means that you instead only need to iterate over the posterior distributions for the coefficients, generating your model predictions by specifying the covariate value of interest (i.e. male). It's also in this same step that you would apply the inverse-logit function.
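A minimal sketch of that idea in R, reusing the a, gid, and dept_id objects from the original post (the column indexing assumes the male-is-column-2j-1 layout described there):

# Counterfactual "all applicants male": within a department, every applicant
# gets the same acceptance probability, so only the 6 male columns are needed.
a_male <- a[, seq(1, 11, by = 2)]  # male columns 2j - 1 for departments 1..6
p_male <- plogis(a_male)           # 8000 x 6 matrix of acceptance probabilities
y_pred <- p_male[, dept_id]        # expand to 8000 x 4526 by indexing, no loops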
Thanks @andrjohns for the insights! I am a little unclear, though, on how iterating “over the posterior distributions for the coefficients” would not involve iterating over each of the posterior samples. Maybe I just need to reread it carefully a few times?
Ah I see what you mean, I missed the counterfactual aspect! This section has a counterfactual example that might be helpful: 11 God Spiked the Integers | Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition
Thanks again, will check that out!