Posterior predictive simulation by subgroup


I have been running posterior predictive simulations for certain subgroups of the original data and have run into some coding inefficiency problems. Suppose we have a hierarchical logistic regression model for 5 districts, and we would like to do posterior predictive simulation for each district separately. I don't know whether Stan has a data structure like R's list (I am an R user), so my current strategy is to split the original data into five pieces with their corresponding predictors. For example, I have n data points in total and the predictor X has length n, but I split the data into n1, x1, …, n5, x5 and pass them as separate inputs in the 'data' block. I then define five y_pred vectors of lengths n1, n2, n3, n4 and n5 and do the posterior predictive simulation separately for each of the five.

However, this approach is inefficient and gets messy once there are more subgroups. Does Stan have a more efficient way to accomplish this?


The way to do this in Stan is to lump the data from all five districts into one vector and then include an array of integers indicating which group each element of the vector belongs to.

So you might originally have:

vector[2] v1 = [ 1.0, 2.0 ]';
vector[2] v2 = [ 3.0, 4.0 ]';

And you can recode that as:

vector[4] v12 = [ 1.0, 2.0, 3.0, 4.0 ]';
array[4] int group = { 1, 1, 2, 2 };

So the first two elements of v12 belong to group 1 and the next two belong to group 2.
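Applied to the district problem, the group-index idea might look something like the sketch below. This is only an illustration under assumptions: the names (N, K, x, district, y, alpha, beta, n_pred), the single predictor, the varying-intercept structure and the priors are all made up here and are not taken from the original model. It uses the `array[N] int` declaration syntax of recent Stan versions (older versions write `int district[N]`).

```stan
data {
  int<lower=1> N;                           // total number of observations
  int<lower=1> K;                           // number of districts (5 in the question)
  vector[N] x;                              // one predictor (assumed for illustration)
  array[N] int<lower=1, upper=K> district;  // district index of each observation
  array[N] int<lower=0, upper=1> y;         // binary outcome
}
parameters {
  real mu_alpha;                            // population-level intercept
  real<lower=0> sigma_alpha;                // between-district sd
  vector[K] alpha;                          // district intercepts
  real beta;                                // slope shared across districts
}
model {
  // Placeholder priors, chosen only so the sketch is complete.
  mu_alpha ~ normal(0, 2);
  sigma_alpha ~ normal(0, 1);
  beta ~ normal(0, 2);
  alpha ~ normal(mu_alpha, sigma_alpha);
  // alpha[district] picks the right intercept for every observation.
  y ~ bernoulli_logit(alpha[district] + beta * x);
}
generated quantities {
  // One long y_pred instead of five separate vectors.
  array[N] int y_pred = bernoulli_logit_rng(alpha[district] + beta * x);
  // Per-district summary: predicted number of successes in each district.
  vector[K] n_pred = rep_vector(0, K);
  for (n in 1:N) {
    n_pred[district[n]] += y_pred[n];
  }
}
```

With this layout, adding more subgroups only changes K and the contents of district, not the program. The draws for a single district can also be pulled out afterwards in R by subsetting the columns of y_pred with the same district vector.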

You can also use a ragged-array style encoding. Again you store everything in one big vector, but instead of keeping an array that says which group each value belongs to, you record the start position of each group (this requires that your values be sorted by group rather than scattered randomly).
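Here is a small sketch of the start-position encoding using the same four values as above. The names start, group_size and group_mean are made up for illustration, and storing the group sizes alongside the start positions is my own choice (they could also be derived from the start positions). Since the program has no parameters, it would need to be run with the fixed_param sampler.

```stan
transformed data {
  vector[4] v12 = [1.0, 2.0, 3.0, 4.0]';  // all values, sorted by group
  array[2] int start = {1, 3};            // index where each group begins
  array[2] int group_size = {2, 2};       // number of elements in each group
}
generated quantities {
  vector[2] group_mean;
  for (k in 1:2) {
    // segment(v, i, n) returns the n elements of v starting at index i.
    group_mean[k] = mean(segment(v12, start[k], group_size[k]));
  }
}
```

Here group_mean comes out as 1.5 for group 1 and 3.5 for group 2. The same segment call works for slicing a long y_pred into per-group pieces when the groups have different lengths.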

There is an example in the thread "Binomial_lpmf: Probability parameter[1] is 1, but must be in the interval [0, 1]", and docs here:

Thanks a lot!