Posterior predictive simulation and summary by group in generated quantities block

I have a program with a vector indicating group membership (group_id) and a continuous response to model (y). I want to generate predictions from the estimated \mu for new data with its own length and grouping (N_New, group_id_new). In addition, I want to calculate some derived quantities, e.g. the mean predicted difference between groups (Group_2 - Group_1) and a few others. Is there a way to index y_rep by group so that I can create these quantities of interest inside the generated quantities block, or is this something I need to do outside the Stan program? Here is a simple example assuming 3 groups.
Hopefully the Stan code below is clear, and your responses won't make me look dumb.

data {
  int<lower = 1> N;                                  // data sample size
  int<lower = 2> N_g;                                // number of groups
  vector[N] y;                                       // continuous response
  int<lower = 1, upper = N_g> group_id[N];           // group ID for each response

  int<lower = 1> N_New;                              // prediction sample size
  int<lower = 1, upper = N_g> group_id_new[N_New];   // group ID for each predicted response
}

parameters {
  vector[N_g] mu;                  // estimated group means
  vector<lower = 0>[N_g] sigma;    // estimated group SDs
}

model {
  // priors
  mu    ~ normal(0, 100);                       
  sigma ~ cauchy(0, 2.5);

  // Model
  for (n in 1:N) {
    y[n] ~ normal(mu[group_id[n]], sigma[group_id[n]]);           
  }
}

generated quantities {
  vector[N_New] y_rep;                           // posterior predictive
  vector[N_g] mu_pred;                          // mean of predicted values (I may be defining this incorrectly, need help) 
  
  for (p in 1:N_New) {
    y_rep[p] = normal_rng(mu[group_id_new[p]], sigma[group_id_new[p]]);
  }
  
  // This is where I need help.
  // I would like mu_pred computed per group so that I can form
  // other quantities, e.g. mu_pred[2] - mu_pred[1], etc.
  // My attempt below doesn't work: y_rep[group_id_new[g]] picks out a
  // single element of y_rep, not the subset of y_rep for group g.

  for (g in 1:N_g) {
    mu_pred[g] = mean(y_rep[group_id_new[g]]);
  }

}

****

Not sure if I'm tracking 100% what you are doing, but it seems like you have N observations, each of which belongs to one of N_g groups, and you are fitting independent normal distributions to those groups of observations?

So if I_n \in \{1, \dots, N_g\} is the group identifier of observation n, then you have:

y_n | \mu, \sigma^2 \sim \mathcal{N}(\mu_{I_n}, \sigma^2_{I_n})

If your goal is to compare E(y_n - y_m), note that E(y_n - y_m) = E(y_n) - E(y_m) = \mu_{I_n} - \mu_{I_m}, so you can compute this directly in the generated quantities block. For example, if I_n = 1 and I_m = 2, the expected difference is just \mu_1 - \mu_2, and you don't need to fiddle about with the various indexes.
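
Concretely, a minimal sketch of that in Stan (mu_diff_21 is just an illustrative name, not something from your program):

generated quantities {
  real mu_diff_21 = mu[2] - mu[1];   // posterior draws of mu[2] - mu[1]
}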

****

Your understanding is correct. However, what I am trying to summarize is the mean of the predicted (unobserved) y vector, analogous to event probability estimation. Here the goal is to compare E(\tilde{y}_{n_{new}} - \tilde{y}_{m_{new}}).

****

Still a bit confused about what you're going for here.

Suppose the new value \tilde{y} that you want to examine has group id 1. Then \tilde{y} \sim \mathcal{N}(\mu_1, \sigma_1^2), so E(\tilde{y}) = \mu_1. There's no difference between mu_pred and mu in your model: you fit mu to your data set, generate new data using that mu, and then take group means to get mu_pred, which is just mu.

****

Theoretically, yes. But the sample mean of the predicted draws will not be exactly equal to \mu_1; I want to assess whether my predicted sample can recover it. Note also that the sample size of the new, unobserved y_{pred} may differ from N.

****

With the normal model in your example, the group means are just mu: samples y[1:N] drawn from normal(mu, sigma) have an expected mean of mu, so you can skip the sampling and do this directly:

generated quantities {
  matrix[N_g, N_g] mu_diff;
  for (m in 1:N_g)
    for (n in 1:N_g)
      mu_diff[m, n] = mu[m] - mu[n];
}

When you are done, averaging the mu_diff[m, n] draws gives the posterior expectation \mathbb{E}[\mu_m - \mu_n \mid \textrm{data}] under your model.
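
That said, if you do want the finite-sample mean of the simulated y_rep within each group (e.g. to check how well a new sample of size N_New recovers mu), one way is to accumulate per-group sums and counts inside generated quantities. A sketch along those lines, assuming every group appears at least once in group_id_new (otherwise the division is undefined); group_sum and group_n are local helper names, not part of the original program:

generated quantities {
  vector[N_New] y_rep;   // posterior predictive draws
  vector[N_g] mu_pred;   // finite-sample mean of y_rep within each group
  {
    vector[N_g] group_sum = rep_vector(0, N_g);   // running sum of y_rep per group
    vector[N_g] group_n = rep_vector(0, N_g);     // count of new observations per group
    for (p in 1:N_New) {
      y_rep[p] = normal_rng(mu[group_id_new[p]], sigma[group_id_new[p]]);
      group_sum[group_id_new[p]] += y_rep[p];
      group_n[group_id_new[p]] += 1;
    }
    mu_pred = group_sum ./ group_n;   // element-wise division
  }
}

Pairwise differences such as mu_pred[2] - mu_pred[1] can then be formed just like mu_diff above, and each should match the corresponding difference in mu up to predictive noise on the order of sigma divided by the square root of the group's new sample size.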