Generated quantities for prediction

Hey team, I’m still new to Stan. I’m struggling to understand how to use the generated quantities block to get inferences for new data points. I’ll use rstan as my interface.

Stan program

For simplicity, let’s use the example program and the eight schools data from the RStan Getting Started guide.

// saved as schools.stan
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}
parameters {
  real mu; 
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;
}
model {
  target += normal_lpdf(eta | 0, 1);
  target += normal_lpdf(y | theta, sigma);
}

Example

library(rstan)

data = list(J = 8,
            y = c(28, 8, -3, 7, -1, 1, 18, 12),
            sigma = c(15, 10, 16, 11, 9, 11, 10, 18))

# new schools:
new_schools = list(K=3, y_tilde=c(15, 25, 23), sigma_tilde=c(10, 9, 18))

fit1 = stan(
    file = "schools.stan",
    data = data,
    chains = 4,
    cores = 4,
    warmup = 1000,
    iter = 2000
)

I would like to take draws of p(\tilde{\theta}_k | \mu, \tau, \tilde{y}_k) to summarize the posterior predictions for the new schools. I understand that to do so, I first need draws from the posterior of (\mu, \tau) and then draws from p(\tilde{\theta}_k | \mu, \tau), but I’m unsure how to use the generated quantities block to do so.

I have reviewed the documentation and other questions on this forum (1, 2, 3) but am still confused about how to go about this.

Any help is much appreciated, thanks!

In the generated quantities block, you get a draw for the new school with eta_new = std_normal_rng(), multiply this by tau to get the effect of that school, and then use normal_rng(mu + tau * eta_new, sigma) with the new school’s sigma to get draws from the posterior predictive distribution. Sorry, typing from my phone so a bit brief.

@mhollanders Thanks for the response. But I’m sorry, I’m not following.

Firstly, your suggestion does not seem to make use of the new data. Instead, my understanding is that you suggest eta is drawn at random rather than conditioned on the new observations:

new_schools = list(K=3, y_tilde=c(15, 25, 23), sigma_tilde=c(10, 9, 18))

Secondly (and please forgive my Stan n00b-ness), I’m not sure how to properly specify the Stan program you describe. I would appreciate any guidance! The code below produces an error; I’m not sure how to properly include mu and tau in the data block, and my various attempts produced different errors.

data {
    int<lower=0> K;
    int<lower=0> n_draws;
    real y_new[K];
    real<lower=0> sigma_new[K];
}
generated quantities {
    matrix[n_draws, K] theta_draws;
    real eta_new[K];
    for (k in 1:K) {
        eta_new[k] = std_normal_rng();
        for (n in 1:n_draws) {
            theta_draws[n, k] = normal_rng(mu + tau * eta_new, sigma_new[k]);
        }
    }
}
# error:
Semantic error in 'string', line 13, column 43 to column 45:
   -------------------------------------------------
    11:          eta_new[k] = std_normal_rng();
    12:          for (n in 1:n_draws) {
    13:              theta_draws[n, k] = normal_rng(mu + tau * eta_new, sigma_new[k]);
                                                    ^
    14:          }
    15:      }
   -------------------------------------------------

Identifier 'mu' not in scope. Did you mean 'K'?

Alternative in R

Alternatively, I can consider the conditional posterior and code this in R, but the variance of the draws seems high:

\theta_k \mid \mu, \tau, y_k \sim N\left( E(\theta_k \mid \mu, \tau, y_k),\ \sigma_k^2 + \tau^2 \right)

where

  • B_k = \frac{1/\tau^2}{1/\sigma_k^2 + 1/\tau^2} and
  • E(\theta_k \mid \mu, \tau, y_k) = (1 - B_k)\, y_k + B_k \mu

eb_post = function(stanfit, new_y, new_sigma, ndraws = 2000) {
    ddf = rstan::extract(stanfit)
    # plug in the posterior means of tau and mu
    tau_hat = mean(ddf$tau)
    mu_hat = mean(ddf$mu)
    # shrinkage factor B_k and conditional mean for each new school
    b_j_new = (1 / tau_hat^2) / (1 / new_sigma^2 + 1 / tau_hat^2)
    eb_new = (1 - b_j_new) * new_y + b_j_new * mu_hat
    # predictive sd, as stated above
    p_sigma = sqrt(new_sigma^2 + tau_hat^2)
    # draws for each new school
    mat = matrix(nrow = ndraws, ncol = length(new_y))
    for (j in seq_along(new_y)) {
        mat[, j] = rnorm(ndraws, eb_new[j], p_sigma[j])
    }
    return(mat)
}

post_mat = eb_post(fit1, new_schools$y_tilde, new_sigma = new_schools$sigma_tilde, ndraws = 2000)
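
For completeness, here is a sketch of a variant that uses every posterior draw of (\mu, \tau) rather than their means, so that hyperparameter uncertainty is propagated (same predictive sd as above; the function name is just a placeholder):

eb_post_draws = function(stanfit, new_y, new_sigma) {
    ddf = rstan::extract(stanfit)  # all posterior draws of mu and tau
    n = length(ddf$mu)
    mat = matrix(nrow = n, ncol = length(new_y))
    for (j in seq_along(new_y)) {
        # shrinkage factor and conditional mean, one value per posterior draw
        b_j = (1 / ddf$tau^2) / (1 / new_sigma[j]^2 + 1 / ddf$tau^2)
        eb_j = (1 - b_j) * new_y[j] + b_j * ddf$mu
        mat[, j] = rnorm(n, eb_j, sqrt(new_sigma[j]^2 + ddf$tau^2))
    }
    return(mat)
}

post_mat2 = eb_post_draws(fit1, new_schools$y_tilde, new_schools$sigma_tilde)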

Yeah sorry, I didn’t read it right. I guess I don’t understand how you want to make posterior predictions for observed data, where normally you condition your parameters on the observed data. Say you wanted to make posterior predictions for an unobserved school. Your Stan program for that would become:

data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
  real<lower=0> sigma_new;  // sd assumed for the new, unobserved school
}
parameters {
  real mu; 
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;
}
model {
  target += normal_lpdf(eta | 0, 1);
  target += normal_lpdf(y | theta, sigma);
}
generated quantities {
  real eta_rep = std_normal_rng();
  real theta_rep = mu + tau * eta_rep;                 // effect of the new school
  real new_school = normal_rng(theta_rep, sigma_new);  // predicted observation
}

If you have observed data for the schools, why wouldn’t you estimate your parameters for them?

Thanks @mhollanders !

Let me try to clarify: In my scenario, I have data to fit a model (say several hundred data points). Then I get a new data point. So, ideally I’d like to update the parameters for the existing model (without fully refitting) given the new single data point. Or, alternatively, not update the model and just use it for posterior inference on the new data point.

So, yes, I do want to estimate parameters for new observed schools, but without fully refitting the model. I’m not sure how to do this in Stan.

OK. You could do this by updating the current posterior with a single observation, but you’d have to set the prior in the new model to be the posterior of the old model, which isn’t trivial. You’re best off just fitting the model to all of the data.

Could you say a bit more about your use case, perhaps? In my view, updating the fitted model with a single new datapoint is rather different from using the fitted model to infer something about a new datapoint. When you say it’s much the same to you, I’m guessing there is something about your use case that I’m not understanding.

If assessing the predictive accuracy of your fitted model for future data is relevant, that’s the perfect job for leave-one-out cross-validation. But it’s hard to tell whether that’s relevant when we don’t know what you’re trying to achieve.

@erognli Sure… and yes, I use loo::loo_compare() to assess choices of model in the initial fitting procedure.
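
Roughly like this, assuming each candidate model stores pointwise log_lik in its generated quantities block (fit1 and fit2 are illustrative fits):

# extract the pointwise log-likelihood matrices from each candidate fit
ll1 = loo::extract_log_lik(fit1)
ll2 = loo::extract_log_lik(fit2)
# compare estimated out-of-sample predictive performance
loo::loo_compare(loo::loo(ll1), loo::loo(ll2))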

My use case is that I’m conducting a meta-analysis of historical experiments in order to produce shrinkage estimators of the effects (e.g. Azevedo et al. 2019, link).

These shrinkage models are then applied to future experiments. I’m trying to decide how best to implement this application, which will be run by non-experts. One way would be to update/refit the model with each successive experiment. I agree with you that this is technically preferable, but I believe it is less practical (though perhaps you disagree). I believe it will be easier to implement, if less correct, to simply apply the previously fitted model to the new experiment.

So, this is the context behind my OP asking about using generated quantities purely for a new data point. I’m quite happy to hear a different recommendation. Thanks in advance for your help!

Ok, that made it a lot clearer! I’m probably not your best source of advice, but I agree that refitting the model for each new experiment might be excessive if there are lots of them.

Perhaps a reasonable compromise would be to refit the model every time the number of new experiments exceeds some proportion of those already included in fitting the model? If your context of deployment allows for that kind of versioning, that is.

If you are able to switch to cmdstanr, you could use the $generate_quantities() method of a CmdStanModel object to get predictions for a new experiment without refitting the model. You’d have to include a variable in the data block for the new data, and use that in the generated quantities block.
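
Something like this, perhaps (a sketch: the file name schools_gq.stan and the names sigma_tilde, theta_tilde, and y_tilde_rep are made up for illustration, and the parameters block must match schools.stan exactly):

// saved as schools_gq.stan (illustrative name)
data {
  int<lower=0> J;                  // number of original schools (needed for eta)
  int<lower=0> K;                  // number of new schools
  vector<lower=0>[K] sigma_tilde;  // standard errors of the new schools
}
parameters {
  // must match the parameters block of schools.stan exactly
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}
generated quantities {
  vector[K] theta_tilde;  // effects of the new schools
  vector[K] y_tilde_rep;  // predicted observations for the new schools
  for (k in 1:K) {
    real eta_tilde = std_normal_rng();
    theta_tilde[k] = mu + tau * eta_tilde;
    y_tilde_rep[k] = normal_rng(theta_tilde[k], sigma_tilde[k]);
  }
}

And on the R side:

library(cmdstanr)

mod = cmdstan_model("schools.stan")
fit = mod$sample(data = data, chains = 4)

gq_mod = cmdstan_model("schools_gq.stan")
gq_fit = gq_mod$generate_quantities(
    fitted_params = fit,
    data = list(J = 8, K = 3, sigma_tilde = c(10, 9, 18))
)
gq_fit$draws("theta_tilde")  # one draw per posterior draw of (mu, tau, eta)

Note that this draws eta_tilde from its N(0, 1) prior, so it gives the posterior predictive for an unobserved school; it does not condition on the new schools’ observed y_tilde the way your conditional-posterior approach in R does.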