Specifying the number of samples for rng

I am trying to simulate 10 new data points from a normal(0, 10) in the generated quantities block. From other topics, it sounds like this is the preferred method for posterior simulation with Stan. However, when I fit and extract my model with the following code, it looks like it treats x_new as a parameter vector, with each element containing 4000 samples from normal(0, 10).

generated quantities{
  vector[10] x_new;
  for (i in 1:10)
    x_new[i] = normal_rng(0, 10);
}

There must be something simple I’m overlooking. Thanks for any help.


Generated quantities gets run for every sample. It doesn’t just get run once.

By default (2k total iterations with 1k warmup, 4 chains), that's (2000 - 1000) draws per chain × 4 chains = 4000 post-warmup draws, so generated quantities runs 4000 times.

You want it to work like this (running once per draw) so that your generated quantities get averaged over all the parameter values in your posterior sample.
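
For instance, a quantity sketched like the one below (mu and sigma are just placeholders for whatever parameters your model has) gets re-generated with each draw's parameter values, so summarizing it across all draws averages over the posterior:

generated quantities {
  vector[10] x_new;
  for (i in 1:10)
    x_new[i] = normal_rng(mu, sigma); // uses the current draw's parameter values
}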


I see. So “sample” in the last line of the generated quantities description in the manual (pg 108) refers to a single sample, not the full sample of 4000.

The generated quantities program block is rather different than the other blocks. Nothing in the generated quantities block affects the sampled parameter values. The block is executed only after a sample has been generated.

I think I managed to generate my random sample of 10 by declaring it in the transformed data block, below.

transformed data{
  vector[10] x_new;
  for (i in 1:10) {
    x_new[i] = normal_rng(0, 10);
  }
}

But this brings up a more general question of where to do posterior summarization. Do you recommend doing this outside Stan? Again, the manual says generated quantities can be used to calculate posterior expectations, but I see that fitting the following model also generates 4000 samples of the posterior expectation, which doesn't make sense to me, since an expectation should be a single real value.

data {
  int NT;
  vector[NT] x;
  vector[NT] y;
}
transformed data{
  vector[10] x_new;
  for (i in 1:10) {
    x_new[i] = normal_rng(0, 10);
    }
}
parameters{
  real<lower = 0> beta;
  real<lower = 0> sigma;
}
model{
  target += normal_lpdf(y | beta * x, sigma);
}
generated quantities{
  vector[10] y_pred;
  real E_y_pred; // Expectation of posterior prediction
  for (i in 1:10)
    y_pred[i] = normal_rng(beta * x_new[i], sigma);
  E_y_pred = mean(y_pred);
}

Thanks again for your help, and apologies if this is drifting away from the titled topic.

But this brings up a more general question of where to do posterior summarization.

It’s a mix. Do it wherever it is easiest/makes the most sense. But usually if you already wrote the model in Stan it’s easy to just copy-paste a few things and compute whatever you want in the generated quantities.

generated quantities{
  vector[10] y_pred;
  real E_y_pred; // Expectation of posterior prediction
  for (i in 1:10)
    y_pred[i] = normal_rng(beta * x_new[i], sigma);
  E_y_pred = mean(y_pred);
}

You could do this, but it isn’t the biggest use case. You do your statistics on the output of generated quantities, mostly. You can do it inside, but the simpler case is outside (if that makes any sense…). More like this:

generated quantities {
  real y_pred = normal_rng(beta * x_new, sigma);
}

Then whenever the fit finishes, you can just print the fit (using whatever interface you’re using) and see the posterior statistics on that (mean of y_pred, sd, confidence intervals), or make plots of samples of y_pred directly.

But it seems strange that you’d only have one y value you’re modeling.

Generated quantities usually mirror the model section, in that you use the parameters from the sample you’re in to generate data that (hopefully) looks like the stuff you measure. It’s all about posterior predictive stuff.

So if the model (not using vectorization) looks like:

for(n in 1:N) {
  y[n] ~ normal(mu, sigma);
}

Then the generated quantities looks like:

vector[N] y_pred;
for(n in 1:N) {
  y_pred[n] = normal_rng(mu, sigma);
}

That’ll leave you with 4000 copies of your dataset you can plot and analyze and whatever. It’s a lot of data. If you have anything other than a small model, you probably gotta be tactful with the generated stuff to not overload yourself.

Many thanks. This is very helpful!

Just for context, the end goal of this current project is to conduct a full Bayesian decision analysis, which will require maximizing an expected utility function over a set of possible decisions. Each decision will change x_new, and generate a different posterior prediction. This is why I was playing around with generating random samples of x_new.
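
Very roughly, the kind of thing I had in mind inside generated quantities looks like the sketch below; the candidate decision values and the utility function are just placeholders:

generated quantities {
  // Sketch only: evaluate a few candidate decisions in each draw, so expected
  // utility per decision could be estimated afterwards by averaging over draws.
  vector[3] x_decision = [-5, 0, 5]'; // hypothetical candidate decisions
  vector[3] utility;
  for (d in 1:3) {
    real y_d = normal_rng(beta * x_decision[d], sigma); // prediction under decision d
    utility[d] = -square(y_d - 1); // placeholder utility function
  }
}

The expected utility for each decision would then come from averaging utility over all the draws.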

Originally I was hoping to do all of this in the generated quantities block, but you're right, this'll probably overload my RAM, even if I can somehow make it work. I think I'll have to modify my approach to extract the samples, and finish the decision analysis calculations in R or Python.

Thanks again!

which will require maximizing an expected utility function

Yeah, depending on how fancy that is, might be easier to do outside of the Stan model.

Be sure to use the generated quantities block to generate your random numbers, though. Between different software packages it's easy to end up with different parameterizations of the same distributions (and hence bugs) if you aren't super careful (for instance, the normal is sometimes parameterized by variance or precision rather than by the standard deviation that Stan uses).

It gets run once per draw. Andrew’s trying to get us to use “draw” to mean the result of one iteration and “sample” to mean the collection of draws.

Thanks. I’ll fix that.

Yes. Everything within Stan is on a per-iteration basis—there’s no way from within the Stan language to access multiple draws.