How to reuse PyStan model to predict new dataset


Hi PyStan Experts,

I’m starting to explore PyStan for bayesian regression model (with the help from Bayesian Models for Astrophysical Data). In short, I’m assuming variable X1 and X2 are related to y where y follows negative binomial distribution. My problem is I still haven’t figured out how to predict y using new dataset (i.e. only X1 and X2 variables are available) but with the Stan model that I have trained. Do you have any suggestion on how to solve my issue?

    int<lower=0> N;
    int<lower=1> K;
    matrix[N,K] X;
    int y[N]; 
    vector[K] beta;
    real<lower=0> theta;
    vector[N] mu;
    mu = exp(X*beta);
    theta ~ gamma(0.001, 0.001);
    for (i in 1:N) y[i] ~ neg_binomial_2(mu[i], theta);
generated quantities{
    real dispersion;
    vector[N] expY;
    vector[N] varY;
    vector[N] PRes;
    vector[N] mu2;
    vector[N] y_predict;
    mu2 = exp(X * beta);
    expY = mu2;
    for (j in 1:N){
        y_predict[j] = neg_binomial_2_rng(mu2[j], theta);
        varY[j] = mu2[j] + pow(mu2[j], 2) / theta;
        PRes[j] = pow((y[j] - expY[j]) / sqrt(varY[j]), 2);
    dispersion = sum(PRes) / (N - (K + 1)); 


PS. I have tried to used generated quantities block, but obviously it won’t work if I only have X1 and X2 in the new dataset (since I don’t know yet the value of y for prediction stage). Indeed, when I only input X1 and X2, this error message appears

‘RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=y; base type=int (in ‘unknown file name’ at line 6)’

Thanks in advance.


See subsection “Programming predictions” in section 9.14 of the manual.

You do want to use the generated quantities block and define y2 there then use the RNG functions to generate the y2, where N_new is the number of new predictions and X_new is the data matrix.

generated quantities {
  int y[N_new];
    vector[N_new] mu = exp(X_new * beta);
    for (n in 1:N_new)
      y[n] = neg_binomial_2_rng(mu[n], theta);

The nested brackets are to make mu a local variable so it doesn’t get saved.

We don’t recommend those gamma(0.001, 0.001) priors—see the prior choice Wiki or manual chapter on regression.

You can also vectorize your density as

y ~ neg_binomial_2(mu, theta);

And then you can recognize that we have neg_binomial_2_log that takes mu on the log scale, so you can just forget about defining the local variable and reduce your entire model to just

theta ~ gamma(1e-3, 1e-3);  // may introduce positive bias!
y ~ neg_binomial_2_log(X * beta, theta);


Thanks for your fast response @Bob_Carpenter. Will try your suggestion right away.