# How to reuse PyStan model to predict new dataset

Hi PyStan Experts,

I’m starting to explore PyStan for bayesian regression model (with the help from Bayesian Models for Astrophysical Data). In short, I’m assuming variable X1 and X2 are related to y where y follows negative binomial distribution. My problem is I still haven’t figured out how to predict y using new dataset (i.e. only X1 and X2 variables are available) but with the Stan model that I have trained. Do you have any suggestion on how to solve my issue?

``````"""data{
int<lower=0> N;
int<lower=1> K;
matrix[N,K] X;
int y[N];
}
parameters{
vector[K] beta;
real<lower=0> theta;
}
model{
vector[N] mu;

mu = exp(X*beta);
theta ~ gamma(0.001, 0.001);

for (i in 1:N) y[i] ~ neg_binomial_2(mu[i], theta);

}
generated quantities{
real dispersion;
vector[N] expY;
vector[N] varY;
vector[N] PRes;
vector[N] mu2;
vector[N] y_predict;

mu2 = exp(X * beta);
expY = mu2;

for (j in 1:N){
y_predict[j] = neg_binomial_2_rng(mu2[j], theta);
varY[j] = mu2[j] + pow(mu2[j], 2) / theta;
PRes[j] = pow((y[j] - expY[j]) / sqrt(varY[j]), 2);
}
dispersion = sum(PRes) / (N - (K + 1));
}

"""
``````

PS. I have tried to used generated quantities block, but obviously it won’t work if I only have X1 and X2 in the new dataset (since I don’t know yet the value of y for prediction stage). Indeed, when I only input X1 and X2, this error message appears

‘RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=y; base type=int (in ‘unknown file name’ at line 6)’

Novia

See subsection “Programming predictions” in section 9.14 of the manual.

You do want to use the generated quantities block and define `y2` there then use the RNG functions to generate the `y2`, where `N_new` is the number of new predictions and `X_new` is the data matrix.

``````generated quantities {
int y[N_new];
{
vector[N_new] mu = exp(X_new * beta);
for (n in 1:N_new)
y[n] = neg_binomial_2_rng(mu[n], theta);
}
}
``````

The nested brackets are to make `mu` a local variable so it doesn’t get saved.

We don’t recommend those `gamma(0.001, 0.001)` priors—see the prior choice Wiki or manual chapter on regression.

You can also vectorize your density as

``````y ~ neg_binomial_2(mu, theta);
``````

And then you can recognize that we have `neg_binomial_2_log` that takes `mu` on the log scale, so you can just forget about defining the local variable and reduce your entire model to just

``````theta ~ gamma(1e-3, 1e-3);  // may introduce positive bias!
y ~ neg_binomial_2_log(X * beta, theta);
``````

Thanks for your fast response @Bob_Carpenter. Will try your suggestion right away.