How to reuse PyStan model to predict new dataset

Novia_Listiyani · February 16, 2018, 12:23am

Hi PyStan Experts,

I’m starting to explore PyStan for bayesian regression model (with the help from Bayesian Models for Astrophysical Data). In short, I’m assuming variable X1 and X2 are related to y where y follows negative binomial distribution. My problem is I still haven’t figured out how to predict y using new dataset (i.e. only X1 and X2 variables are available) but with the Stan model that I have trained. Do you have any suggestion on how to solve my issue?

"""data{
    int<lower=0> N;
    int<lower=1> K;
    matrix[N,K] X;
    int y[N]; 
}
parameters{
    vector[K] beta;
    real<lower=0> theta;
}
model{
    vector[N] mu;
    
    mu = exp(X*beta);
    theta ~ gamma(0.001, 0.001);
    
    for (i in 1:N) y[i] ~ neg_binomial_2(mu[i], theta);
    
}
generated quantities{
    real dispersion;
    vector[N] expY;
    vector[N] varY;
    vector[N] PRes;
    vector[N] mu2;
    vector[N] y_predict;
    
    mu2 = exp(X * beta);
    expY = mu2;
    
    for (j in 1:N){
        y_predict[j] = neg_binomial_2_rng(mu2[j], theta);
        varY[j] = mu2[j] + pow(mu2[j], 2) / theta;
        PRes[j] = pow((y[j] - expY[j]) / sqrt(varY[j]), 2);
    }
    dispersion = sum(PRes) / (N - (K + 1)); 
}

"""

PS. I have tried to used generated quantities block, but obviously it won’t work if I only have X1 and X2 in the new dataset (since I don’t know yet the value of y for prediction stage). Indeed, when I only input X1 and X2, this error message appears

‘RuntimeError: Exception: int variable contained non-int values; processing stage=data initialization; variable name=y; base type=int (in ‘unknown file name’ at line 6)’

Thanks in advance.
Novia

Bob_Carpenter · February 17, 2018, 8:55pm

See subsection “Programming predictions” in section 9.14 of the manual.

You do want to use the generated quantities block and define y2 there then use the RNG functions to generate the y2, where N_new is the number of new predictions and X_new is the data matrix.

generated quantities {
  int y[N_new];
  {
    vector[N_new] mu = exp(X_new * beta);
    for (n in 1:N_new)
      y[n] = neg_binomial_2_rng(mu[n], theta);
  }
}

The nested brackets are to make mu a local variable so it doesn’t get saved.

We don’t recommend those gamma(0.001, 0.001) priors—see the prior choice Wiki or manual chapter on regression.

You can also vectorize your density as

y ~ neg_binomial_2(mu, theta);

And then you can recognize that we have neg_binomial_2_log that takes mu on the log scale, so you can just forget about defining the local variable and reduce your entire model to just

theta ~ gamma(1e-3, 1e-3);  // may introduce positive bias!
y ~ neg_binomial_2_log(X * beta, theta);

Novia_Listiyani · February 18, 2018, 1:52pm

Thanks for your fast response @Bob_Carpenter. Will try your suggestion right away.

Topic		Replies	Views
Pystan sampling from the posterior predictive PyStan	8	3055	October 30, 2020
How to reuse the trained model to make prediction and get the predicted value Modeling techniques , fitting-issues	5	2278	September 18, 2019
Learning Pystan and vectorizing models Modeling	2	884	July 3, 2017
How to force PyStan to recompile a stan model? PyStan	12	1130	March 1, 2022
Posterior predictive plot in when using PyStan General pystan , python , arviz	4	1870	June 4, 2021

How to reuse PyStan model to predict new dataset

Related topics