Generated Quantities for prediction data

Hello,

I’m very new to Stan, but am excited to get started. I am trying to convert a random effects model from pymc3 into Stan.

I have been able to use pystan to compile a random effects model and draw posterior samples from the fitted distribution. See below for the code:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictors
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
}
parameters {
  real<lower=2> nu;
  real<lower=0> inv_phi;
  vector[O] beta;
  vector[K] gamma;
  vector[J] group_mu_raw;
  real<lower=0> sigma_group_mu;
}
transformed parameters {
  vector[J] group_mu = group_data * gamma + sigma_group_mu * group_mu_raw;
  vector[N] mu =
    group_mu[group_idx] +
    observation_data * beta;

  real phi = 1 / inv_phi;
}
model {
  response ~ student_t(nu, mu, phi);

  nu ~ gamma(2, 0.1);
  inv_phi ~ student_t(4, 0, 5);

  beta ~ student_t(4, 0, 5);
  gamma ~ student_t(4, 0, 5);

  group_mu_raw ~ student_t(4, 0, 5);  // std_normal()
  sigma_group_mu  ~ student_t(4, 0, 5);

}
generated quantities {
  real y_rep[N];
  for (n in 1:N) {
    y_rep[n] = student_t_rng(nu, mu[n], phi);
  }
}

However, I now want to generate predictions from out-of-sample data. To do this, I have created a second Stan model that only contains the data and generated quantities block as recommended elsewhere. See below for code:

// hierarchical model with non-centered parameterization of mu
data {
  int<lower=1> N; //Number of observations
  int<lower=0> J; //Number of groups
  int<lower=0> K; //Number of group predictors
  int<lower=0> O; //Number of observation predictor
  int<lower=0, upper=J> group_idx[N]; // Maps each observation to each group
  real response[N]; // The outcome of the data generation process
  matrix[N,O] observation_data; //covariate data of the observations
  matrix[J,K] group_data; //covariate data of the groups
  int<lower=0> max_group_idx; //Highest

  int N_samples; // number of samples drawn from posterior distributions
  real nu[N_samples]; // nu parameter from student t likelihood
  real phi[N_samples]; // standard deviation parameter from student t likelihood
  matrix[N_samples, O] beta; // covariate matrix draws for observations
  matrix[N_samples, K] gamma; // covariate matrix draws for groups
  matrix[N_samples, max_group_idx] group_mu; //posterior draws of the group mu parameter
  real sigma_group_mu[N_samples]; // posterior draws of the hierarchal variance parameter.
}
parameters {
}
model {
}
generated quantities {
  real y_pred[N, N_samples];
  row_vector[N_samples] group_mu_rep;
  matrix[N, N_samples] mu;
  vector[O] observation_data_vector;

  for(n in 1:N) {
    if (group_idx[n] > max_group_idx) {
      for (s in 1:N_samples) {
         group_mu_rep[s] = gamma[s] * to_vector(group_data[group_idx[n]]) + sigma_group_mu[s] * student_t_rng(4, 0, 1);
      }
    }
    else {
      group_mu_rep = to_row_vector(group_mu[,group_idx[n]]);
    }
    mu[n] = group_mu_rep + to_row_vector(beta * to_vector(observation_data[n]));

    y_pred[n] = student_t_rng(nu, mu[n], phi);

  }
}

The problem now is that it takes an extremely long time to compile the model and draw samples from the predicted distribution.

It also feels kind of a hack to specify iter=1 when sampling.

pred_model.sampling(data=model_data, chains=1, iter=1, algorithm="Fixed_param")

Are there any recommendations for specifying the generated_quantities block in the prediction or how to speed this operation up?

Just for comparison, I can generate forward samples with python code in less than a second.

def generate_replicated_data(model_data, posterior_distribution):
    """Given the samples from the posterior_distribution and the model_data, create
    the forward sampled likelihood data given the model."""
    max_group_idx = model_data['max_group_idx']
    group_idx = model_data['group_idx']
    
    group_mu = posterior_distribution['group_mu']
    sigma_group_mu = posterior_distribution['sigma_group_mu']
    
    nu = posterior_distribution['nu']
    phi = posterior_distribution['phi']
    
    beta = posterior_distribution['beta']
    gamma = posterior_distribution['gamma']

    observation_data = model_data['observation_data']
    group_data = model_data['group_data']
    
    # Calculate the group mean for the missing data, i.e. draw from prior distribution for random effect, but
    # use the covariate coefficients.
    missing_groups = group_data[max_group_idx:,]

    missing_group_mus = []
    for missing_group in missing_groups:
        missing_group_mu = np.dot(gamma, missing_group.T) + scipy.stats.t.rvs(df=4, loc=0, scale=sigma_group_mu)
        missing_group_mus.append(missing_group_mu)
    
    missing_group_mus = np.array(missing_group_mus).T
    
    # Combine the known group means with the missing group means.
    if missing_group_mus.size > 0:
        group_mu = np.concatenate([group_mu, missing_group_mus], axis=1)
    
    assert group_mu.shape[1] == group_data.shape[0]
    
    mu = group_mu[:,group_idx-1] + np.dot(beta, observation_data.T)

    y = []
    for i, replication in enumerate(mu):
        y.append(scipy.stats.t.rvs(df=nu[i], loc=replication, scale=phi[i]))
    
    y = np.array(y)
    
    return y
1 Like

Sorry, your question fell through a bit.

In my opinion, generating in host language is a good way to generate predictions (e.g. it is used by the brms package), the main advantage is that it gives you the most flexibility. In practice you often can’t really reuse much code when generating predictions in Stan, although I admit there is some advantage in using the same language to generate the predictions (e.g. you don’t have to worry if Python has the same parametrizations of the distributions as your model does) .

If you want to use Stan for predictions, you might want to supply the data for prediction already when fitting the model - you would have something like N_pred, group_idx_pred, observation_data_pred, group_data_pred in the data and generate one sample in generated quantities for all the observations for prediction. I am not sure if it is already supported by PyStan, but there was some work to let you rerun just the generated quantities part of a model, so you should (possibly in some future version) be able to fit the model once and then predict multiple times even when using just a single Stan model.

Personally, I think having a separate Stan model just for predictions gets you the worst of both worlds (with the only exception that you can use the same prediction code for multiple host languages).

Does that make sense?

Yes, that makes sense. Thanks!

1 Like

are you running the standalone generated quantities method?
details are here: https://mc-stan.org/docs/2_24/cmdstan-guide/gc-intro.html
and here: https://mc-stan.org/docs/2_24/cmdstan-guide/standalone-generate-quantities.html

it’s available in CmdStanPy - https://cmdstanpy.readthedocs.io/en/latest/generate_quantities.html

superficial reaction looking at the above code is maybe there’s a more efficient way to code this loop.

1 Like

you need a PyStan later than Stan 2.20 to get faster compilation. not sure what’s available.
CmdStanPy will run with the latest Stan - 2.24.0 should be out this week - another reason to use CmdStanPy.

1 Like

Thank you! I think I figured this out by installing the latest dev version of pystan.