Mismatch when predicting inside and outside of Stan

Hello there!

I’ve fit the following model with 4000 iterations × 2 chains, and the diagnostics seem fine:

data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of coefficients
  matrix[N, K] X;  // design matrix
  vector[N] a;  // hyperparameter of Y ~ beta(a, beta)
  // priors
  real shape_rate_prior;
  real shape_shape_prior;
  real rho_nu;
  real rho_mu;
  real rho_sigma;
}
parameters {
  vector[K] rho;  
  real<lower=0> shape;  // shape parameter
  vector<lower=0.01>[N] beta;  // hyperparameter of Y ~ beta(a, beta)
}
model {
  vector[N] log_mu = X * rho;
  vector[N] mu = exp(log_mu);
  vector[N] rate;
  for (n in 1:N) {
    rate[n] = shape / mu[n];
  }
  // priors
  target += student_t_lpdf(rho | rho_nu, rho_mu, rho_sigma);
  target += gamma_lpdf(shape | shape_shape_prior, shape_rate_prior);
  //hyperprior
  target += gamma_lpdf(beta | shape, rate);
  // likelihood
  target += beta_lpdf(Y | a, beta);
}
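
For reference, this is roughly how I fit it (a minimal sketch assuming rstan; fit_bg and the file name bg_model.stan are just the names I use here):

library(rstan)

# Fit with 4000 iterations x 2 chains, as mentioned above.
fit_bg <- stan(
  file = "bg_model.stan",  # the model shown above
  data = bg_data,          # list with N, Y, K, X, a and the prior constants
  iter = 4000,
  chains = 2
)

# Extract the posterior draws used in the rest of this post.
fit_bg_samples <- extract(fit_bg)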

I am interested in predictions, so I extracted the posteriors and reconstructed my model in R (as @bgoodri suggests in this post).

As you can see, Y \sim beta(a, beta), with a being an observed variable and beta a parameter of the model. I computed both y_hat and y_rep, where

  • y_hat takes a shortcut by drawing directly from a and the beta parameter generated when sampling the model, and
  • y_rep is computed by reconstructing the entire model in R, i.e. the only inputs I use from Stan are the parameters \rho and shape, which is what I intend to do to get out-of-sample predictions later on. (y_hat was more of a reality check.)
    BTW, I ignored lp__ completely. Should I not do this?

The ppc_dens_overlay outputs for these two look like this:
[Figure: density overlay for Y recovery inside Stan (bg_recovery_Y_inStan)]
[Figure: density overlay for Y recovery outside Stan (bg_recovery_Y_outStan)]

The distributions of y_hat (“inside of Stan”) and y_rep (“outside of Stan”) look fairly similar, but y_rep looks slightly worse. The MADs are also considerably different: 0.0104 for y_hat and 0.0576 for y_rep. (I used the median as the point estimate; see the sketch below.)
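
For concreteness, a sketch of that computation, assuming “MAD” means the mean absolute deviation of the per-observation point estimate from the observed Y; y_hat and y_rep are draws × observations matrices (built as in the R code further down the thread):

# Median draw per observation as the point estimate, then mean absolute deviation.
point_hat <- apply(y_hat, 2, median)
point_rep <- apply(y_rep, 2, median)
mean(abs(bg_data$Y - point_hat))  # 0.0104
mean(abs(bg_data$Y - point_rep))  # 0.0576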

As my writing might have suggested, I am not an expert, and I am not sure why these differences occur. I was wondering whether they are caused by

  1. selection bias in the random number generators: perhaps Stan tends to select draws that happen to maximize the likelihood of the response variable, meaning the draws obtained while fitting the model happen to fit well. (Admittedly, this might be a silly guess, but I thought I’d ask.)
  2. a mistake of mine somewhere in reproducing the model, which I haven’t found so far, or
  3. something else I might be forgetting.

My questions here are:

  • what causes these differences?
  • what would you suggest I do in this particular case?
    (Of course, any additional feedback on the model would be much appreciated.)

Thanks a lot for any help and please let me know in case I didn’t express myself well enough!

Hi, as a first step, make sure that Stan and R use the same function (i.e. that the _lpdf gives the same output for the same parameters in both).

Thanks for the quick reply! I am not entirely sure what you mean by function in this context. Do you mean the function to compute y_hat and y_rep? If so, my calculations in R should mimic the Stan model.

Here are the important lines of the code:
(fit_bg_samples is where I stored the draws from Stan, and bg_data contains the data used to run the model.)

rho_hat   <- fit_bg_samples$rho    # draws x K
shape_hat <- fit_bg_samples$shape  # vector of draws
N_draws   <- nrow(rho_hat)

log_mu_hat <- t(bg_data$X %*% t(rho_hat))  # draws x N
mu_hat     <- exp(log_mu_hat)

rate_hat <- beta_rep <- y_hat <- y_rep <- matrix(NA, N_draws, bg_data$N)

for (i in 1:bg_data$N) {
  rate_hat[, i] <- shape_hat / mu_hat[, i]
}

for (i in 1:bg_data$N) {
  # shortcut: beta taken straight from the posterior draws
  y_hat[, i] <- rbeta(N_draws, bg_data$a[i], fit_bg_samples$beta[, i])
  # full reconstruction: redraw beta from the gamma, then draw Y
  beta_rep[, i] <- rgamma(N_draws, shape = shape_hat, rate = rate_hat[, i])
  y_rep[, i] <- rbeta(N_draws, bg_data$a[i], beta_rep[, i])
}

I’ve even tried avoiding the matrix multiplications by doing all the operations element by element (as sketched below) to make sure I wasn’t making any mistakes there, but I can’t find an error that way either. What else could explain these large differences?
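
What I mean by element level, schematically (same objects as above):

# Element-by-element version of the same computation, avoiding %*% entirely.
for (d in 1:N_draws) {
  for (i in 1:bg_data$N) {
    log_mu_hat[d, i] <- sum(bg_data$X[i, ] * rho_hat[d, ])  # dot product by hand
    mu_hat[d, i]     <- exp(log_mu_hat[d, i])
    rate_hat[d, i]   <- shape_hat[d] / mu_hat[d, i]
  }
}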

If you do those calculations both in generated quantities and in R, do they give you the same answer? (rng draws are hard to compare, but lpdf values are easy.)

This is just to make sure that the parametrisation in R is the same as in Stan (e.g. use of variance vs standard deviation).
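
For example, a quick illustrative check in R (made-up numbers, not your model):

# Stan's gamma_lpdf(x | shape, rate) should match dgamma(x, shape, rate, log = TRUE);
# note that dgamma()'s third argument is rate, not scale.
dgamma(2.5, shape = 3, rate = 1.5, log = TRUE)

# Stan's beta_lpdf(y | a, b) should match dbeta(y, a, b, log = TRUE).
dbeta(0.4, 2, 5, log = TRUE)

# Stan's student_t_lpdf(x | nu, mu, sigma) has no location-scale counterpart in
# base R; dt() is standardised, so shift and rescale by hand:
nu <- 4; mu <- 0; sigma <- 2; x <- 1
dt((x - mu) / sigma, df = nu, log = TRUE) - log(sigma)

If these match the values printed from a generated quantities block, the parametrisations agree.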

Sorry, totally overlooked this and thanks a lot for the advice.

I changed all the _lpdf statements to the sampling-statement format (e.g. Y ~ beta(...) as opposed to target += beta_lpdf(Y | ...)) and added the following generated quantities block, using _rng functions:

generated quantities {
  vector[N] log_mu_hat;
  vector[N] mu_hat;
  vector[N] rate_hat;
  vector[N] Y_rn;
  vector[N] beta_rn;
  vector[N] Y_rn_rn;

  for (n in 1:N) {
    // drawing Y from the beta parameter directly
    Y_rn[n] = beta_rng(a[n], beta[n]);

    // drawing Y from fresh draws of the beta parameter
    log_mu_hat[n] = X[n, ] * rho;
    mu_hat[n] = exp(log_mu_hat[n]);
    rate_hat[n] = shape / mu_hat[n];
    beta_rn[n] = gamma_rng(shape, rate_hat[n]);
    Y_rn_rn[n] = beta_rng(a[n], beta_rn[n]);
  }
}

As you can see, I calculate Y in two different ways in the generated quantities. Y_rn uses the beta parameter straight from the estimation; Y_rn_rn replicates the entire parametrisation, using as input the rho and shape parameters from the estimation.

Y_rn_rn gives me results similar to those I obtain by performing all the calculations outside of Stan. That means the problem seems to lie in drawing the hyperparameter beta from the gamma distribution: I get more heterogeneous distributions when I draw beta myself (both in the generated quantities in Stan and in R) than Stan produces in the estimation phase.

The plot below illustrates this by overlapping 200 random draws from each of the three methods. “Stan draws” refers to the generated quantities in Stan, and it looks very similar to the draws generated in R; “Stan estimation” refers to the parameter coming directly from the estimation process, and it produces less heterogeneous samples. Could this be the problem? If yes, is there any way to fix it?
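
(Roughly how I build that overlay, assuming an rstan fit object I call fit_bg and the beta_rep matrix from my R code above; I collapse it to a single observation here for brevity:)

# Draws of beta for observation 1 from the three sources.
beta_est <- as.matrix(fit_bg, pars = "beta")     # "Stan estimation"
beta_gq  <- as.matrix(fit_bg, pars = "beta_rn")  # "Stan draws" (generated quantities)
idx <- sample(nrow(beta_est), 200)               # 200 random draws each

plot(density(beta_est[idx, 1]), main = "beta[1] draws")
lines(density(beta_gq[idx, 1]), col = "red")     # generated quantities
lines(density(beta_rep[idx, 1]), col = "blue")   # reconstructed in R
legend("topright", c("Stan estimation", "Stan draws", "R draws"),
       col = c("black", "red", "blue"), lty = 1)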

In addition, you mentioned that _rng draws are hard to compare, but I wasn’t sure how to get around that. How do I run predictions while staying consistent with the target += distribution_lpdf(...) format?

Thanks a lot for your help once again!