Hello there!
I’ve built the following model with 4000 iterations x 2 chains and the diagnostics seem fine:
data {
  int<lower=1> N;               // number of observations
  vector[N] Y;                  // response variable
  int<lower=1> K;               // number of coefficients
  matrix[N, K] X;               // design matrix
  vector[N] a;                  // hyperparameter of Y ~ beta(a, beta)
  // priors
  real shape_rate_prior;
  real shape_shape_prior;
  real rho_nu;
  real rho_mu;
  real rho_sigma;
}
parameters {
  vector[K] rho;
  real<lower=0> shape;          // shape parameter
  vector<lower=0.01>[N] beta;   // hyperparameter of Y ~ beta(a, beta)
}
model {
  vector[N] log_mu = X * rho;
  vector[N] mu = exp(log_mu);
  vector[N] rate;
  for (n in 1:N) {
    rate[n] = shape / mu[n];
  }
  // priors
  target += student_t_lpdf(rho | rho_nu, rho_mu, rho_sigma);
  target += gamma_lpdf(shape | shape_shape_prior, shape_rate_prior);
  // hyperprior
  target += gamma_lpdf(beta | shape, rate);
  // likelihood
  target += beta_lpdf(Y | a, beta);
}
I am interested in predictions, so I extracted the posterior draws and reconstructed my model in R (as @bgoodri suggests in this post).
As you can see, Y \sim beta(a, beta), with a being an observed variable and beta being a parameter of the model. I computed both y_hat and y_rep, where
- y_hat takes a shortcut by drawing directly from beta(a, beta), using the beta draws generated while sampling the model, and
- y_rep is computed by reconstructing the entire model in R, i.e. the only input I take from Stan is the parameter \rho, which is what I intend to do to get out-of-sample predictions later on (y_hat was more of a reality check); see the sketch after this list.
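In case it helps, here is a minimal sketch of roughly how the two quantities can be computed in R. It is not my exact code: the object names (fit, post, y_hat, y_rep) are placeholders, and for the gamma step I also pull the shape draws from the posterior in this sketch.

  # Placeholder names: `fit` is the stanfit object; X, a, Y come from the data block
  post <- rstan::extract(fit)      # posterior draws
  S    <- nrow(post$rho)           # number of posterior draws
  N    <- length(a)

  # y_hat: shortcut using the beta draws produced while sampling the model
  y_hat <- matrix(NA_real_, S, N)
  for (s in 1:S) {
    y_hat[s, ] <- rbeta(N, a, post$beta[s, ])
  }

  # y_rep: full reconstruction outside Stan
  # (note: this sketch also uses the posterior draws of shape)
  y_rep <- matrix(NA_real_, S, N)
  for (s in 1:S) {
    mu       <- exp(as.vector(X %*% post$rho[s, ]))   # same link as in the model block
    rate     <- post$shape[s] / mu
    beta_rep <- rgamma(N, shape = post$shape[s], rate = rate)
    y_rep[s, ] <- rbeta(N, a, beta_rep)
  }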
By the way, I ignored the lp__ parameter completely. Should I not do this?
The ppc_dens_overlay outputs for these two look like this:
The distributions of y_hat (“inside of Stan”) and y_rep (“outside of Stan”) look fairly similar, but y_rep looks slightly worse. The MADs are also considerably different: 0.0104 for y_hat versus 0.0576 for y_rep (I used the median as the point estimate).
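For reference, this is roughly how I compute the MAD; I am assuming here it means the mean absolute deviation from the observed Y, with the column-wise posterior median as the point estimate, so the exact details may differ.

  # Point estimates: column-wise posterior medians of the predictive draws
  point_hat <- apply(y_hat, 2, median)
  point_rep <- apply(y_rep, 2, median)

  # Mean absolute deviation from the observed response
  mad_hat <- mean(abs(Y - point_hat))
  mad_rep <- mean(abs(Y - point_rep))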
As my writing might have suggested, I am not an expert and I am not sure why these differences occur. I was wondering whether they are caused by
- selection bias in the random number generation: perhaps Stan tends to keep draws that happen to maximize the likelihood of the response variable, so the draws obtained while fitting the model happen to fit well (admittedly, this might be a silly guess, but I thought I’d ask),
- a mistake of my own in reproducing the model that I haven’t found yet, or
- something else I might be forgetting.
My questions here are:
- what causes these differences?
- what would you suggest I do in this particular case?
(Of course, any additional feedback on the model would be much appreciated)
Thanks a lot for any help, and please let me know if I haven’t expressed myself clearly enough!