I am using a multilevel model constructed using stan to make predictions. I want to evaluate the model by making density predictions and evaluate those using the predictive likelihood for out-of-sample prediction y_{i,t+x} (also desirable for in-sample predictions) but I do not completely know how to compute these using stan or the stanfit output.

Currently, I construct point predictions by using the posterior parameter draws from the stanfit object and compute \hat{y}_{ij,t+x} = \frac{1}{M}\sum_{m=1}^M \mathbb{E}(y_{ij,t+x}|\boldsymbol{y}), with \boldsymbol{y} the values of y_{it} for t = 1,...,t. For example, when my dependent variable y_{it} is normally distributed with mean \alpha_{i} + \beta_{i}x_{it}, I can compute \hat{y}_{ij,t+x} = \frac{1}{M}\sum_{m=1}^M \alpha^{(m)}_{i} + \beta^{(m)}_{i}x_{i,t+x} as a point prediction. However, I do not know how to compute the posterior predictive likelihood, p(y_{i,t+x}|\boldsymbol{y}) for this observation. Does someone know how to do this? Or is there some other approach to compute this? For example, I do compute the replicated (in-sample) y_{it} values using this:

generated quantities{
real y_rep[Ni];
vector[Ni] log_lik;
for (i in 1:Ni) {
real eta_n = mu[i];
y_rep[i] = normal_rng(mu, sigma_e0);
log_lik[i] = normal_lpdf(y[i]| mu, sigma_e0);
}
}

You would need to pass the out-of-sample information in through the data block and refer to it in generated quantities. There are examples of this in the manual. For example,

However, the method used in the manual you cited using

generated quantities {
vector[N2] y2;
for (n2 in 1:N2)
y2[n2] = normal_rng(f[N1 + n2], sigma);
}

is basically the same I do here

right? This is only for making the actual predictions. What I want to do is to make a density prediction and afterwards compute the predictive likelihood for each of these density predictions. In the end I also want to compute the predictive bayes factor from Geweke, J. (1994). Bayesian comparison of econometric models . Working Paper, Federal Reserve Bank of Minneapolis, Minnesota. Do you have any idea how to do this using the approach in the manual or something else?

I have one final question about the interpretation of this predictive likelihood. I take the mean of the predictive likelihood values in, letâ€™s say, pred_lik in my previous post. However, when I compare these values between two models, the model which gives better predictions in terms of RMSE and WAIC has a lower average predictive (log)likelihood, but I do not know what I am misinterpreting here. Do you have any idea what I am doing wrong?

Well, I would have just computed the ELPD. But even if you get all the code right, there is a chance with any finite number of observations you are evaluating things on that the ranking of actual log predictive density could be different than the ranking by expected log predictive density.

You defined pred_lik[n2] = normal_lpdf(f[N1+n2], sigma), that is itâ€™s log-density. Just to make sure, donâ€™t take mean over the posterior draws. The mean over the posterior draws has to be made in density scale. After taking the mean over draws, you can take mean over the observations in log scale.

Could you maybe explain a little bit more what you mean by this

How can I take the mean in density scale?

Currently my stan code looks like this

generated quantities{
vector[Ntest] y_test;
vector[Ntest] log_lik_test;
for (i in 1:Ntest){
real eta_n = beta_0j[modelLookup_test[i]] + fixedEffectVars_test[i,]*beta;
y_test[i] = normal_rng(eta_n, sigma_e0[modelLookup_test[i]]);
log_lik_test[i] = normal_lpdf(y_test[i]| eta_n, sigma_e0[modelLookup_test[i]]);
}
}

Afterwards I take the mean as follows mean(colmeans(as.matrix(stan_object, pars = "log_lik_test")). Do you have any idea what I am doing wrong to compute the average predictive likelihood for the out-of-sample predictions?

@avehtari I tried the approach you mentioned before, but I still get the result that the resulting average predictive likelihood for one mode; is higher than for the other. However, using your approach in this paper â€śWAIC and cross-validation in Stanâ€ť (also using the code in this paper) I get a significantly lower WAIC and higher lpd value for the model with lower average predictive likelihood as calculated using this:

Do you have any idea what the reason for this could be? Your help is really appreciated!

For the average predictive likelihood I am using this stan code part:

generated quantities{
vector[Ntest] y_test;
real log_p_new;
vector[Ntest] log_p_news;
for (i in 1:Ntest){
real eta_n = beta_0j[modelLookup_test[i]] + fixedEffectVars_test[i,]*beta;
y_test[i] = normal_rng(eta_n, sigma_e0[modelLookup_test[i]]);
log_p_news[i] = normal_lpdf(activations_test[i]| eta_n, sigma_e0[modelLookup_test[i]]);
log_p_new = sum(log_p_news);
}
}

where the [modelLookup_test[i]] part selects the appropriate parameter (since it is a multilevel model)

and afterwards I use this code in R

u = as.matrix(stan_object, pars = "log_p_new")
max_u <- max(u);
a <- 0;
for (n in 1:length(u)) {
a <- a + exp(u[n] - max_u);
}
lp <- -log(length(u)) + max_u + log(a);

and use lp (from the tutorial you mentioned before)

To compute the WAIC and lpd values I use this stan code

generated quantities{
vector[Ni] log_lik;
for (i in 1:Ni) {
real eta_n = mu[i];
log_lik[i] = normal_lpdf(y[i]| eta_n, sigma_e0[modelLookup[i]]);
}

and I use log_lik as input for the code in your paper.

By summing log_p_news, you are computing the joint log density instead of pointwise log densities, and those values are not comparable and you can have a very different value compared to waic which is based on pointwise predictions.

Yes for the point 1. This avoids computing and overwriting the previous sum Ntest times.

Yes. the results should not change.

See also my point 2. You are now comparing \log p(y_{\mathrm{test},1},\ldots,y_{\mathrm{test},N_{\mathrm{test}}}|y_\mathrm{train}) to \sum_{n=1}^{N_\mathrm{train}} \log p(y_n | y_{n-1}). These are not comparable as the first one is log of joint predictive density and the other one is sum of log pointwise predictive densities. You can compute log pointwise predictive densities also for y_{\mathrm{test},1},\ldots,y_{\mathrm{test},N_\mathrm{test}} from log_p_news (and you donâ€™t need that sum at all).

Only thing I do not have an explanation for is that the log pointwise predictive densities for out-of-sample values are higher for a model which has a considerable higher WAIC value and also performs worse in terms of RMSE and MAE (also out-of-sample) than my other model. Is this maybe caused because the better performing model in terms of RMSE and MAE has higher uncertainty for the out-of-sample predictions?

Difficult to say without seeing the code for all these computations (as there might be other mistakes) or data. Do you get Pareto-k warnings from loo()? loo has better diagnostics than waic, so itâ€™s better to use loo.

@avehtari sorry for the late reply, but I am still struggling with this. Using the loo package I get the warning: (3.9%) p_waic estimates greater than 0.4. We recommend trying loo instead. Are this the Pareto-k warnings you mean? The resulting WAIC value however is exactly the same as using the code in your paper. Do you maybe know how to interpret this warning, as I cannot really interpret it as whether I can trust my estimates or not

Using loo I also get the same results that the in-sample WAIC for one model is better than the other which also produces better predictions in terms of RMSE and MAE (both in- and out-of-sample). However, the out-of-sample pointwise predictive likelihoods indicate the opposite, but I get the following warnings: (100.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.. So, does this mean I cannot trust the results?