Appropriate evaluation criteria for out-of-sample predictive performance

Hi everyone. I’m currently working on a forecasting study on financial time series. The Stan platform (and PyStan in particular) has proven extremely helpful to me so far. My study uses a moving-window approach in which Stan fits the model and produces a 1-step-ahead prediction at each iteration, forecasting a different time period and using a slightly different training set each time.

I am now looking for an appropriate way to assess the accuracy of the forecasts. As my knowledge in this area is somewhat lacking, I was hoping some of you might have insights or advice on which evaluation criteria to use and how to implement them correctly.

Right now, the generated quantities block for my most basic model looks as follows:

generated quantities {
    real rpred   = normal_rng(alpha + Xpred * Beta, sqrt(sigma2));
    real rpredll = normal_lpdf(ropred | alpha + Xpred * Beta, sqrt(sigma2));
}

So, at each iteration, the model produces a draw from the posterior predictive density (rpred) and the log predictive density evaluated at the realised value (rpredll); my understanding is that the latter is needed for most assessments of the accuracy of the forecast density. Hence, after running through all moving windows, I end up with samples of both quantities for each time period.
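For concreteness, here is a sketch of what collecting those draws across windows might look like. This is plain NumPy with a synthetic loop standing in for the actual Stan fits; `n_windows`, `n_draws`, and the simulated values are all placeholders, not part of my real setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_windows, n_draws = 50, 4000  # hypothetical counts

rpred_draws = np.empty((n_windows, n_draws))    # posterior predictive draws
rpredll_draws = np.empty((n_windows, n_draws))  # log predictive density draws

for t in range(n_windows):
    # Stand-in for one Stan fit: in practice these two rows would be
    # extracted from the fitted model for window t (e.g. via PyStan).
    mu_draws = rng.normal(0.0, 0.1, size=n_draws)  # synthetic posterior draws
    rpred_draws[t] = rng.normal(mu_draws, 1.0)
    y_t = 0.0                                      # synthetic realised value
    rpredll_draws[t] = -0.5 * np.log(2 * np.pi) - 0.5 * (y_t - mu_draws) ** 2
```

The point is just the shape of the result: one row of posterior draws per window, for each of the two generated quantities.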

It’s pretty straightforward to use the mean of the posterior predictive density as a point forecast at each iteration, and to aggregate point-forecast accuracy across the windows with something like RMSE or out-of-sample R-squared. I’m not so sure, however, what I should use to evaluate the accuracy of the forecast density. I’m thinking of using something like WAIC with the predictive log likelihood values as input, but I have so far not found an example that applies this measure to a case like mine (moving windows, assessing only out-of-sample predictions), so I’m not sure how appropriate it is.

I also came across the following paper, which lays out a method (see section 3) to calculate Bayes factors by means of the predictive likelihood. In this case, am I correct to think that I could compute equation (7) from the paper by taking the mean of the exponentiated predictive log likelihood draws (rpredll) at each iteration, and then compute equation (8) by summing the logarithm of these values across the iterations? On the other hand, my understanding is that the Bayes factor does not penalise the performance value as the variance of the log likelihood (and hence the parameter uncertainty) increases, so it might be less appropriate than WAIC. But, like I said, I’m not sure what the correct implementation of WAIC looks like in my case.
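If equations (7) and (8) do work the way I describe above, the computation could be sketched as follows (pure NumPy; `rpredll_draws` is a hypothetical stand-in for the per-window rpredll draws, and the max-shift is just the usual log-sum-exp trick to avoid underflow when exponentiating log likelihoods):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in: one row of rpredll draws per moving window.
rpredll_draws = rng.normal(-1.0, 0.3, size=(50, 4000))

# Per-window predictive likelihood: the mean of exp(rpredll) over the
# posterior draws, computed on the log scale with a max-shift for
# numerical stability (log-mean-exp).
m = rpredll_draws.max(axis=1, keepdims=True)
log_pred_lik = np.log(np.exp(rpredll_draws - m).mean(axis=1)) + m.ravel()

# Aggregate: sum of the per-window log predictive likelihoods.
total_lpl = log_pred_lik.sum()
```

Exponentiating first and averaging directly would underflow badly for realistic log likelihood values, which is why the shift matters.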

I would really appreciate any advice or pointers to useful resources. Thanks!

You usually compute the log predictive densities in order to use the elpd (expected log predictive density) as the predictive performance utility; see Vehtari et al. (2017) for details.

While waiting for an answer from someone more expert than me, I suggest taking a look at this example of leave-one-out cross-validation applied to time series:


I’ll have a look at those sources, thanks!