So, a bit of a non-standard request here. The situation builds off the discussion here (and I’m hoping some of the same participants can weigh in here — @avehtari @yuling 🙏)

We are running multiple time-series models that we would like to combine using a Bayesian stacking approach, i.e., weighting the models by their out-of-sample predictive performance.

However, since these are time-series models, the out-of-sample evaluation cannot be leave-one-out; it has to be leave-future-out. The crux of the matter is that we would like to stack K models together based on their performance predicting the last N days of data.

The core problem is that the weights are computed from models that have not seen the last N days of data, but we want to use them to stack models that *have* seen the last N days of data.

What we had been doing is running two sets of K models: a set of K that sees the full dataset (`full`), and a set of K that sees everything up to the last N days (`holdout`). We then compute stacking weights based on the `holdout` set and use those weights to stack the `full` set. The implicit assumption here is that the posterior obtained by the `full` set is the same posterior that the `holdout` set would have obtained had it seen the full dataset.
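To make the workflow concrete, here is a minimal sketch of the weight-computation step in Python. The toy data, array shapes, and the `stacking_weights` helper are all hypothetical (in practice one would use something like `loo::stacking_weights` in R on the holdout fits); this just shows the optimization being described: maximize the summed log of the weighted predictive densities on the held-out N days, with the weights constrained to the simplex.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax


def stacking_weights(lpd):
    """Compute stacking weights from a (N_heldout, K) matrix of pointwise
    log predictive densities, one column per `holdout` model.

    Maximizes sum_n log( sum_k w_k * p_k(y_n) ) over the simplex,
    parameterized via softmax with the last logit pinned to 0.
    """
    N, K = lpd.shape

    def neg_objective(z):
        w = softmax(np.append(z, 0.0))            # point on the simplex
        return -np.sum(logsumexp(lpd + np.log(w), axis=1))

    res = minimize(neg_objective, np.zeros(K - 1), method="BFGS")
    return softmax(np.append(res.x, 0.0))


# Hypothetical toy data: model 0 predicts the holdout window better
# than model 1, so it should receive most of the weight.
rng = np.random.default_rng(0)
lpd = np.column_stack([
    rng.normal(-1.0, 0.1, 50),   # model 0, 50 held-out days
    rng.normal(-2.0, 0.1, 50),   # model 1
])
w = stacking_weights(lpd)
```

The resulting `w` would then be applied to the predictive draws of the corresponding `full` models, which is exactly where the assumption about matching posteriors enters.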

This seemed reasonable. *However*, the issue is that our models are thoroughly in the regime explored in this paper. That is, at least sometimes, our chains saturate the treedepth, explore only part of the posterior distribution, don't mix, don't reach Rhat = 1, etc. Crucially, this means that the assumption laid out above (that the posterior obtained by `full` is the same as would have been obtained by `holdout`) no longer holds, because the region of the posterior explored by any given chain is somewhat random.

So, due to the non-mixing of the chains, we would like to apply chain-level stacking as described in the paper linked to in the previous paragraph. But I don't see a way to create stacking weights by evaluating the N-day holdout performance of a chain and then use those weights on chains that have seen the last N days. They would be different chains.
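For chain-level stacking, each chain becomes its own candidate predictive distribution, so the pointwise log predictive densities need to be assembled per chain rather than per model. A minimal sketch, with hypothetical array shapes and simulated data standing in for a real `log_lik` extraction from a fitted model:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical stand-in for a pointwise log-likelihood array extracted
# from one fitted model: (draws, chains, N_heldout) = (1000, 4, 30).
rng = np.random.default_rng(1)
log_lik = rng.normal(-1.5, 0.3, size=(1000, 4, 30))

# Per-chain pointwise log predictive density: average the predictive
# density over the draws within each chain (log-mean-exp over draws).
S = log_lik.shape[0]
lpd_per_chain = logsumexp(log_lik, axis=0) - np.log(S)   # (chains, N_heldout)

# For chain-level stacking, each chain is a column, i.e. a candidate
# "model" in the stacking optimization.
lpd = lpd_per_chain.T                                    # (N_heldout, chains)
```

This makes the mismatch in the question visible: the columns here come from the `holdout` chains, and there is no principled mapping from those columns to the (different) chains of the `full` fits.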

I had thought that the leave-future-out vignette was fit for purpose, but unfortunately it does not seem to be. The point of that vignette is to approximate what a model fit on days 1 through L would have predicted on days (L+J) to (L+J+M); i.e., approximating the M-step-ahead performance of a model that is actually M+J steps behind. That is sort of the reverse of the challenge here, which is to approximate the M-step-ahead performance of a model that has actually already seen those M steps.

I am curious if anyone has any thoughts or ideas here!