So, a bit of a non-standard request here. The situation builds off the discussion here (I’m hoping some of the same participants can weigh in: @avehtari @yuling 🙏)
We are running multiple time-series models that we would like to combine using Bayesian stacking, i.e., weighting the models by their out-of-sample predictive performance.
However, since these are time-series models, the out-of-sample evaluation cannot be leave-one-out; it has to be leave-future-out. The crux of the matter is that we would like to stack K models together based on their performance predicting the last N days of data.
The core problem is that the weights are calculated from models that have not seen the last N days of data, but we want to use those weights to stack models that *have* seen the last N days.
What we had been doing is running two sets of K models: a set of K that sees the full dataset (`full`), and a set of K that sees everything up to the last N days (`holdout`). We then compute stacking weights based on the `holdout` set and use those weights to stack the `full` set. The implicit assumption here is that the posterior obtained by the `full` fit is the same posterior the `holdout` fit would have produced had it seen the full dataset.
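For concreteness, the weight computation on the `holdout` fits can be sketched as maximizing the log score of the model mixture over the last N days. This is a minimal illustration, not our actual code; the function name and the synthetic log predictive densities are hypothetical:

```python
import numpy as np

def stacking_weights(lpd, n_iter=2000, lr=0.05):
    """Stacking weights from held-out log predictive densities.

    lpd: (n_holdout_days, K) matrix, where lpd[i, k] is the log predictive
    density of holdout day i under model k's `holdout` fit. Maximizes the
    mixture log score sum_i log(sum_k w_k * exp(lpd[i, k])) over the simplex,
    via a softmax parameterization and plain gradient ascent.
    """
    n, K = lpd.shape
    a = np.zeros(K)                                    # unconstrained parameters
    for _ in range(n_iter):
        w = np.exp(a - a.max()); w /= w.sum()          # softmax -> simplex
        m = lpd + np.log(w)                            # (n, K)
        mmax = m.max(axis=1, keepdims=True)
        mix = mmax[:, 0] + np.log(np.exp(m - mmax).sum(axis=1))  # log mixture density
        g = np.exp(m - mix[:, None]).sum(axis=0) / w   # dL/dw_k = sum_i p_k(y_i)/mix_i
        a += lr * w * (g - w @ g) / n                  # chain rule through softmax
    return w

# synthetic example: model 0 predicts every held-out day much better than model 1
lpd = np.column_stack([np.full(30, -1.0), np.full(30, -5.0)])
w = stacking_weights(lpd)   # these weights are then applied to the `full`-fit draws
```

The weights land almost entirely on model 0, as expected when one model dominates on every held-out day.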
This seemed reasonable. However, the issue is that our models are thoroughly in the regime explored in this paper. That is, at least sometimes, our chains saturate the treedepth, explore only part of the posterior distribution, don’t mix, we don’t obtain Rhat = 1, etc. Crucially, this means that the assumption laid out above (that the posterior obtained by `full` is the same as would have been obtained by `holdout`) no longer holds, because the region of the posterior explored by any given chain is somewhat random.
So, due to the non-mixing of the chains, we would like to apply chain-level stacking as described in the paper linked in the previous paragraph. But I don’t see a way to create stacking weights by evaluating the N-day holdout performance of a chain and then use those weights on chains that have seen the last N days; they would be different chains.
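To make the gap concrete, chain-level stacking on the `holdout` fits would look something like the sketch below: treat every chain, across all K models, as its own column of held-out log predictive densities and optimize one big simplex of weights. The data and dimensions here are made up for illustration; the point is that the resulting weights attach to *these* specific chains, and there is no obvious way to transfer them to the different chains of the `full` fits:

```python
import numpy as np
from scipy.optimize import minimize

# chain_lpd: (n_holdout_days, total_chains) log predictive densities, one
# column per *chain* rather than per model (synthetic, hypothetical numbers)
rng = np.random.default_rng(0)
chain_lpd = np.column_stack([
    rng.normal(-1.0, 0.1, 30),   # a chain that found a good region
    rng.normal(-4.0, 0.1, 30),   # a chain stuck in a poor region
    rng.normal(-1.1, 0.1, 30),   # another reasonable chain
])

def neg_log_score(a):
    """Negative mixture log score, weights parameterized via softmax."""
    w = np.exp(a - a.max()); w = w / w.sum()
    m = chain_lpd + np.log(w)
    mmax = m.max(axis=1, keepdims=True)
    return -np.sum(mmax[:, 0] + np.log(np.exp(m - mmax).sum(axis=1)))

res = minimize(neg_log_score, np.zeros(chain_lpd.shape[1]))
w = np.exp(res.x - res.x.max()); w /= w.sum()   # per-chain stacking weights
```

The stuck chain gets essentially zero weight, which is exactly what chain-level stacking is supposed to do; the unresolved question is what to multiply those weights against when the `full` refit produces a different, randomly-explored set of chains.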
I had thought that the leave-future-out vignette was fit for purpose, but unfortunately it does not seem to be. The point of that vignette generally seems to be to approximate what a model fit on days 1 through L would have predicted on days (L+J) to (L+J+M); e.g., approximating the M-step-ahead performance of the model when it is actually M+J steps behind. That is roughly the reverse of the challenge here, which is to approximate the M-step-ahead performance of a model that has actually seen those M steps.
I am curious if anyone has any thoughts or ideas here!