Approximating leave-future-out performance when the future is left in

So, a bit of a non-standard request here. The situation builds on the discussion here (and I’m hoping some of the same participants can weigh in — @avehtari @yuling 🙏)

We are running multiple time-series models that we would like to combine using Bayesian stacking. Following the Bayesian stacking approach, we would like to weight the models based on their out-of-sample predictive performance.

However, since these are time-series models, the out-of-sample evaluation cannot be leave-one-out, but has to be leave-future-out. The crux of the matter is that we would like to stack K models together based on their performance predicting the last N days of data.

The core problem is that in calculating the weights, we are using models that have not seen the last N days of data. But we want to stack together models that have seen the last N days of data.

What we had been doing is running 2 sets of K models: a set of K that sees the full dataset (full), and a set of K that sees everything except the last N days (holdout). We then compute stacking weights based on the holdout set and use those weights to stack the full set. The implicit assumption here is that the posterior obtained by the full set is the same posterior that would have been obtained by the holdout set had it seen the full dataset.
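For concreteness, here is a minimal sketch of that workflow. The log-predictive-density values are fake stand-ins for what our actual holdout fits produce, and `stacking_weights` is just a generic optimizer that maximizes the holdout log score over the simplex:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def stacking_weights(lpd_holdout):
    """Maximize the total log score of the weighted model mixture.

    lpd_holdout: (N, K) array; entry [i, k] is model k's log predictive
    density for holdout day i (averaged over that model's posterior draws).
    """
    K = lpd_holdout.shape[1]

    def neg_log_score(z):
        w = softmax(np.append(z, 0.0))  # map unconstrained z to simplex weights
        return -np.sum(logsumexp(lpd_holdout + np.log(w), axis=1))

    z_hat = minimize(neg_log_score, np.zeros(K - 1)).x
    return softmax(np.append(z_hat, 0.0))

# fake stand-in: N = 30 holdout days, K = 3 models
rng = np.random.default_rng(0)
lpd_holdout = rng.normal(loc=[-1.0, -1.2, -1.5], scale=0.3, size=(30, 3))

w = stacking_weights(lpd_holdout)  # weights computed on the holdout fits...
print(np.round(w, 3))
# ...then applied to the full fits, e.g. drawing each stacked predictive
# sample from full-data model k with probability w[k]
```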

This seemed reasonable. However, the issue is that our models are thoroughly in the regime explored in this paper. That is, at least sometimes, our chains saturate the treedepth, only explore a part of the posterior distribution, don’t mix, we don’t obtain Rhat=1, etc. Crucially, this means that the assumption laid out above (that the posterior obtained by full is the same as would have been obtained by holdout) no longer holds because the region of the posterior explored by any given chain is somewhat random.

So, due to the non-mixing of the chains, we would like to apply chain-level stacking as described in the paper linked to in the previous paragraph. But I don’t see a way to create stacking weights by evaluating the N-day holdout performance of a chain, and then use those weights on chains that have seen the last N days. They would be different chains.

I had thought that the leave-future-out vignette was fit for purpose, but unfortunately it does not seem to be. The point of that vignette is to approximate what a model fit on days 1 through L would have predicted on days (L+J) to (L+J+M); i.e., it approximates the M-step-ahead performance of the model when it is actually M+J steps ahead. That is roughly the reverse of the challenge here, which is to approximate the M-step-ahead performance of the model when it has actually seen those M steps.

I am curious if anyone has any thoughts or ideas here!

There are many challenges here. I guess the main issue is how to run model evaluation if we cannot obtain exact computation (chains being mixed). Stacking chains may be an option, in which case you are combining K×M models if you have M parallel chains for each model.

But perhaps an even easier solution is to simplify/improve the model such that exact sampling is feasible. It is rarely reassuring if the final prediction to be deployed is based on corrupted sampling. Perhaps you would like to run some fake-data simulations with a small sample size to see if there are model issues.
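For instance, a minimal version of such a check might look like the following, with a toy conjugate model standing in for the real time-series model (the point is only the workflow: simulate from known parameters, fit, and see whether the fit recovers them):

```python
import numpy as np

rng = np.random.default_rng(42)

# simulate fake data from known parameters, with a small sample size
mu_true, sigma = 2.0, 1.0
y = rng.normal(mu_true, sigma, size=20)

# stand-in "fit": conjugate posterior for the mean under a N(0, 10^2) prior
prior_sd = 10.0
post_var = 1.0 / (1.0 / prior_sd**2 + len(y) / sigma**2)
post_mean = post_var * y.sum() / sigma**2
draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)

# check whether the known value is recovered; repeating this over many fake
# datasets shows whether the intervals have roughly nominal coverage
lo, hi = np.quantile(draws, [0.05, 0.95])
print(f"true mu = {mu_true:.2f}, 90% interval = ({lo:.2f}, {hi:.2f})")
```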

Thank you @yuling! A couple of things:

But perhaps an even easier solution is to simplify/improve the model such that exact sampling is feasible. It is rarely reassuring if the final prediction to be deployed is based on corrupted sampling. Perhaps you would like to run some fake-data simulations with a small sample size to see if there are model issues.

Yes this is definitely preferable and we are working on this too! And have done fake data simulation as well as other investigations to diagnose model issues.

I guess the main issue is how to run model evaluation if we cannot obtain exact computation (chains being mixed). Stacking chains may be an option, in which case you are combining K×M models if you have M parallel chains for each model.

Yes, this is the option we would like to explore. The issue is: how do we stack them? In the leave-one-out case it’s clear how to do it; you can compute stacking weights based on the approximated leave-one-out performance of the model. The chains you are stacking together are the same as the chains that you create the weights on.

In the time-series, leave-future-out case, it’s not clear (at least to me) how you stack chains. Since it’s leave-future-out, to get the expected performance of the model on held-out data, you have to actually hold the future out. But what correspondence does such a chain have with one that was run on the full dataset? Since they are different chains, and each chain is only exploring part of the posterior, weights computed on one set of chains have no bearing on another set of chains.

My main question is: can you run the chain-stacking approach on time-series data? In order to do that, it seems to me that you need to be able to compute the stacking weights on the same chains that you are stacking together (as you do in the approximate-LOO case). And in order to do that, it seems that you have to be able to approximate the leave-future-out performance of a chain when it has in fact seen the held-out data.

The whole endeavor sounds fragile to me, but is this not what you want?

  1. For a given model, run a bunch of chains.
  2. Use stacking to get an assumed-trustworthy posterior from the model, even if there’s no convergence.
  3. Show the resulting ensemble the holdout data to compute a single stacking weight for the already-stacked ensemble of chains.
  4. Fit the models to the full time series, again employing stacking as necessary to get assumed-trustworthy posteriors.
  5. Use the stacking weights from (3) to stack the posteriors from (4).

That is, if you think you can use stacking to get a trustworthy posterior even in the absence of convergence (but I guess you already know to be very careful with this), then you obtain those trustworthy posteriors first, and do your multi-model inference by stacking those assumed-trustworthy posteriors at the level of the entire posterior, rather than at the level of a single chain.
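As a rough sketch of steps 1–5, with fake log-predictive-density arrays in place of real fits and a generic log-score optimizer doing all the stacking (everything here is a hypothetical stand-in, not a recipe):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def stacking_weights(lpd):
    # lpd: (n_points, n_candidates) pointwise log predictive densities
    def neg(z):
        w = softmax(np.append(z, 0.0))
        return -np.sum(logsumexp(lpd + np.log(w), axis=1))
    z_hat = minimize(neg, np.zeros(lpd.shape[1] - 1)).x
    return softmax(np.append(z_hat, 0.0))

rng = np.random.default_rng(0)
K, M, N = 3, 8, 30  # models, chains per model, holdout days

# steps 1-2: for each holdout fit, stack its own M chains;
# holdout_lpd[k][i, c] is chain c's log predictive density for holdout day i
holdout_lpd = [rng.normal(-1.0, 0.3, size=(N, M)) for _ in range(K)]
chain_w = [stacking_weights(lpd) for lpd in holdout_lpd]

# step 3: score each within-model ensemble on the holdout days, then compute
# a single weight per already-stacked ensemble
ens_lpd = np.column_stack([logsumexp(lpd + np.log(w), axis=1)
                           for lpd, w in zip(holdout_lpd, chain_w)])
model_w = stacking_weights(ens_lpd)
print(np.round(model_w, 3))

# steps 4-5: refit on the full series, stack each model's new chains the same
# way, and combine the K full-data ensembles using model_w
```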

I don’t think that’s what I want, because in step 4, what weights are you using to stack the chains together? You’re using weights that were computed with different chains (ones that didn’t see the holdout data), and because they’re different and the region that each chain explores is somewhat random, you can’t apply those weights to different chains. The crux of the matter is whether or not one can approximate what the held-out performance of a chain would have been had it not seen the data, when in fact it did. Only in this way (as far as I can tell) can you create weights and then apply them to the same chains.

First off, I’m not an expert on this and I could well be out of my depth. With that said…

The assumption is that by stacking you approximate a well-converged model. So if you stack in step 2, then even though the chains are random, the stacked ensemble should converge to the true posterior (not that this is necessarily the case, but it’s the assumption you’re willing to make in order to pursue stacking here).

Same in step 4. Even though the chains are random, the stacked ensemble should converge to the true posterior (or at least that’s what you’re willing to assume).

Now in step 5, I’m suggesting that you not use the individual chains anymore at all, but rather you stack the already-stacked ensembles based on weights computed using those ensembles.

Thanks for the help. I guess what you’re proposing is that I stack ensembles of stacked ensembles. I guess that could work, but I am missing a “level” in between at which to stack the ensembles.

In any case, I found what I’m looking for, which is in Appendix B of this paper, which shows how to approximate the leave-future-out performance when the model has been fit to the full dataset.
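For the record, here is my rough understanding of that Appendix B computation as code. I’m assuming ArviZ’s `psislw` accepts a plain 1-D vector of log weights, and the log-likelihood matrix is a fake stand-in for our real models:

```python
import numpy as np
from scipy.special import logsumexp
import arviz as az

# loglik_holdout[s, i] = log p(y_{T-N+i} | theta_s), where theta_s is a draw
# from the posterior conditioned on the FULL series (the future was left in)
rng = np.random.default_rng(0)
loglik_holdout = rng.normal(-1.0, 0.3, size=(4000, 30))  # fake stand-in

# inverse-likelihood weights: divide the last N days back out of the posterior
log_w_raw = -loglik_holdout.sum(axis=1)

# Pareto-smooth the weights and check the k-hat diagnostic
log_w, k_hat = az.psislw(log_w_raw)
print("pareto k-hat:", float(k_hat))  # much above 0.7 => don't trust this

# importance-weighted pointwise estimate of each holdout day's log predictive
# density, as if the model had not seen those N days
lpd_days = logsumexp(log_w[:, None] + loglik_holdout, axis=0)
print("approx leave-future-out elpd:", lpd_days.sum())
```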

The fact that there is a time series is not a sufficient condition for not using leave-one-out (see, e.g., CV-FAQ #9).

This can be the reason for not using leave-one-out.

Ouch

Great! I was just about to comment that you can do the LFO in either direction, but it’s good that we did mention it also in the appendix.

If computation is not a concern, it is not especially challenging to do stacking for time series. In Section 5.3 of https://arxiv.org/pdf/2101.08954.pdf, we have an example in which we stack time series forecasts (polling data).

But things are harder when multimodality kicks in. If the full posterior is multimodal, when we leave the last day out, it is hard to replicate the same set of “modes” or attraction regions even with the same HMC initializations. They may merge or disappear.

In the exchangeable LOO context, this may happen too. We avoided this ambiguity by defining the leave-one-out distribution via inverse-likelihood importance weighting (Section 3.2 of https://arxiv.org/pdf/2006.12335.pdf).

I don’t have an example at hand, but I would argue the same “definition” is applicable to leave-future-out multi-chain stacking. The idea is that we run K chains on the full dataset, cluster them, and then use importance weighting to compute the leave-1-day posterior, …, the leave-T-day posterior, until the Pareto k̂ can no longer support the importance sampling approximation. Then we treat these T days as holdout data and stack these K chains.
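A sketch of what I mean, with fake log-likelihood values, the clustering step skipped, and (for brevity) all T pseudo-holdout days scored under the single leave-T-days-out posterior rather than the full sequence of leave-d-day posteriors (again assuming ArviZ’s `psislw` takes a 1-D vector of log weights):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax
import arviz as az

def stacking_weights(lpd):
    # lpd: (n_points, n_chains) pointwise log predictive densities
    def neg(z):
        w = softmax(np.append(z, 0.0))
        return -np.sum(logsumexp(lpd + np.log(w), axis=1))
    z_hat = minimize(neg, np.zeros(lpd.shape[1] - 1)).x
    return softmax(np.append(z_hat, 0.0))

rng = np.random.default_rng(0)
C, S, T_max = 8, 1000, 60  # chains (all fit to the full data), draws, max days

# loglik[c][s, t] = log-likelihood of the t-th day from the end of the series,
# evaluated at draw s of chain c
loglik = [rng.normal(-1.0, 0.3, size=(S, T_max)) for _ in range(C)]

# grow the held-out block one day at a time until PSIS gives up
T = 0
for t in range(1, T_max + 1):
    khats = [az.psislw(-ll[:, :t].sum(axis=1))[1] for ll in loglik]
    if max(khats) > 0.7:  # conventional reliability threshold for k-hat
        break
    T = t

# per-chain pointwise lpd on the T pseudo-holdout days via smoothed weights
lpd = np.empty((T, C))
for c, ll in enumerate(loglik):
    lw, _ = az.psislw(-ll[:, :T].sum(axis=1))
    lpd[:, c] = logsumexp(lw[:, None] + ll[:, :T], axis=0)

w_chain = stacking_weights(lpd)  # weights for the very chains that saw the future
print(T, np.round(w_chain, 3))
```

The k̂ check is doing the real work here: once it fails, the inverse-likelihood weights no longer give a usable approximation to the leave-future-out posterior.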
