# Model averaging for Bayesian time series models

I would need some help or advice in order to perform model averaging for time series. To illustrate my goal, I will provide a simple example and also briefly discuss why, I believe, LOO (leave-one-out) and LFO (leave-future-out) does not really target my purpose.

Assume that we have a time series for 30 years (one data per year), (y_1,y_2,\dots,y_{30}), and we have two model canditates M_1 and M_2. My goal is two-fold:

1. Select the model which has the best prediction performance for the last 10 years, based on data of the first 20 years: (y_1,y_2,\dots,y_{20})
2. As an alternative, propose a model M_3 which is a linear combination of M_1 and M_2, and gives higher weight to the model which has the best prediction performance for the last 10 years.

Following the arguments of Burkner et al. (2019), model comparison based on leave-one-out (LOO) on the whole data is not appropriate because it will select the model which fits best the data for the 30 years and not on its ability to forecast.

On the other hand, leave-future-out as proposed in that paper is not exactly appropriate for my purpose because I am only concerned about the prediction performance of the last 10 years, and not a rolling window of 10 years as proposed in that paper.

I would then suggest the following if it makes sense:

1. For each model M_k and for S parameters draws, determine the expected log predictive density (elpd) by comparing the observed data of the last 10 years (y_{21},y_{22},\dots,y_{30}) with the last 10 year forecasts (\hat{y}_{21}^{(s)},\hat{y}_{22}^{(s)},\dots,\hat{y}_{30}^{(s)}),for s=1,...,S. If I am right, this can be obtained by averaging the log likelihoods over S and summing over 10 years:
elpd^k\approx \sum_{i=1}^{10}\frac{1}{S} \sum_{s=1}^{S}\log p(y_{20+i:30}|y_{1:20},\theta_{1:20}^{(s)})
2. For model averaging, I would consider the following weights based on the ideas of Pseudo-BMA of Yao et al. (2018) but adapted to my goal:
w_k= \frac{exp(elpd^k)}{\sum_{k=1}^{2}exp(elpd^k)}
Hence, a higher weight is given to the model which has the highest elpd, hence the highest forecast performance of the last 10 years.

Does my reasoning make sense? Do you have any suggestions?
Thank you very much for your help!

1 Like

Hi, sorry that your question slipped through.

I would tag @avehtari as he knows this stuff best, but at first glance, your approach looks reasonable to me. I think you might also benefit from something like leave-future-out where you would look at all 2(or 3 /4) year windowss in the 10 year for evalution while training on all the years before that (e.g. training on years 1-25 and predicting years 25-28) as this would let you asses how the predictions change when you add more observations (and use the LOO statistic).

Best of luck with the model!

Since last time, I have implemented this approach and it gave good results! Namely a higher weight was given to the model which forecasted the best for the last 10 years.

Thanks for your suggestion as well. Since I have 50 years of data, I think I might consider a rolling window of 10 years for the last 20 years and see, as you suggested, how the predictive performance changed over the recent past.

P.S: The sum over S should be inside the log in the elpd formula which can be computed via the log_sum_exp function, c.f. this link,
e l p d^{k} \approx \sum_{i=1}^{10} \log \frac{1}{S} \sum_{s=1}^{S} p\left(y_{20+i: 30} | y_{1: 20}, \theta_{1: 20}^{(s)}\right)

Hi @Kbus007,

I had also missed your post. Yes the approach you describe is sensible. You could further improve by taking into account the uncertainty (Pseudo-BMA+ in the stacking paper).

1 Like

Hello @avehtari,

Thank you very much for your feedback!

In fact, if I am correct in the function that I called, I already considered the Pseudo-BMA+.
Let me explain, assume that S is a N \times K matrix containg the point-wise log likelihoods, where N is the number of data points and K is the number of models. Now, if I call the following functions in R:

stack_weights<-stacking_weights(S)
pseudo_BMA_weights<-pseudobma_weights(S)


from the loo package, the first function will provide the stacking weights which maximize the log score as described in your paper and the second function will provide the weights form the Pseudo-BMA+. Is that correct?

Thank you very much in advance.

1 Like

Yes! I just looked at the equation in your first post which is without +. Calling these by default is using +

Btw. as + is ambiguous, we are considering in the future to use terms LOO-SE-weights and LOO-BB-weights to clarify also whether normal approximation or Bayesian bootstrap is used to take into account the uncertainty. And more generally Kfold-SE/BB-weights or LFO-SE/BB-weights etc, to clarify which cross-validation is used.

2 Likes

@avehtari Sorry for asking but I am now getting confused on how to use properly Pseudo-BMA+ with leave-future-out validation.

If I understand correctly, Pseudo-BMA+ takes into account the uncertainty caused by finite amounts of data. To cite Burkner et al. (2020) :

Second, there is uncertainty caused by finite amounts of data. For 1-step-ahead predictions, we can use an analogous approach to what is done in approximate LOO-CV by computing the standard error across the pointwise estimates for each observation (Vehtari et al., 2017). More generally, for M-step-ahead predictions, we can compute the standard error by using every Mth value which are then independent.

In my example above, I split my 30 years of data in two: first 20 years to fit the model and last 10 years to determine the 10-year ahead forecast (= leave-future-out validation). If I understand correctly the quote, does it mean that I should compute the standard error by using the 10th-year point and 20th year point only ?

In my understanding, Pseudo-BMA+ cannot be used in this example because the elpd sums over 1-year, 2-year,â€¦ 10-year ahead forecasts. Therefore, the â€śstandard errorâ€ť of this sum would not really make sense.

Thank you very much in advance.

If I understood correctly you are splitting the data just in two parts, and thus you have just one prediction and you need more than one to compute the standard error.