I’m having a hard time making sense of the write-ups on LFO cross-validation found on the Stan website (http://mc-stan.org/loo/articles/loo2-non-factorizable.html) and on arxiv (https://arxiv.org/pdf/1902.06281.pdf), both of which are entitled “Approximate leave-future-out cross-validation for Bayesian time series models.” The Stan-website writeup (in the section “M-step-ahead predictions”) assumes that we have a factorizable model, i.e., the values y[t], 1 <= t <= T, have independent distributions conditional on the model parameters – but this is never true for any but trivial time-series models! The whole point of having a time series model is that past values are informative about future values.

The arxiv writeup correctly notes that “Most time series models have a non-factorizable likelihood”, but then equation (10) treats the likelihood as factorizable anyway. Furthermore, it appears to have the formula for the raw importance ratios inverted – compare equation (10) in the arxiv write-up with the second equation in section “Approximate M-SAP Using Importance Sampling” in the Stan-website writeup.

Can the authors, or anyone else who understands the paper, help me out here?

The link you provided for Stan website has title " Leave-one-out cross-validation for non-factorizable models". Did you mean to link to http://mc-stan.org/loo/articles/loo2-lfo.html which has title " Approximate leave-future-out cross-validation for Bayesian time series models" ?

If we have a model p(y_i|f_i,\phi) and joint time series prior for (f_1,...,f_T) then p(y_i|f_i,\phi) can be considered independent given f_i and \phi and likelihood is factorizable. This is true often and the past values are informative about future values, but conditionally we know f_i, the past values are not providing additional information. This should not be confused with that when we don’t know f_i and integrate over the posterior of (f_1,...,f_T), then y_i are not independent anymore. Also they are not anymore exchangeable as we have the time ordering telling additional information. In cross-validation M-step ahead prediction is more about the usual interest in predicting future and evaluating the time series model for (f_1,...,f_T), but leave-one-out cross-validation is valid for assessing conditional part p(y_i|f_i).

Thanks, we’ll fix this to not say that.

Thanks we need to fix the vignette. We first implemented leave-future-out by fitting the model with all data, and removing one by one observations, but after an anonymous hint, we realized that it’s better to fit first the minimal model and add one by one observations as then the proposal tends to have thicker tails than the target.

Tagging @paul.buerkner so the he also sees what we need to fix.

Aki: You’re right, I included the wrong link, I meant the one you mention.

If we have a model p(y_i | f_i, ϕ) and joint time series prior for (f_1,\ldots,f_T)

Oh, so for, say, a SSM you’re including the sequence of latent states as part of the parameter vector \theta? I would suggest you make that explicit, as this runs counter to the terminology usually used in the SSM literature. Anyway, thanks, that clarifies things a lot.

The autoregressive model discussed as an example for the online writeup (Water Level in Lake Huron) is non-factorizable: there is nothing that plays the role of f_i in your explanation, and y_{i+n} for n>0 is dependent on y_{i-1},\ldots,y_{i-p}.

Do you really need factorizability if you can efficiently compute p(y_{j+1}\mid y_1,\ldots,y_j,\theta) for all j, as is the case for AR(p) models, and for linear SSMs via the Kalman filter?

Thanks for the clear questions, these will help to improve the paper and the vignette

AR-model can be written with f explicitly shown and this is what is done, e.g., in Stan code produced by brms.

Your questions raises another question that is what does factorizable mean?

If we don’t present f explicitly, AR-model still factorizable for LFO purposes as the joint posterior is p(y_1|\theta)p(y_2|y_1,\theta)p(y_3|y_2,y_1,\theta),\ldots,p(y_T|y_1,...,Y_{T-1},\theta)
which, e.g., in case of AR(1) reduces to p(y_1|\theta)p(y_2|y_1,\theta)p(y_3|y_2,\theta),\ldots,p(y_T|Y_{T-1},\theta).
Which are terms we are interested in. This factorization is the reason we can easily use (Pareto smoothed) importance sampling as the contribution of each additional observation is easy to compute. We are restricted to add these contributions in time order, which is not a problem as in LFO we want to do that anyway. For LOO we have a challenge if we can’t represent the model with terms y_i|f_i as discussed in the non-factorizable paper.

That efficient computation Kalman filter is based on that each term in the time ordered factorization is simple. Kalman filter could be used to compute these terms in cases where Kalman filter is applicable. Maybe we should have used non-Gaussian observation model (although then you could ask why we don’t use EKF or UKF, and we would need to again extend our model in a way that is easy in Stan and MCMC, but gets more difficult with *KF type algorithms, and then you could ask why don’t we use particle filters, and then I would say we don’t have those implemented in Stan)

Your questions show that we need to be more careful with the term factorizable and when we have explicit latent values f and when not.

Thank you @Kevin_Van_Horn for your really helpful comments! As soon as the LFO paper comes out of review, we will incoroporate your comments in the revised version, in particular clarify what we understand as factorizable and what kind of non-factorizability would be a problem for our method and what would not.

I will update the LFO vignette in the LOO package as soon as we have a final version of the paper, in order to avoid having to update the same things in two places multiple times.

This post helps a lot but I am still confused about whether or not it is possible to use LOO-PSIS for time series. In my case, I am interested in model comparison and in the interpretation of the model parameters. As mentioned by @Kevin_Van_Horn, the point-wise log-likelihood can be computed with the Kalman filter for linear Gaussian state space model. Does the time ordered factorization of the likelihood prevent from using LOO-PSIS and we must use LFO-PSIS instead?

Yes, it’s possible. It depends on what are your goals whether LOO or LFO is more useful.

Does this mean that you are not going to use the model to predict future? Which of the model parameters you are interested in?

Yes, and I would recommend it if you are using Kalman filter for linear Gaussian state space models. In this forum I’m a priori assuming people are using MCMC with Stan and write my responses based on that assumption.

No, it doesn’t prevent. It’s valid to use LOO, but it answers a different question than LFO.

I am going to use the model to predict the future only for validation purpose. My main goal is to infer the parameters of a grey-box model to estimate thermal properties of the true physical system.

Sorry I should have clarified this point. I come to the discourse to learn about Bayesian inference in general.

I get that point now, maybe I will try both and compare if the model selection with LFO and LOO leads to different recommendations.

Then LFO would predict what happens in that prediction better.

I try to be more specific. Let’s say you have time t=1,\ldots,T, a conditional model p(y_t|f_t,\phi) and time series model p(f_1,\ldots,f_T|\theta). If you are interested in \phi, LOO is probably enough. If you are interested in \theta, LOO might be enough but LFO would be better. In your model some of the parameters probably belong to the \phi category and some to the \theta category.

No problem, I’m happy to help in general.

My guess is that they give the same order for models, but when comparing models where the difference is in the timser series model part LOO shows smaller differences than LFO.

Yes, and I would recommend it if you are using Kalman filter for linear Gaussian state space models. In this forum I’m a priori assuming people are using MCMC with Stan and write my responses based on that assumption.

Those aren’t mutually exclusive. I often use MCMC to estimate the model parameters of a linear Gaussian SSM, and only switch to MAP in situations where I have good reason to believe the parameter uncertainty won’t contribute significantly to the forecast uncertainty, or just can’t afford to do MCMC. (Note that parameter uncertainty can lead to thicker tails, which can be important if you want a high-confidence upper bound forecast.)

So in more layman terms, would it be an accurate assertion to say that LFO is more appropriate if the goal is to see which model is better for out-of-sample predictive accuracy of future points and LOO is more appropriate if the goal is to determine out-of-sample predictive accuracy of unobserved measurements corresponding to existing obs_times?

Also, are there plans to implement LFO in the loo package?

Not immediate. It’s likely that refits are needed and refits need knowledge of the data and model structure so it’s not possible to make it generic. This is the same reason why refit for LOO is supported only in rstanarm and brms. There is a vignette (which should be updated) for LFO showing code.

It’s not clear what you mean with Stan directly. 1) We don’t have an example how you could do it using only Stan language. 2) The vignette Approximate leave-future-out cross-validation for Bayesian time series models shows R code that can be modified to be used with models written in Stan language directly (and not via, e.g. brms formulas).

Yes. I was going to say that see ?brms::log_lik for what it computes given the additional asrguments, but I noticed that it doesn’t document oos argument. @paul.buerkner can you clarify that part?

I think oos is passed to brms::prepare_predictions as out-of-sample indices. On second glance, it seems that other parts of the codes do not rely on brms and I just need to sort out log_lik in Stan code. Thanks a lot!