Conceptual question regarding application of LOO

First, it is better to discuss cross-validation in general; LOO is just one specific case.

I recommend first reading When LOO and other cross-validation approaches are valid.

I will expand on some points below, and after that if anything is unclear please ask, as I’m preparing new material on cross-validation and feedback on unclear issues is useful.

LOO, and cross-validation in general, does not require independence, nor even conditional independence. Exchangeability is sufficient. Even if we are using models with a conditional independence structure, that doesn’t require the true data generating mechanism to be such; due to exchangeability and the data collection process we can proceed as if assuming conditional independence. See more in BDA3 Ch 5. Cross-validation can also be used when the model doesn’t have a conditional independence structure.

In a time series, y_1,\ldots,y_T are not exchangeable, as the index carries additional information about similarity in time. If we have a model p(y_t|f_t) with latent values f_t, then the pairs (y_1,f_1),\ldots,(y_T,f_T) are exchangeable (see again BDA3 Ch 5) and we can factorize the likelihood trivially. We can usually write time series models with explicit latent values f_t, but sometimes we integrate them out analytically for computational reasons and then get a non-factorizable likelihood for exactly the same model. See the two posts in another thread.
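As a minimal sketch of such a model with explicit latent values (my illustration, not code from the thread), consider an AR(1) latent process with Gaussian observations; given f, the observation model factorizes over t:

```stan
data {
  int<lower=1> T;
  vector[T] y;
}
parameters {
  vector[T] f;                    // explicit latent values f_t
  real<lower=-1, upper=1> phi;    // AR(1) coefficient
  real<lower=0> sigma_f;          // innovation sd of the latent process
  real<lower=0> sigma_y;          // observation noise sd
}
model {
  phi ~ normal(0, 0.5);
  sigma_f ~ normal(0, 1);
  sigma_y ~ normal(0, 1);
  // time series part: joint prior p(f_1,\ldots,f_T)
  f[1] ~ normal(0, sigma_f / sqrt(1 - square(phi)));
  f[2:T] ~ normal(phi * f[1:(T - 1)], sigma_f);
  // observation part: factorizes as a product of p(y_t|f_t)
  y ~ normal(f, sigma_y);
}
```

Integrating f out analytically (e.g. with a Kalman filter) would give exactly the same model for y, but with a non-factorizable likelihood.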

If we want to evaluate the goodness of the model part p(y_t|f_t), LOO is fine. If we want to evaluate the goodness of the time series model part p(f_1,\ldots,f_T), we may be interested in the goodness of predicting missing data in the middle (think about audio restoration of recorded music with missing parts, e.g. due to scratches in the medium), or we may be interested in predicting the future (think about stock markets or disease transmission models).

If the likelihood is factorizable (and if it’s not, we can in some cases make it factorizable), this shows up in the Stan code as a sum of log-likelihood terms. It is then possible to define entities which are sums of those individual log-likelihood components, as in the sketch below. If the sums correspond to exchangeable parts, we may use terms like leave-one-observation-out, leave-one-subject-out, leave-one-time-point-out, etc. If we additionally want to restrict the information flow, for example in time series, we can add the constraint that if y_t is not observed then y_{t+1},\ldots,y_{T} are not observed either.
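Continuing the sketch above, the pointwise terms can be stored in generated quantities (the name log_lik is the convention expected by the loo package; the rest is illustrative):

```stan
generated quantities {
  // one log-likelihood term per time point; because the observation
  // model factorizes, the total log-likelihood is the sum of these,
  // and they can be regrouped (per observation, per subject, per
  // time point) for different cross-validation schemes
  vector[T] log_lik;
  for (t in 1:T)
    log_lik[t] = normal_lpdf(y[t] | f[t], sigma_y);
}
```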

How do we then choose the level of what to leave out in cross-validation? It depends on which level of the model is interesting, and if many levels are interesting, you can do cross-validation at different levels. And if you want to claim that your scientific hypothesis generalizes outside the specific observations you have, you need to define what is scientifically interesting. For example, in brain signal analysis it’s useful to know whether the time series model for brain signals is good, but it is scientifically more interesting to know whether models learned from a set of brains also work well for new brains not included in the data used to learn the posterior (the training set in ML terms).

What do you want to do with these models? Predict the future for the same locations? Claim that models learned from data at certain locations can describe the phenomenon at other locations? When testing generalization outside the observed data, are there additional constraints on what information should be available (e.g. the causal direction of time)?

Yes, if the different data types form groups which are exchangeable.

It can be. You can decide. If there is a constraint you may need different computation, but you can still choose what is a sensible entity for the scientific or predictive task.

It’s better to first work out what cross-validation you want to do. When you know that, you can ask how to compute it efficiently. There is no need to constrain your model evaluation based on what is easy to do with PSIS-LOO.

Not conceptually. In practice you need to be careful with how the continuous data are scaled, as the scaling affects the log-densities; log-probabilities and log-densities of arbitrarily scaled data are not comparable, and their contributions would have arbitrary weights in the sum. You can also report the performance for these separately; you don’t need to sum them together.
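To see why the scaling matters (a one-line change-of-variables argument in my notation, not from the original post): if a continuous observation y is rescaled as z = cy with c > 0, then

p_Z(z) = \frac{1}{c} p_Y(y) \quad\Rightarrow\quad \log p_Z(z) = \log p_Y(y) - \log c,

so each rescaled term in the summed log score picks up an arbitrary offset -\log c determined by the choice of units.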
