Conceptual question regarding application of LOO

I’m hoping to start a discussion about the appropriateness of using LOO in a particular model type.

My situation is basically a population matrix model. I track, say, A ages over T years; animals are born, grow, die, are captured, etc. There are a bunch of non-linearities and a hierarchical structure in the model, but that’s most likely not relevant. The important point here is that the data used to fit the model are more complex than the examples given in the loo paper/package. I have multiple different types of data.

In a fisheries context, a vessel catches fish in a net at a location. By comparing catch rates (total biomass caught per effort) over time I have information about the relative trend of the population, N(t), often modeled with a log-normal distribution. I also have data sets giving the observed proportions of ages among sub-samples of individuals (e.g., 100 individuals), which are usually assumed to be multinomial. In practice we might have 10-15 distinct data sets, but this simple example is sufficient to demonstrate the general idea.

In pseudo code I might loop over T years:

# dmultinom() needs counts, so the observed proportions are converted back to counts
log_lik_1[t] <- dnorm(Nobs[t], Npred[t], sd, log = TRUE)
log_lik_2[t] <- dmultinom(x = round(100 * Pobs[t, ]), size = 100, prob = Ppred[t, ], log = TRUE)

where Pobs is the observed proportions of ages in year t across A ages from a sample of 100 individuals, and Ppred is the predicted proportions at age from the model. These components are assumed independent and are summed inside the model to give a joint log-likelihood, in a statistical model which we call an “integrated analysis” in fisheries science. Naively I would think I could just do log_lik = c(log_lik_1, log_lik_2) and pass that through to loo as usual.
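
For concreteness, something like the following is what I naively have in mind (a sketch only; log_lik_1, log_lik_2, and chain_id are placeholder objects that would come from my fit):

```r
library(loo)

# hypothetical objects: log_lik_1 and log_lik_2 are S x T matrices of pointwise
# log-likelihoods (posterior draws in rows, years in columns), one per data type
log_lik <- cbind(log_lik_1, log_lik_2)          # S x 2T: each column is one "observation"

r_eff   <- relative_eff(exp(log_lik), chain_id = chain_id)  # chain_id: chain index per draw
loo_fit <- loo(log_lik, r_eff = r_eff)
print(loo_fit)
```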

My question is whether it is statistically appropriate to combine different data types like this.

I read the loo vignette about non-factorized models and, while it doesn’t seem to apply here, it got me thinking. Specifically, I am doubtful because the dmultinom calculation takes a vector of observations with a constraint (they sum to 100) and collapses it into a single log-likelihood value. So is that really a single “datum”? Furthermore, given the structure of the model, the multinomial samples across years will measure the same age classes multiple times, e.g., age 1 in year 1 is observed again as age 2 in year 2, just with fewer animals (deaths). The model estimates a single initial number of animals Nhat(t,0), but the size of the age class is measured multiple times. Put another way, if I dropped year 5, I would still have information about all age classes because later samples measure them. I don’t know whether this breaks the independence required by loo.
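
To illustrate what I mean by “collapsed” (toy numbers):

```r
# a single year's age composition collapses to one log-likelihood value
p_pred <- c(0.55, 0.30, 0.15)                          # predicted proportions for A = 3 ages
dmultinom(c(60, 28, 12), size = 100, prob = p_pred, log = TRUE)
```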

I’m wondering if anyone else has worked with PSIS-LOO in a similar situation and has thought about what a “datum” is when there is a variety of data sets of different types. Is it a problem to mix discrete (multinomial) and continuous (lognormal) data types? Does independence hold in this situation, and would it be appropriate to calculate PSIS-LOO as described above?

First, it is better to discuss cross-validation in general; LOO is just one specific case.

I recommend first reading When LOO and other cross-validation approaches are valid.

I will expand on some points below, and after that, if anything is unclear, please ask, as I’m preparing new material on cross-validation and feedback on unclear issues is useful.

LOO and cross-validation in general do not require independence, nor even conditional independence. Exchangeability is sufficient. Even when we use models with a conditional independence structure, this doesn’t require that the true data-generating mechanism is such; due to exchangeability and the data collection process we can proceed as if assuming conditional independence. See more in BDA3 Ch 5. Cross-validation can also be used when the model doesn’t have a conditional independence structure.

In time series, y_1,\ldots,y_T are not exchangeable, as the index carries additional information about similarity in time. If we have a model p(y_t|f_t) with latent values f_t, then the pairs (y_1,f_1),\ldots,(y_T,f_T) are exchangeable (see again BDA3 Ch 5) and we can factorize the likelihood trivially. We can usually present time series models with explicit latent values f_t, but sometimes we integrate them out analytically for computational reasons and then get a non-factorizable likelihood for exactly the same model. See two posts in another thread.
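
To spell out the two forms of the same model:

$$
p(y_1,\ldots,y_T \mid f_1,\ldots,f_T) = \prod_{t=1}^{T} p(y_t \mid f_t),
$$

$$
p(y_1,\ldots,y_T) = \int \left[ \prod_{t=1}^{T} p(y_t \mid f_t) \right] p(f_1,\ldots,f_T)\, df_1 \cdots df_T .
$$

The first (conditional) form factorizes over t; the second (marginalized) form in general does not, even though it is the same model.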

If we want to evaluate the goodness of the model part p(y_t|f_t), LOO is fine. If we want to evaluate the goodness of the time series model part p(f_1,\ldots,f_T), we may be interested in the goodness for predicting missing data in the middle (think about audio restoration of recorded music with missing parts, e.g. due to scratches in the medium), or we may be interested in predicting the future (think about stock market or disease transmission models).

If the likelihood is factorizable (and if it’s not, we can in some cases make it factorizable), then this shows up in Stan code as a sum of log-likelihood terms. It is then possible to define entities which are sums of those individual log-likelihood components. If the sums correspond to exchangeable parts, we may use terms like leave-one-observation-out, leave-one-subject-out, leave-one-time-point-out, etc. And if we want to additionally restrict the information flow, for example in time series, we can add the constraint that if y_t is not observed then y_{t+1},\ldots,y_T are not observed either.
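
As a sketch of what one such grouping could look like for the model in the original post (assuming log_lik_1 and log_lik_2 are draws-by-years matrices of pointwise log-likelihoods):

```r
# leave-one-year-out: each entity is a year, so sum that year's contributions
# from both data types before passing the matrix to loo
log_lik_year <- log_lik_1 + log_lik_2   # S x T, one column per year
loo_year <- loo::loo(log_lik_year)
```

Leaving out larger groups like this tends to make the importance-sampling approximation harder, so the Pareto k diagnostics should be checked with extra care.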

How do we then choose the level of what to leave out in cross-validation? It depends on which level of the model is interesting, and if many levels are interesting then you can do cross-validation at different levels. Or if you want to claim that your scientific hypothesis generalizes outside the specific observations you have, you need to define what is scientifically interesting. For example, in brain signal analysis it’s useful to know whether the time series model for brain signals is good, but it is scientifically more interesting to know whether models learned from a set of brains also work well for new brains not included in the data used to learn the posterior (the training set in ML terms).

What do you want to do with these models? Predict the future for the same locations? Show that models learned from data at certain locations can describe the phenomenon at other locations? When testing generalization outside the observed data, are there additional constraints on what information should be available (e.g., the causality of time)?

Yes, if the different data types form groups which are exchangeable.

It can be. You can decide. If there is a constraint you may need a different computation, but you can still choose what a sensible entity is for the scientific or predictive task.

It’s better to first work out what cross-validation you want to do. Once you know that, you can ask how to compute it efficiently. There is no need to constrain your model evaluation based on what is easy to do with PSIS-LOO.

Not conceptually. In practice you need to be careful with how the continuous data are scaled, as the scaling affects the log-densities; log-probabilities and log-densities of arbitrarily scaled data are not comparable, so their contributions would have arbitrary weights in the sum. You can also report the performance for these separately; you don’t need to sum them together.
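
A quick toy illustration of the scaling issue (the units are made up just to show the shift):

```r
# the same continuous observation in different units: the log-density shifts by log(1000)
dnorm(123.4,  mean = 120,   sd = 10,    log = TRUE)   # about -3.3
dnorm(0.1234, mean = 0.120, sd = 0.010, log = TRUE)   # about  3.6
# a discrete log-probability does not depend on the measurement units
dbinom(30, size = 100, prob = 0.3, log = TRUE)        # about -2.5 either way
```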


Thanks for the thorough reply. I’m going to read through the linked resources and think carefully about exchangeability and predictive goals.

One thing I didn’t understand is the last paragraph, where you talk about mixing discrete/continuous likelihoods and being careful about how they are “scaled” – what do you mean by this?

First, I did not say “likelihoods”. The likelihood is a function with respect to the parameters, and Stan doesn’t allow discrete parameters (unless they are integrated out by summing). You can mix discrete and continuous observation models which both have continuous likelihood functions. (Yes, I know that people often call observation models likelihoods, but in this specific case it is very important to make a clear distinction between them, as a discrete observation model can have a continuous likelihood function and a continuous observation model can have a discrete likelihood function.)
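
A toy illustration of that distinction:

```r
# a discrete observation model (Poisson) still has a continuous likelihood function
lambda <- seq(0.1, 10, by = 0.1)    # grid of parameter values
lik <- dpois(3, lambda)             # likelihood of lambda given the observed count y = 3
plot(lambda, lik, type = "l")       # smooth and continuous in lambda
```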

Scaling of the data doesn’t change the probabilities in a discrete observation model. Scaling of the data does change the probability densities in a continuous observation model. People often scale the data before modeling, for example to have a standard deviation of 1. The same holds for other transformations, e.g. people might compare a Poisson model for discrete counts to a normal model for log counts, and then the results are not comparable. When the probabilities don’t change but the densities do, the relative weights of the components change. So you need to be careful, either by explicitly discretizing the continuous part to probabilities as I mentioned earlier, or by keeping the scale such that the densities correspond directly to a sensible discretization.
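
A sketch of the discretization idea (toy numbers; the measurement resolution delta is something you would choose based on how the data were recorded):

```r
# turn a density into an approximate probability: P(y in its bin) ~ density(y) * delta,
# i.e. log-probability ~ log-density + log(delta)
delta    <- 0.1                                     # e.g. catch rate recorded to the nearest 0.1
log_dens <- dnorm(2.34, mean = 2, sd = 0.5, log = TRUE)
log_prob <- log_dens + log(delta)                   # now comparable with discrete log-probabilities
log_prob
```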