I’m hoping to start a discussion about the appropriateness of using LOO in a particular model type.
My situation is basically a population matrix model. I track, say, A ages over T years, during which animals are born, grow, die, are captured, etc. There are a number of non-linearities and some hierarchical structure in the model, but those are most likely not relevant. The important point is that the data used to fit the model are more complex than in the examples given in the loo paper/package: I have multiple different types of data.
In a fisheries context, a vessel catches fish in a net at a location. Comparing catch rates (total biomass caught per unit effort) over time gives me information about the relative trend of the population, N(t), and the observation error on this index is often modeled as log-normal. I also have data sets of observed proportions at age, counted from sub-samples of individuals (e.g., 100 individuals), which are usually assumed to be multinomial. In practice we might have 10-15 distinct data sets, but this simple example is sufficient to demonstrate the general idea.
In pseudo code I might loop over T years:
log_lik_1[t] = dnorm(Nobs[t], Npred[t], sd, log=TRUE)
log_lik_2[t] = dmultinom(x=100*Pobs[t,], size=100, prob=Ppred[t,], log=TRUE)
where Pobs is the observed proportions at age in year t across A ages from a sample of 100 individuals, and Ppred is the corresponding proportions at age predicted by the model. The two data sets are assumed independent, and their log-likelihoods are summed inside the model to get a joint likelihood; in fisheries science we call this kind of statistical model an “integrated analysis”. Naively, I would think I could just do log_lik=c(log_lik_1, log_lik_2) and pass that to loo as usual.
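To make the naive approach concrete, here is a minimal sketch in plain Python of building that combined pointwise vector for a single posterior draw. The helper names (normal_logpdf, multinomial_logpmf, pointwise_log_lik) and all numbers are hypothetical placeholders, not output from a real model, and I am assuming the index is already on the log scale and that the multinomial is evaluated on counts rather than proportions.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x; analogue of dnorm(..., log=TRUE)."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def multinomial_logpmf(counts, probs):
    """Log pmf of Multinomial(sum(counts), probs); analogue of dmultinom(..., log=TRUE)."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for k, p in zip(counts, probs):
        out -= math.lgamma(k + 1)
        if k > 0:  # a zero count contributes nothing, even if p is tiny
            out += k * math.log(p)
    return out

def pointwise_log_lik(n_obs, n_pred, sd, age_counts, p_pred):
    """One posterior draw: c(log_lik_1, log_lik_2) as a flat list of length 2*T."""
    log_lik_1 = [normal_logpdf(o, p, sd) for o, p in zip(n_obs, n_pred)]
    log_lik_2 = [multinomial_logpmf(c, p) for c, p in zip(age_counts, p_pred)]
    return log_lik_1 + log_lik_2

# Hypothetical placeholder values for T = 2 years and A = 3 ages (not real data).
n_obs = [4.6, 4.4]    # observed index, assumed to be on the log scale already
n_pred = [4.5, 4.5]   # model prediction for this one posterior draw
age_counts = [[50, 30, 20], [45, 35, 20]]    # counts out of 100 sampled fish
p_pred = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]  # predicted proportions at age

row = pointwise_log_lik(n_obs, n_pred, 0.2, age_counts, p_pred)
print(len(row))  # 2 index terms followed by 2 multinomial terms
```

Stacking one such row per posterior draw would give the draws-by-observations matrix that loo takes as input. Note that each multinomial sample contributes a single column here, no matter how many ages it covers.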
My question is whether it is statistically appropriate to combine different data types like this.
I read the loo vignette about non-factorized models, and while it doesn’t seem to apply here, it got me thinking. Specifically, I have doubts because the dmultinom calculation takes a vector of observations with a constraint (they sum to 100) and collapses it into a single log-likelihood value. So is that really a single “datum”? Furthermore, given the structure of the model, the multinomial samples across years measure the same age classes multiple times; e.g., age 1 in year 1 is observed again as age 2 in year 2, just with fewer animals (due to deaths). The model estimates a single initial number of animals, Nhat(t,0), but the size of that age class is measured multiple times. Put another way, if I dropped year 5, I would still have information about all age classes because later samples measure them. I don’t know whether this breaks the independence required by loo.
I’m wondering if anyone else has worked with PSIS-LOO in a similar situation and has thought about what a “datum” is when there is a variety of data sets of different types. Is it a problem to mix discrete (multinomial) and continuous (lognormal) data types? Does independence hold in this situation, and would it be appropriate to calculate PSIS-LOO as described above?
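One way to make the choice of “datum” explicit is to decide how the pointwise terms are grouped before they are handed to loo. The sketch below (plain Python; log_lik_by_year and all numbers are hypothetical) contrasts treating every index value and every multinomial sample as its own observation with treating a whole year as the unit. It relies only on the assumption already stated above, that the two data sets are independent given the parameters, so their log-likelihoods within a year can be added.

```python
def log_lik_by_year(log_lik_1, log_lik_2):
    """Sum the two components within each year, giving one term per year."""
    if len(log_lik_1) != len(log_lik_2):
        raise ValueError("both components need exactly one entry per year")
    return [a + b for a, b in zip(log_lik_1, log_lik_2)]

# Hypothetical values for one posterior draw over T = 3 years (not real output).
log_lik_1 = [-0.4, -0.9, -0.6]   # index (log-normal) terms
log_lik_2 = [-5.1, -4.8, -6.0]   # age-composition (multinomial) terms

per_obs = log_lik_1 + log_lik_2                    # 2*T units: leave one observation out
per_year = log_lik_by_year(log_lik_1, log_lik_2)   # T units: leave one year out

# Both groupings carry the same total log-likelihood for this draw;
# they differ only in what counts as the held-out unit.
print(len(per_obs), len(per_year))
```

This only changes what is left out; it does not settle whether either grouping is the right prediction task given that later years re-measure the same age classes.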