Model selection with multiple observation likelihoods

I’m looking to do model selection with the loo package on a mark-recapture model that I’ve been working on. The experimental setup is fairly simple. Fish are captured in a trap, marked, and released upstream of the trap. They may be recaptured any day over the next few weeks (this recapture window is chosen a priori). The probability that a marked fish is recaptured on any given day depends on the time since release and the probability of capture on that day. Simultaneously, unmarked fish are collected each day with the same probability of capture as the marked recaptures. Fish are individually marked so the time between release and recapture can be known exactly.

I understand that I need to provide log likelihood values for each observation. However, my model includes two observation likelihoods. The first is a multinomial, representing the number of recapture tags on each day with an additional category for those individuals that were not recaptured. Release groups include between 1 and 100 individuals, and the cell probabilities depend on latent travel time and probability of capture processes. The second is Poisson, representing the number of unmarked individuals captured each day.

The probability that an individual in release group r arrives at the trap on day t is \theta_{rt}, and the probability of capture on that day is p_t, so for release group r with n_r individuals, we observe recaptures and non-recaptures as the vector \boldsymbol{M}_r:

M_r \sim \text{multinomial}\left(n_r, \left(\theta_{r,t}p_t, \theta_{r,t+1}p_{t+1}, \ldots, 1 - \sum_t \theta_{r,t}p_t\right)'\right).
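As a concrete illustration of this likelihood (with made-up numbers, not data from the actual study), here is a minimal Python sketch that builds the cell probabilities from assumed \theta_{rt} and p_t values and evaluates the multinomial log likelihood for one release group:

```python
from math import lgamma, log

def multinomial_logpmf(counts, probs):
    """log pmf of a multinomial: log n! - sum_k log m_k! + sum_k m_k log q_k."""
    n = sum(counts)
    out = lgamma(n + 1)
    for m, q in zip(counts, probs):
        out += -lgamma(m + 1) + m * log(q)
    return out

# Hypothetical release group r: n_r = 10 fish, a 3-day recapture window.
theta = [0.30, 0.25, 0.15]     # assumed arrival probabilities theta_{rt}
p     = [0.40, 0.40, 0.40]     # assumed capture probabilities p_t
cell  = [th * pt for th, pt in zip(theta, p)]
cell.append(1.0 - sum(cell))   # final category: never recaptured

M_r = [2, 1, 0, 7]             # recaptures on each day + non-recaptures
log_lik_r = multinomial_logpmf(M_r, cell)
```

In a Stan model this would be one element of the `log_lik` vector computed in `generated quantities`, evaluated once per posterior draw.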

There can be as few as 4 or as many as 25 release groups.

The second observation likelihood is a thinned Poisson process, depending on the arrival rate of unmarked individuals on day t, \lambda_t, and the same probability of capture on day t, p_t, so that the number of unmarked individuals captured is u_t, where

u_t \sim \operatorname{Poisson}\left(\lambda_t p_t\right) \quad t = 1,\ldots, T.
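The per-day term is just a Poisson log pmf evaluated at the thinned rate \lambda_t p_t; a small Python sketch with illustrative numbers:

```python
from math import lgamma, log, exp

def poisson_logpmf(u, rate):
    """log pmf of Poisson(rate) at count u."""
    return u * log(rate) - rate - lgamma(u + 1)

# Hypothetical day t: arrival rate lambda_t = 12, capture probability p_t = 0.4.
lam_t, p_t, u_t = 12.0, 0.4, 5
log_lik_t = poisson_logpmf(u_t, lam_t * p_t)  # thinned rate lambda_t * p_t
```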

Calculating the individual log likelihood contributions from each Poisson observation seems straightforward. I have a couple of questions:

  1. How should I handle the two observation likelihoods? Treat them as equivalent by putting them into a single vector? Consider only using one of them? Calculate selection criteria for each likelihood separately and see if the selection criteria agree?
  2. What level should I consider a single observation of the multinomial likelihood? Release groups would be straightforward, but given that each individual in a release group is considered (conditionally) independent it probably makes more sense to treat each individual as a separate observation.

Thank you!

The loo package supports factorized likelihoods, so that the total likelihood is a product of likelihood terms and the log likelihood is a sum of log likelihood terms (for non-factorized likelihoods some additional computation is needed outside the loo package).

Each likelihood term provided to the loo package defines the amount of information removed in leave-something-out cross-validation, and when the log score is used it also has a double use as the log predictive density / probability for that left-out term.

If I understood your model correctly, the release group likelihood terms factorize (given the parameters) and the unmarked individuals likelihood terms factorize too, independently of the release group likelihood terms. So the easiest approach is just to give all release group and unmarked individuals log likelihood terms to the loo package.

  1. You can put them into a single vector to get one total elpd_loo value, which you can use to compare models. You can also compute the elpd_loo for them separately, which is useful as it is likely that you want to do calibration checks for the multinomial and Poisson separately. It may also provide additional insights when comparing models.

  2. I would consider release groups as that is the easiest given how you describe the model, but you can do it for individuals if you prefer that.
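The pooling in point 1 can be sketched in Python (the actual workflow would use `loo::loo()` in R with Pareto-smoothed importance sampling; this toy uses plain, unsmoothed IS-LOO and made-up log-likelihood values just to show that elpd is additive over terms):

```python
from math import log, exp
import random

def logmeanexp(xs):
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs) / len(xs))

def elpd_isloo(log_lik):
    """Plain importance-sampling LOO over an S x N matrix of pointwise
    log likelihoods: elpd_i = -log mean_s exp(-log_lik[s][i]), summed over i."""
    S, N = len(log_lik), len(log_lik[0])
    return sum(-logmeanexp([-log_lik[s][i] for s in range(S)])
               for i in range(N))

# Toy setup: S = 4 posterior draws, R = 5 multinomial terms (release groups),
# T = 20 Poisson terms (days), with fabricated pointwise log likelihoods.
random.seed(1)
S, R, T = 4, 5, 20
ll_multi = [[-2.0 + random.gauss(0, 0.1) for _ in range(R)] for _ in range(S)]
ll_pois  = [[-1.5 + random.gauss(0, 0.1) for _ in range(T)] for _ in range(S)]
ll_all   = [rm + rp for rm, rp in zip(ll_multi, ll_pois)]  # single pooled vector

total = elpd_isloo(ll_all)                            # pooled elpd
parts = elpd_isloo(ll_multi) + elpd_isloo(ll_pois)    # separate elpds, summed
```

Because elpd is a sum over terms, the pooled value equals the sum of the two separate values; computing them separately costs nothing extra and enables the per-likelihood checks mentioned above.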


Thank you @avehtari! This is very helpful. The likelihoods do factorize, so I’ve set up my Stan model to return each observation likelihood separately as well as a combined vector. Unfortunately, this results in a substantial number of observations with Pareto \hat{k} > 1. This is probably due to the large number of parameters (e.g. each unmarked observation has its own random effect), so I’m now exploring the Mixture IS method.

You can integrate out the observation-specific parameters using quadrature, as demonstrated in the Roaches cross-validation demo
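The Roaches demo is in R; as a rough Python sketch of the same idea (illustrative names and numbers, not the demo's actual code), Gauss-Hermite quadrature can integrate a normal random effect on the log rate out of each Poisson term, so the resulting log-likelihood term no longer depends on the per-observation parameter:

```python
import numpy as np
from math import lgamma, log, sqrt, pi

def poisson_logpmf(u, rate):
    """log pmf of Poisson(rate) at count u; rate may be a numpy array."""
    return u * np.log(rate) - rate - lgamma(u + 1)

def integrated_loglik(u, log_rate, sigma, n_nodes=30):
    """log p(u | log_rate, sigma) with a N(0, sigma^2) random effect on the
    log rate integrated out by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    # change of variables z = sqrt(2) * sigma * x for a N(0, sigma^2) effect
    rates = np.exp(log_rate + sqrt(2.0) * sigma * x)
    vals = np.exp(poisson_logpmf(u, rates))
    return log((w * vals).sum() / sqrt(pi))
```

With sigma near zero this recovers the plain Poisson log pmf; with a nontrivial sigma it gives the marginal likelihood term, which tends to behave much better in PSIS-LOO than conditioning on the observation-specific parameter.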
