# Understanding LOO and Binomial models

Hi all,

A quick question about computing LOO for models with Binomial likelihoods (or any that are weighted for that matter).

As far as I am aware, approximating the LOOCV score for a model makes use of the number of observations. Here’s the point made by McElreath in Statistical Rethinking:

Ordinarily, the number of observations is equal to the number of rows. But for models with Binomial likelihoods (or any that are weighted) this is not the case. Instead, the number of observations is equal to the sum of the number of Bernoulli trials (in the Binomial case) or the sum of the weights (in any other case).
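To see concretely why the counting differs, here is a small sketch (the numbers are made up for illustration) showing that a Binomial row's log-likelihood equals the sum of its constituent Bernoulli log-likelihoods plus a combinatorial constant that does not depend on the parameter:

```python
from math import comb, log

# Hypothetical example: one "row" with n = 10 trials, k = 7 successes, p = 0.6.
n, k, p = 10, 7, 0.6

# Binomial log-likelihood for the row (includes the combinatorial term).
ll_binom = log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# The same data treated as 10 Bernoulli observations: 7 ones and 3 zeros.
ll_bernoulli = k * log(p) + (n - k) * log(1 - p)

# The two differ only by log C(n, k), which does not depend on p,
# so inference about p is identical -- but the observation count is not.
print(ll_binom - ll_bernoulli)
```

So the fit is the same either way; what changes is how many "observations" the dataset is considered to contain, which is exactly what matters for LOO.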

Does this mean that LOO scores for such models will be incorrect?


It depends on what you want to leave out. If you want predictive performance on a single held-out Bernoulli trial, then you need the pointwise log-likelihood where the points are the individual Bernoulli trials. If you want predictive performance on a single held-out sampling unit, where a sampling unit is one string of iid Bernoullis with a Binomial sufficient statistic, then you hold out one row of the Binomial model. I’ll leave it to others with greater expertise to comment on weighted models in general.
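To make the two granularities concrete, here is a hedged NumPy sketch of building the pointwise log-likelihood matrix each way (the single-probability model, array names, and data are all assumptions for illustration; in practice the draws would come from your fitted model):

```python
import numpy as np
from math import comb, log

rng = np.random.default_rng(0)

# Hypothetical posterior draws of a single success probability.
p_draws = rng.beta(7, 4, size=1000)          # S = 1000 posterior draws
n_trials = np.array([20, 5, 12])             # trials per Binomial row
successes = np.array([13, 2, 9])             # successes per Binomial row

# Row-level pointwise log-likelihood: one column per Binomial row (S x 3).
log_lik_rows = np.column_stack([
    log(comb(n, k)) + k * np.log(p_draws) + (n - k) * np.log1p(-p_draws)
    for n, k in zip(n_trials, successes)
])

# Trial-level pointwise log-likelihood: one column per Bernoulli trial (S x 37).
y_trials = np.concatenate([
    np.repeat([1, 0], [k, n - k]) for n, k in zip(n_trials, successes)
])
log_lik_trials = np.where(y_trials == 1,
                          np.log(p_draws)[:, None],
                          np.log1p(-p_draws)[:, None])
```

Either matrix can be handed to a LOO routine; they answer different predictive questions because the "points" being left out differ.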


Ah, so in that case it might be best for me to fit the model using the binomial data but then compute the LOO scores using the data where each row is a single trial, right?


If your interest is in the predictive performance for a holdout consisting of a single trial, then yes. However, if the number of trials varies due to processes that aren’t of predictive interest, then I think you probably want to hold out entire rows. Consider the following limiting case:

Two rows of the binomial regression are extremely data-rich, forcing the linear predictor to pass directly through the link-scale locations given by those two points. The remainder of rows are myriad, but in total account for a small fraction of the total number of Bernoulli trials. The regression line doesn’t actually fit these remaining data very well at all. Now, LOO-CV where the holdout is a single Bernoulli trial suggests that the model is performing very well, because it’s sitting at precisely the optimal probability for the majority of the Bernoulli trials in the dataset. But LOO-CV where the holdout is a Binomial row concludes that the model is performing very poorly (actually, PSIS LOO fails because of the extreme influence of the two data-rich points, but if you then implement full leave-one-out cross validation by brute force, you see that predictive performance is poor).

So if your predictive goal is to predict additional Bernoulli trials within the same groups that are already being used in the model, with additional observations expected to arrive in each group in proportion to the number of Bernoulli trials that exist in the existing dataset, then you want to do LOO on the Bernoulli trials. But if your goal is to predict the Bernoulli probability associated with observations from a new subject, then you want to do LOO on the Binomial rows.
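The contrast between the two holdout definitions can be computed exactly for a conjugate Beta-Binomial model, where both leave-one-trial-out and leave-one-row-out predictive densities have closed forms (the data and the Beta(1, 1) prior below are made up; this is a sketch of the distinction, not of the poster's model):

```python
from math import lgamma, log, comb

# Hypothetical data: successes k and trials n per row.
k = [13, 2, 9]
n = [20, 5, 12]
a, b = 1.0, 1.0          # Beta(1, 1) prior on the common probability

K, N = sum(k), sum(n)

def log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)

# Exact leave-one-trial-out: hold out a single Bernoulli trial.
# Under conjugacy, P(y = 1 | rest) = (a + K - 1) / (a + b + N - 1).
elpd_trials = (K * log((a + K - 1) / (a + b + N - 1))
               + (N - K) * log((b + N - K - 1) / (a + b + N - 1)))

# Exact leave-one-row-out: hold out an entire Binomial row.
# The predictive is Beta-Binomial under the posterior from the other rows.
elpd_rows = 0.0
for ki, ni in zip(k, n):
    a_rest = a + (K - ki)
    b_rest = b + (N - K) - (ni - ki)
    elpd_rows += (log(comb(ni, ki))
                  + log_beta(a_rest + ki, b_rest + ni - ki)
                  - log_beta(a_rest, b_rest))

print(elpd_trials, elpd_rows)
```

Note the two elpd values are on different scales (one term per trial versus one term per row), so they are not comparable to each other; each answers its own predictive question.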


Thanks, this is really useful. In my case the binomial cases are election surveys, and the individual Bernoulli trials are whether a given voter voted for the incumbent party or not. I don’t care about election surveys as a unit, so I guess it makes more sense to compute it for the Bernoulli trials. (I’m only doing it this way because it’s a lot faster than using a Bernoulli likelihood, given how many rows I have.)


If the grouping truly doesn’t matter and all the Bernoulli trials are assumed iid irrespective of which survey they’re from (that is, there are no covariates or survey-specific terms in the model), you can collapse the entire dataset to a single row with a single binomial sufficient statistic for the purposes of fitting (i.e. the model is estimating a single Binomial proportion). If you are conditioning on characteristics of the survey (i.e. you have survey-specific covariates in your Binomial model) or if you have a survey-specific term (e.g. a random intercept), then I suspect you might be interested in the predictive value of the model in the population of possible surveys, rather than in the specific sample of surveys that you obtained. (Unless the population of surveys is stratified in some important way and the number of respondents is proportional to the population-scale weight of the stratification block that the survey is meant to represent.)
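Collapsing per-trial data to sufficient statistics can be sketched as follows (the arrays and grouping are hypothetical; with survey-specific terms you would collapse per survey rather than globally):

```python
import numpy as np

# Hypothetical per-trial data: one 0/1 outcome per voter, grouped by survey.
survey_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
voted_incumbent = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])

# If trials are assumed iid across all surveys (no covariates, no survey
# terms), the whole dataset collapses to one Binomial sufficient statistic.
k, n = int(voted_incumbent.sum()), voted_incumbent.size
print(k, n)   # 6 successes out of 9 trials

# If the model conditions on the survey, collapse per survey instead:
k_per_survey = np.bincount(survey_id, weights=voted_incumbent).astype(int)
n_per_survey = np.bincount(survey_id)
```

Fitting on the collapsed rows gives the same posterior as fitting on the trials, which is why the poster's speed trick (fit on Binomial rows, score on Bernoulli trials) is coherent.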
