Is LOO valid for models with missing outcome when using a complete case dataset that is a subset of the original data?

Let’s say I have a dataset with an outcome, time, gender, and predictor1. There a missings in the outcome and predictor1. I want to compare the following 2 models with LOO.

m1 ~ brm(outcome | mi() ~ time + gender, data=data)
m2 ~ brm(bf(outcome | mi() ~ time + gender + mi(predictor1)) + bf(predictor1 | mi() ~ 1), data=data)
loo(m1,m2)

This will give an error and then this warning, “NAs were found in the log-likelihood. Possibly this is because some of your responses contain NAs. If you use ‘mi’ terms, try setting ‘resp’ to those response variables without missing values. Alternatively, use ‘newdata’ to predict only complete cases.”

If I create a complete case dataset from the original dataset (ie drop all missings from ‘outcome’ and ‘predictor1’), is it valid to use that for the ‘newdata’ argument in LOO to compare the models??
I tried it, and it seemed to work, except that it also gave this warning, “Not all models have the same y variable. (‘yhash’ attributes do not match)”

Is that warning just related to imputing missings for predictor1 variable in the model formula? If so, can I safely ignore that warning?

Thanks!

2 Likes

Hi,
EDIT: should note first, that my understanding of cross validation and LOO is quite shallow, Aki (answering below) is the actual expert, so my advice is to be taken more carefully and with a grain of salt.

In most cases the data with missing outcome can be safely ignored when fitting the model (it appears the outcome is not used to impute anything else), as missing outcomes usually don’t provide any additional information for your models (while rows with missing predictors still can). See e.g. MICE missing values on the response for some discussion.

Just to be clear - you used that same argument as newdata when computing the loo for both models? Or just for the one with the additional missing predictor? If so, then I think it might be mostly OK, but I think you are not likely to get any strong guarantees - the observations with missing predictor (especially if there are many of them) may have important influence on your posterior that is getting lost when you ignore them. So I would treat the results at best as a quick heuristic. I think that you could in principle also calculate the log likelihood for the rows where the predictor is missing by integrating out the parameter representing the imputed value - this would however need to be done manually. (I did that once with a very different type of nuisance variables and the results looked sensible, but I didn’t do a deep investigation). You AFAIK cannot just use the likelihood given the imputed value as it is basically certain that the imputed value depends strongly on the particular observation for that row, breaking the assumptions loo makes and giving you large pareto k.

Best of luck with your model!

3 Likes

Just adding that this is true for the models in the first post and for many other models, but if the prior structure depends on the locations of observations it can be the case that missing data can’t be simply ignored (see, e.g., section 3.1 in
Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models.

1 Like

Thanks @martinmodrak and @avehtari for the responses.
My model formulation is actually a bit more complex than the example I gave here, and relates to this question Missing data imputation for outcome missing for all rows for certain timepoints - #5 by jsocolar
I am imputing missing outcomes so that I can keep those rows of data which will help me better impute missings for those predictors.

Just to be clear - you used that same argument as newdata when computing the loo for both models?

Yes, that is correct.

If so, then I think it might be mostly OK

Hmm…that’s not super reassuring haha. So is “integrating out the parameter representing the imputed value” the only solution for comparing models via ‘information criterion’ when I impute missing data? If I don’t do that (honestly not sure how), then is using the complete case dataset in the newdata argument the best I can do?

But you give the same newdata for both models?

Yes, same newdata for both models. I was wondering if that warning was just related to imputing missings for predictor1 variable using the formula bf(predictor1|mi() ~ 1)

I honestly don’t really know. As I said, I once tried doing the integration and it seemed to work, but I am not completely sure it is sound, I’d ask @avehtari to confirm this, if he has time.

The main idea is that when you observe y which depends on a parameter r (here the imputed value), and the joint likelihood (using \pi to denote “density of”) decomposes as:

\pi(y, r | \theta_1, \theta_2) = \pi(y | r, \theta_1)\pi(r | \theta_2)

then the marginal for y can be obtained as

\pi(y | \theta_1, \theta_2) = \int_{-\infty}^{+\infty} \pi(y | r, \theta_1)\pi(r | \theta_2) \mathrm{d}r

here \pi(y | r, \theta_1) is the observational model (depends on the imputed value r and some other parameters) and \pi(r | \theta_2) is the imputation model (which once again depends on other parameters). Note that for loo you need to get the logarithm of the density.

Since this is a 1D integral, it is quite likely it can be solved relatively efficiently by Stan’s integrate_1d or any other standard integration method. I.e. for each sample, you take all the parameters of the linear predictors for the imputed value and the main response, build a function that expresses the (non-log) likelihood of the observed value as a function of the imputed value times the (non-log) likelihood of the imputed value and then integrate this function with respect to the imputed value. The log of the result is IMHO the log-likelihood you want to use for loo.

Hope I didn’t mess up :-)

1 Like

Probably, as there is separate model for that with its own data.

In theory, what you try should be correct, but there is a small probability that way the imputation is implemented in brms that in this case it’s not correct. Unfortunately I’m not able to look at this in more detail before end of July.

1 Like

Ok, thank you for your responses and @martinmodrak for yours. I appreciate it.