Log_lik for [estimated] missing data?

Hey all,

In a future project, I plan not to use listwise deletion, but instead to include missing values as parameters to be estimated.
I'll figure the code out when the time comes, but I'm curious how to handle the log_lik statements needed for LOOIC or other model fit metrics.

Generally, you just declare vector[N] log_lik in generated quantities, compute the log-likelihood of each observation, and store it in that vector.
But what do you do if you model missing data? Do you compute log_lik only for the observed data, or do you include log_lik for the missings as well?
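For reference, the usual fully observed pattern looks something like this (a minimal sketch assuming a simple normal model; y, mu, and sigma are illustrative names, not from any specific model):

```stan
generated quantities {
  vector[N] log_lik;
  for (n in 1:N)
    // pointwise log-likelihood of each observation, for loo/LOOIC
    log_lik[n] = normal_lpdf(y[n] | mu[n], sigma);
}
```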

Context: The model will be an SEM-type model, where inevitably some people will not answer every scale's items. So far this number has been small enough that I could just drop the 2-3 cases that didn't fully answer the scales, but now I plan to include all /available/ responses in the model, and handle missing responses by constructing a full data matrix from the observed data and the estimated missing values, then running the model on that full matrix. I'll want to compute some model fit statistics using the joint likelihood of the data, but I have no clue whether to include the missing, estimated observations in the log-likelihood estimates. Thoughts?
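One way to set up the observed/missing split is the pattern from the Stan manual's missing-data chapter, sketched here for a univariate normal outcome (index vectors and variable names are illustrative; an SEM would replace the simple likelihood with its own):

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  array[N_obs] int<lower=1> ii_obs;  // positions of observed entries
  array[N_mis] int<lower=1> ii_mis;  // positions of missing entries
  vector[N_obs] y_obs;
}
parameters {
  vector[N_mis] y_mis;               // missing values treated as parameters
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N_obs + N_mis] y;           // full data vector, observed + missing
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  y ~ normal(mu, sigma);             // likelihood over the full vector
}
```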

You include the log likelihood for the missing data, too. That and the prior control how it will be imputed. There's a chapter in the manual on how to code missing data in Stan.

Alright, so just to clarify: when computing log_lik[n] for LOOIC, you would compute log_lik[n] for the missing observations as well?

I have no idea about LOO or what LOOIC is. Usually you wouldn’t compare missing observations under cross-validation, so maybe they don’t use them under LOO.

And we are talking missing data, not just latent parameters that go with the data? Such latent parameters need to be marginalized out in order to produce the usual notion of likelihood.

If you want to compute LOO, then in the log_lik computation in generated quantities include only the observed values. If you included the missing values, those terms would correspond to the self-predictive approach (Section 5.2.3 of "A survey of Bayesian predictive methods for model assessment, selection and comparison"). Self-predictive log densities are highest for the narrowest predictive distributions, so you could use them to examine how your alternative models differ, but the self-predictive approach can be optimistic because it cares only about how narrow the predictive distribution is and not about where it is located (and with missing data you don't have an observation against which to compare the location).
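Concretely, the generated quantities block would then loop only over the observed entries, e.g. (a sketch assuming a simple normal model with parameters mu and sigma and observed values y_obs; names are illustrative):

```stan
generated quantities {
  vector[N_obs] log_lik;   // one entry per *observed* data point only
  for (n in 1:N_obs)
    log_lik[n] = normal_lpdf(y_obs[n] | mu, sigma);
}
```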

Aki

Perfect! That was my intuition (I didn't think it would make sense to compute a leave-one-out approximation error for unobserved variates), but your answer was very helpful.

Thank you.