first of all many thanks for the great Stan ecosystem which we are using extensively in our research.

Here, I have a question about the loo package. I was wondering about how the standard error (SE) of the elpd_loo estimate is computed. Following eq. (22) of the paper https://arxiv.org/pdf/1507.04544.pdf the error of the estimate should be about \sqrt(\sum_i Var(\widehat{elpd}_{loo,i}). What I have trouble with is linking this with eq. (23) which makes two strong assumptions

Estimation errors for individual observation likelihoods, i.e. for each i are equal (and independent)

The estimation variance can be obtained from the spread of estimates
across observations, i.e. as V_{i=1}^n \widehat{elpd}_{loo,i} in the notation of the paper.

Especially this second point does not make sense to me. Just consider two data points, i.e. i = 1, 2. Now, the first point is predicted much better than the second \log p(y_1 | y_2) \gg \log p(y_2 | y_1). Further you can estimate both predictive likelihoods with high precision. Then, the precision of the total predictive likelihood – which is just the sum over all observations – should be high as well. Yet, the large spread across data points gives rise to a high SE according to the formula in the paper and the loo package.

Any thoughts on this would be highly appreciated. With best regards,

Likelihood is a function of parameters. \widehat{\mathrm{elpd}}_{\mathrm{loo},i} are log predictive densities (or log predictive probabilities).

Eqs 22 and 23 do not include estimation errors for individual \widehat{\mathrm{elpd}}_{\mathrm{loo},i}. We do mention in the paragraph starting at the bottom of page 19 and continuing at the top of page 20, that we could compute the Monte Carlo estimation errors for individual terms, too, and loo >2.0.0 is computing that, too using the equations described in [1507.02646] Pareto Smoothed Importance Sampling. These Monte Carlo estimation errors for individual terms are often negligible compared to the uncertainty of not knowing the future data distribution.

Terms \widehat{\mathrm{elpd}}_{\mathrm{loo},i} are assumed to be independent, although they are not. This is mentioned in the last paragraph of section 5.1. With large n and well specified models the error from independence assumption is small.

The sd we report (computed as eqs 22 an 23) is the sd of a distribution describing about our uncertainty what would be the predictive performance for a new observation coming from the same distribution as the observed data. In M-open we don’t trust any model to present the future data distribution and thus we use the observed data as pseudo Monte Carlo draws from that distribution without further model assumptions. Each \widehat{\mathrm{elpd}}_{\mathrm{loo},i} presents a potential future outcome when using the model to make prediction and we would observe y_i (and leaving that observation out of posterior computation to not double use the data). Eq 23 is usual Monte Carlo error estimate for the expectation over the future data distribution.

Sorry, I didn’t understand what you try to say in this part, but maybe it’s due to the confusion on the terms?

Thanks, that clarifies the issue for me. Indeed, I was considering the log predictive densities as a statistics that is to be estimated on the actual data. Then, only the Monte Carlo estimation errors should matter. Instead, you view the log predictive densities as a sampling distribution for predictions on future data which admittedly is much more useful and – as you note – has much higher uncertainty.