Yes. Following BDA3 notation, loo_i corresponds to \log p_{\text{post}(-i)}(y_i) (Equation 7.14), and therefore the loo value (labeled elpd_loo in the custom printout if using the log scale) is the sum of all loo_i values. Note that ArviZ allows choosing which scale to use, and both loo and loo_i include the scale “correction”, so if you use the negative log scale loo_i will be -\log p_{\text{post}(-i)}(y_i) instead.
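As a quick check of this in code (a minimal sketch; the attribute names elpd_loo and loo_i match recent ArviZ releases and may differ slightly in older ones, and the dataset is just the bundled centered_eight example):

```python
import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")

# pointwise=True also returns the individual loo_i values
res = az.loo(idata, pointwise=True, scale="log")
print(res)  # custom printout with elpd_loo, p_loo, etc.

# the reported elpd_loo is just the sum of the pointwise values
np.testing.assert_allclose(res.loo_i.values.sum(), res.elpd_loo)

# with the negative log scale both the total and loo_i flip sign
res_neg = az.loo(idata, pointwise=True, scale="negative_log")
np.testing.assert_allclose(res_neg.loo_i.values, -res.loo_i.values)
```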
It looks like it; my guess is that it comes from the subsection title on pages 167-168 where elpd and lppd are defined. Given that the same docstring does use “density” for loo, we should fix the loo_i description.
\log p_{\text{post}(-i)}(y_i) = \log p(y_i | y_{-i}), right? This is notation from BDA3 that I find a bit silly, since it doesn’t save space but adds confusion.
In BDA3 p.168 (e.g. Equation 7.4), it seems like the extra “p” for “pointwise” is supposed to mean that we sum over the datapoints. It isn’t stated explicitly, but my guess is that \text{lppd} = \sum_{i=1}^n \text{lpd}_i. So Equation 7.14 is also \text{lppd}_\text{loo} = \sum_{i=1}^n \text{lpd}_{\text{loo},i}, where your loo_i is \text{lpd}_{\text{loo},i} and your elpd_loo is \text{lppd}_\text{loo}, which is an estimate of the elpd, so it can also be called \widehat{\text{elpd}}_{\text{loo}}.
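To collect my reading of the relationships in one place (same quantities as above, just written together):

\text{lppd} = \sum_{i=1}^n \text{lpd}_i, \qquad \widehat{\text{elpd}}_{\text{loo}} = \text{lppd}_{\text{loo}} = \sum_{i=1}^n \text{lpd}_{\text{loo},i}, \qquad \text{lpd}_{\text{loo},i} = \log p_{\text{post}(-i)}(y_i) = \log p(y_i \mid y_{-i})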
I think in BDA3 p.167 they use “predictive accuracy” as the more general term, for which “predictive density” is one example? But then throughout the next pages they do use the terms interchangeably.
Due to its generality, we use the log predictive density to measure predictive accuracy in this chapter.
Yes again, I am not sure I understand all the notation used but I think everything you wrote is correct.
The extra p that stands for pointwise in the names refers to the way the results are calculated, indicating, as you point out, that the calculation consists in summing over datapoints. pointwise in ArviZ refers to returning intermediate computation results, because everything we do is pointwise in the calculation sense just mentioned, so sometimes the “extra” p is “lost” from names and notation, as also happens in the link below.
Some notes on pointwise vs non-pointwise computation
Our goal is actually to calculate the elppd (Equation 7.2 in BDA3, called elpd in the paper I linked, Equation 1) because it is better suited in practice than what is defined as elpd in BDA3 (Equation 7.1) which is generally impossible to calculate.
I believe the first paragraph of “Estimating out-of-sample predictive accuracy using available data” (in BDA3, Section 7.2) gives a good explanation of why elpd is no longer considered throughout the section. Side note: I honestly still don’t understand whether AIC and DIC technically aim for elpd or elppd, but I don’t really care enough to find out, so we may consider elpd when using AIC and DIC, but we definitely don’t consider it in the more recent paper.
We assume elppd is a good approximation of elpd, with the caveat stated in BDA3:
We know of no approximation that works in general, but predictive accuracy is important enough that it is still worth trying
Some notes on pointwise values vs summed value
The use of these intermediate results, which are not given a specific name, follows the proposal in [1709.01449] Visualization in Bayesian workflow, where they are called “expected log predictive densities (ELPD) for the individual data points”, “pointwise ELPD values” and “\text{ELPD}_i”. In ArviZ we use “pointwise ELPD values”, “pointwise predictive accuracy” (which should be changed to pointwise predictive density as you pointed out), and we have plot_elpd, which plots “pointwise ELPD differences” and labels them “ELPD” without the subscript i. IIUC your notation, I would therefore use \text{elpd}_{\text{loo},i} to indicate that we are calculating elppd, so we add expectations instead of taking the expectation over the whole (un)observed dataset as in elpd.
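For reference, a minimal plot_elpd call (a sketch using the example datasets bundled with ArviZ; any two fitted models with a stored log_likelihood group would do):

```python
import arviz as az

# two example models fitted to the same data (bundled ArviZ datasets)
centered = az.load_arviz_data("centered_eight")
non_centered = az.load_arviz_data("non_centered_eight")

# plots the pointwise ELPD differences between the two models,
# one marker per observation, labeled "ELPD" without the subscript i
az.plot_elpd({"centered": centered, "non_centered": non_centered})
```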
That is what I understand, which is why I think we should change the ArviZ docs. If you are interested in updating the docs, we can guide you in submitting a PR.
Our goal is actually to calculate the elppd (Equation 7.2 in BDA3, called elpd in the paper I linked, Equation 1) because it is better suited in practice than what is defined as elpd in BDA3 (Equation 7.1) which is generally impossible to calculate.
My understanding is that anything with the “e” (both elpd and elppd) is impossible to calculate without knowing the true distribution (which BDA3 calls f); we can only hope to estimate it.
This estimation seems to involve two sources of error:
Yes. It’s easy to complain about notation but very difficult to explain everything accurately; thanks for all the questions, as it’s being very useful to have to write this down instead of living in the realm of ideas. The dependence on f is always there, for both elpd and elppd; in fact, for independent observations, elpd and elppd are equivalent.
The names in the bullet points are much better than my proposal, I may even take those and modify ArviZ’s source code.
I have one comment on the error sources. Point 2 includes both the Monte Carlo and the PSIS/importance sampling error. We have sampled from the posterior conditioned on y, and we can marginalize over \theta by summing over the posterior samples, which we do to get from the posterior to the posterior predictive. This calculation is an approximation because it depends on the MCMC samples. But to get \log p(y_i|y_{-i}) we use PSIS to estimate the effect of conditioning on y_{-i} instead, which introduces an extra source of error. This last one luckily has \hat{k} as a diagnostic.
If the model is refitted n times in order to avoid needing PSIS to “condition” on y_{-i}, we still have the cross-validation error and the Monte Carlo error to take into account, even if this is often called “exact” cross-validation! The same thing happens when there are bad \hat{k} and we can’t trust the PSIS approximation: we can use IWMM or reloo as a workaround for PSIS not working, but then we need to be careful that the refits converge to get accurate results; there are still cv and mcmc errors to take into account.
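A small sketch of how the \hat{k} diagnostic surfaces in ArviZ (the 0.7 threshold is the usual rule of thumb, and the dataset is just the bundled example):

```python
import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")
res = az.loo(idata, pointwise=True)

# pareto_k holds one k-hat value per observation; values above ~0.7
# flag observations where the PSIS approximation cannot be trusted,
# i.e. the candidates for reloo / moment-matching refits
bad = np.where(res.pareto_k.values > 0.7)[0]
print("observations needing a refit:", bad)
```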
We can imagine 3 possible Monte Carlo estimators based on these integrals (a toy numerical sketch follows the list):
a. Monte Carlo estimator with samples from p(\theta | y_{-i}) is \frac{1}{S} \sum_{s=1}^S p(y_i|\theta^s)
b. Monte Carlo (raw) Importance Sampling estimator with samples from p(\theta | y) is \frac{\sum_{s=1}^S r_i(\theta^s) p(y_i|\theta^s)}{\sum_{s=1}^S r_i(\theta^s)}
c. Monte Carlo Pareto Smoothed Importance Sampling (PSIS) estimator with samples from p(\theta | y) is \frac{\sum_{s=1}^S w_i(\theta^s) p(y_i|\theta^s)}{\sum_{s=1}^S w_i(\theta^s)}
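Here is the toy numerical sketch of the three estimators referenced above (my own example, not from the thread: a conjugate normal model so that both posteriors can be sampled exactly; arviz.psislw provides the Pareto smoothing for estimator c):

```python
import numpy as np
import arviz as az
from scipy import stats

rng = np.random.default_rng(0)

# toy conjugate model: y_j ~ Normal(theta, 1), theta ~ Normal(0, 10^2)
y = rng.normal(0.5, 1.0, size=20)
S, i = 4000, 0  # number of posterior draws, held-out observation index

def posterior_draws(data, size):
    # exact Normal-Normal conjugate posterior for theta
    prec = 1 / 10**2 + len(data)
    return rng.normal(data.sum() / prec, np.sqrt(1 / prec), size=size)

# (a) plain Monte Carlo with draws from p(theta | y_{-i})
theta_loo = posterior_draws(np.delete(y, i), S)
est_a = np.mean(stats.norm.pdf(y[i], theta_loo, 1.0))

# (b) raw importance sampling with draws from p(theta | y),
#     using ratios r_i proportional to 1 / p(y_i | theta)
theta = posterior_draws(y, S)
lik = stats.norm.pdf(y[i], theta, 1.0)
log_r = -stats.norm.logpdf(y[i], theta, 1.0)
r = np.exp(log_r - log_r.max())
est_b = np.sum(r * lik) / np.sum(r)

# (c) PSIS: Pareto-smooth the raw log ratios, then reweight the same draws
log_w, khat = az.psislw(log_r[np.newaxis, :])
w = np.exp(log_w[0])
est_c = np.sum(w * lik) / np.sum(w)

print(est_a, est_b, est_c, khat)
```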
Is your point that the importance sampling estimators (b and c) have higher Monte Carlo error than estimator a? Though I think that estimator c (PSIS) should have less error than estimator b (raw IS).
Using importance ratios (either raw or PSIS) adds an extra approximation layer, which doesn’t always hold. I think this is a different source of error that adds to the Monte Carlo one. Both b and c have this limitation; even if PSIS is much more stable than raw importance sampling, it is still limited to cases where the posteriors p(\theta|y) and p(\theta|y_{-i}) are similar. If the posteriors differ too much, either IWMM or reloo (which is approach a, but only for those observations where PSIS fails) is needed to get good estimates, even if the mcse is negligible. More info on computational methods for loo: Cross-validation FAQ; and more on \hat{k} estimates: LOO package glossary — loo-glossary • loo
We need to be able to convert the expectation over a multidimensional \tilde{y} into a sum of pointwise expectations. I am not sure I follow what you mean by identically distributed observations; doesn’t that seem more restrictive?
I added a link on that (same as above); I am afraid I am not yet able to explain it in a clear, concise or organized manner. Also worth noting that it is already implemented in the loo package but not yet in ArviZ.
Thanks @OriolAbril for answering most of the questions (AIC and DIC work pointwise)
More accurately
Replacing \mathrm{E}_{p(\tilde{y})}[\log p(\tilde{y}|y)] with \frac{1}{n}\sum_{i=1}^n \log p(y_i | y_{-i})
Note that the LHS doesn’t have i and the RHS is an empirical mean of terms \log p(y_i | y_{-i}). That is, the unknown distribution p(\tilde{y}) is approximated with a point-mass distribution at the locations y_i (we can also include the uncertainty about how much mass each location should have, e.g. using a Dirichlet distribution). The uncertainty from having only a finite number of y_i can be called cross-validation uncertainty and is discussed in detail in Uncertainty in Bayesian leave-one-out cross-validation based model comparison (also a 30min video).
Now you could write \text{elpd}_i as you did, but then that expectation is estimated using just one value, which is not that nice.
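A toy sketch of the Dirichlet-weighting idea mentioned above (my own illustration: the loo_i values are placeholders, and with real pointwise values this gives a Bayesian-bootstrap style picture of the cross-validation uncertainty):

```python
import numpy as np

rng = np.random.default_rng(1)

# placeholder pointwise values log p(y_i | y_{-i}); use your real loo_i here
loo_i = rng.normal(-1.2, 0.4, size=50)
n = loo_i.size

# uncertain mass on each location y_i via uniform Dirichlet weights
weights = rng.dirichlet(np.ones(n), size=4000)

# each draw is n * sum_i w_i * loo_i, i.e. the total under one weighting
elpd_draws = n * weights @ loo_i

print(elpd_draws.mean(), elpd_draws.std())  # point estimate and its spread
```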
IS is also a Monte Carlo approach, so Monte Carlo error is fine for the combination.
There’s a cool example in the Implicitly adaptive importance sampling paper, demonstrating that the Monte Carlo error of MCMC refits can be quite big compared to truly exact computation.
Sorry, it’s our fault. As I said, it’s not technically incorrect, but it’s not a useful presentation. Books are annoying, as the release cycle is so slow and it’s difficult to get the information to readers when the online version is updated. The latest BDA3 version of Chapter 7 has a footnote: “P.S. Instead of this Section we recommend to read Vehtari, Gelman, and Gabry (2017).” Maybe we should do more edits on that chapter to reduce the confusion.
The pointwise is an extra qualifier. lpd is a generic term that can refer to either the log pointwise predictive density or the log joint predictive density. As loo considers only the pointwise case, that second p is not repeated everywhere. This is the usual way in stats: something that is always the same, or should be obvious from the context, is not explicitly written. In the new paper we have also tried to review the more exact notation.
You are very good at spotting typos and inconsistent notation! We should send all our papers to you for proof-reading 😆
On the last line, there shouldn’t be an i. Again, it would be nice to be able to make a pull request that could fix that in the journal version too.
Thanks! It looks like there is a typo in BDA3, still in the latest online version. On page 172, \widehat{\text{elpd}} is used for both AIC and DIC as if it were not pointwise, whereas on page 174 \widehat{\text{elppd}} is used for WAIC. I think all three should use the same term.
I was trying to emphasize the fact that PSIS is not the only source of error, and couldn’t find better terms. I think we should try to be explicit about this double source of Monte Carlo error, especially given how widespread the use of “exact” is to describe reloo/kfold computational approaches in all our docs:
All of these use “exact” to indicate that PSIS is not used, but I think they can easily be interpreted as “has no Monte Carlo error”. The CV-FAQ also uses “naive” and “brute-force” for K-fold; maybe one of these two could be used as a replacement for “exact”?
I’ll change these (I had hoped that the footnote would be enough and haven’t tried to change the online version too much, but I’ll do this one).
Even if the inference itself is not exact, the cross-validation given that inference can be exact. If you want to know the predictive performance given MCMC for the full posterior, then exact LOO will also use MCMC. If you want to know what the predictive performance would be given exact inference for the full posterior predictive density, then MCMC is not enough. So “exact cross-validation” and “exact posterior inference” are different things.
cross-validation error: replacing \text{elpd} = E_{p_{\text{true}}(\tilde{y})}[\log p(\tilde{y}|y)] \overset{\text{assuming our model is exchangeable}}{=} \sum_{i=1}^n E_{p_{\text{true}}(\tilde{y}_i)}[\log p(\tilde{y}_i|y)] with \text{lpd}_{\text{loo}} = \sum_{i=1}^n \log p(y_i | y_{-i})
EDITED (based on @avehtari’s correction below) from “iid” (independent and identically distributed) to “exchangeable”, which implies “id” (identically distributed) given covariates, see BDA3 p.5:
It follows from the assumption of exchangeability that the distribution of y, given x, is the same for all units in the study in the sense that if two units have the same value of x, then their distributions of y are the same.
Monte Carlo error: estimating \text{lpd}_{\text{loo}}, which can be done using exact LOO/cross-validation with samples from p(\theta|y_{-i}), or using (PS)IS with samples from p(\theta|y), as I describe here.
I’m also looking at Bates, Hastie, Tibshirani (2021), where I think their \text{Err}_{XY} is analogous to your \text{elpd}(M_k | y^{\text{obs}}) = E[\log p(y|y^{\text{obs}}) | y^{\text{obs}}] in equation (1), and their \text{Err} = E[\text{Err}_{XY}] is analogous to your \text{e-elpd}(M_k) = E[\text{elpd}(M_k | y^{\text{obs}})] = E[\log p(y|y^{\text{obs}})] in equation (2)? If I’m understanding the iterated expectation correctly, maybe equation (2) might be clearer if written with y^{\text{obs}} instead of y, for consistency with equation (1)?
I agree different levels of error can be confusing.
This would be the exact future predictive performance (assuming stationarity)
(also, instead of iid, exchangeability is sufficient)
I tried to say that it matters whether the same computational method is used for p(\tilde{y} | y) and p(y_i | y_{-i}). MCMC is not exact and can sometimes have significant pre-asymptotic bias before the CLT kicks in. If you could compute the LOO predictive densities exactly (without MCMC), they would not match the error MCMC can make, and thus if you want to estimate the predictive accuracy given, e.g., 4000 MCMC draws used to estimate p(\tilde{y} | y), then you would like to compute LOO also with 4000 MCMC draws from each LOO-posterior. There’s an example showing the difference between exact inference and MCMC inference for the full posterior predictive density in Implicitly adaptive importance sampling.
Correspondingly, if we used VI for the full posterior and to make predictions, we would like to compute LOO with VI run for each LOO-posterior.
Now, if running MCMC for the LOO-posteriors is approximated with PSIS, there is a mismatch between the methods and there can be additional error.
So there would be three sources of error:
1. cross-validation error
2. Monte Carlo error (or other inference error)
3. mismatch from using different computation for the full and the LOO predictive densities
We’re mostly able to estimate the variance part of these errors, and we can estimate the bias in 1, but estimating the possible bias from 2 and 3 is difficult without a lot of extra computation.
It would be easier to just say yes, but the true answer is that we don’t need to, in the same way that in Bayesian inference we usually don’t need to assume i.i.d., only exchangeability. We also try to say this in the paper with different words.
I corrected my post above from “iid” (independent and identically distributed) to “exchangeable”, which implies “id” (identically distributed) given covariates.
Thanks also for drawing my attention to a third source of error: mismatch between computation methods. I had not yet read your excellent paper that shows this, Implicitly adaptive importance sampling. Figure 2 compares the analytical/exact \text{lpd}_{\text{loo},i} to the computed (by various MCMC methods: PSIS, naive LOO, AIS) \widehat{\text{lpd}}_{\text{loo},i}. (The paper writes these as “elpd”, but I prefer to drop the “e” which is for expectation over the true distribution of future data.)
cross-validation error: \text{elpd} = E_{\tilde{y}}[ \log p(\tilde{y} | y)] \overset{\text{exchangeability}}{=} \sum_{i=1}^n E_{\tilde{y}_i}[ \log p(\tilde{y}_i | y)]
vs \text{lpd}_{\text{loo}} = \sum_{i=1}^n \log p(y_i | y_{-i})
with p() as the exact/analytical computation.
Monte Carlo error: p(y_i | y_{-i})
vs \hat{p}^{\text{MCMC naive}}(y_i | y_{-i})
with the naive approach to LOO-CV that fits the model with MCMC separately for each fold.
Computational mismatch error: \hat{p}^{\text{MCMC naive}}(y_i | y_{-i})
vs \hat{p}^{\text{PSIS}}(y_i | y_{-i})
with the PSIS approach that uses samples from the full posterior.
But suppose our quantity of interest is the elpd of the fit using the algorithm we use to make predictions: E_{\tilde{y}}[ \log \hat{p}^{\text{MCMC}}(\tilde{y} | y)]. Does the second (Monte Carlo) error source go away?
Not completely, but it’s often negligible. We show the Monte Carlo error for PSIS-LOO since there that error can be bigger, although in practice it’s either negligible or we show NA when we can’t estimate it reliably.