I’m trying to understand key aspects of this paper without a stats/math degree, so I hope pedagogically minded members can chip in on my dumb questions. Here’s the first one:
Figure 4 (p. 17) shows different scales for the first and second columns, which represent estimated vs. true elpd_diff, respectively. Both the caption and the y-axis label suggest that this axis represents the mean of elpd_diff divided by its SD (“relative mean”) over simulated iterations. But it doesn’t make sense to me that the true and estimated elpd_diff could differ 30-fold. What’s the reason for the scale difference? Is it because in a simulation setting, true elpd_diff can be exactly calculated and therefore has no SD to divide by?
But on the same page, the text reads “When \beta_\Delta \neq 0, the relative mean of both |elpd| and |elpd_loo| grows infinitely.” So the true elpd_diff apparently does have a relative mean, and hence an SD. What am I missing? Is it just a detail assumed to be so self-evident that the authors prefer to save space by not spelling it out?
Also, am I interpreting the top-left panel of the plot correctly, i.e. that when true elpd_diff is 0, its estimator will always subtly favor the simpler model even when the sample size is large?
It’s because the standard deviation of the estimator is much larger than the standard deviation of the true value. There’s still variation in the true value because the elpd depends on the dataset that gets drawn.
This whole thing is in a section about the asymptotic behavior as n gets large. The indefinite growth is what happens in the large-n limit.
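In case it helps to see the arithmetic: here is a toy sketch (made-up numbers, not the paper’s simulation code) of why dividing roughly the same mean by very different SDs produces very different “relative means”:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_mean(x):
    """Mean across simulation repetitions divided by the SD across repetitions."""
    return x.mean() / x.std(ddof=1)

# Made-up numbers: pretend 2000 simulation repetitions in which the true
# elpd_diff and its LOO estimate have about the same mean, but the estimate
# has a much larger spread across repetitions (each repetition draws a new y).
true_diff = rng.normal(loc=-0.5, scale=0.1, size=2000)            # true value varies only a little with y
loo_diff = true_diff + rng.normal(loc=0.0, scale=3.0, size=2000)  # the estimator adds a lot of noise

print(relative_mean(true_diff))  # large in magnitude: small SD in the denominator
print(relative_mean(loo_diff))   # much closer to zero: large SD in the denominator
```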
The true elpd_diff isn’t zero, but I think you’re getting burned by the scaling of the y-axis.
If I understand which line you’re wondering about, the gap looks smaller on the right, not the left.
The gaps on the left and right aren’t literally equal, because of the different SDs. But that zero-coefficient line should be positive on the right-hand side. The true model has the coefficient set to zero, so the model that constrains the coefficient to zero should on average provide better out-of-sample prediction than a model that does not constrain it.
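For what it’s worth, here is a small plug-in/maximum-likelihood caricature of that last claim (not the paper’s Bayesian setup; all names and numbers are mine): a model that estimates a truly-zero coefficient pays a small out-of-sample penalty on average.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, n_test, reps = 128, 128, 500
diffs = []

for _ in range(reps):
    # True DGM: the extra covariate has zero effect (beta_delta = 0), noise sd = 1.
    x, x_test = rng.normal(size=n), rng.normal(size=n_test)
    y = 1.0 + rng.normal(size=n)
    y_test = 1.0 + rng.normal(size=n_test)

    # "Simple" model: intercept only (coefficient constrained to zero).
    mu_simple = y.mean()
    # "Complex" model: intercept + slope, fit by least squares.
    slope, intercept = np.polyfit(x, y, 1)

    # Plug-in log predictive density on fresh data (noise sd treated as known).
    lpd_simple = norm.logpdf(y_test, loc=mu_simple, scale=1.0).sum()
    lpd_complex = norm.logpdf(y_test, loc=intercept + slope * x_test, scale=1.0).sum()
    diffs.append(lpd_simple - lpd_complex)

# Slightly positive on average: the constrained model predicts a bit better,
# because the free coefficient only adds estimation noise.
print(np.mean(diffs))
```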
Figure 13 on p. 27: What do the lowercase letters in the panel titles mean? Experiment 1b, 1c etc.
The descriptions of the six experimental settings on pp. 22–25 don’t seem to mention any “subtypes” of the six settings. And the description of the figure in the main text just says that it “compares the normal uncertainty approximation for data size n = 128, with a non-shared covariate effect \beta_{\Delta} = 0.5.” I’ve thus far failed to find any gloss or explanation of what the small letters mean.
Yes. The labelling of the experiments was simplified at some point to be 1-6, but we forgot to update this figure. Thanks for mentioning this! Ping @mans_magnusson
Since we got into the business of pointing out possible errata, I was somewhat confused by the fact that the rightmost definitions of \text{elpd}(\text{M}_k~|~y) seem to differ between Equation 1 (p. 3) and the Notation table (p. 5). The former uses \tilde{y}_i, the latter y_i.
It also looks to me like Equation 1 writes p_{\text{M}_{k}} when it means p_k. Overall, there’s a lot of vacillation all through the article on whether the _\text{M} in p_{\text{M}_{k}} is present or absent. Page 5 suggests to me that it should be absent throughout. But perhaps there’s a meaning difference that I’ve simply missed.
I’m now trying to understand the difference between \text{elpd}(\text{M}_a, \text{M}_b~|~y), whose estimation is the main topic of the article, and \text{e-elpd}(\text{M}_a, \text{M}_b), which is discussed only briefly. Here are my follow-up questions:
Does \text{elpd}(\text{M}_a, \text{M}_b~|~y) being “conditional on y” mean that it is conditional on the respective posteriors of \text{M}_a and \text{M}_b as estimated from this sample? That is to say, is \text{elpd}(\text{M}_a, \text{M}_b~|~y) always specific to two particular fits rather than just two particular models?
And if that’s the case, then isn’t \widehat{\text{elpd}}_{\text{LOO}}(\text{M}_a, \text{M}_b~|~y) a somewhat unsatisfactory estimator of \text{elpd}(\text{M}_a, \text{M}_b~|~y), given that it simulates resampling, refitting and retesting and is therefore not conditional on a particular fit? Isn’t it in fact true that \widehat{\text{elpd}}_{\text{LOO}}(\text{M}_a, \text{M}_b~|~y) makes more sense as an estimator of \text{e-elpd}(\text{M}_a, \text{M}_b) than of \text{elpd}(\text{M}_a, \text{M}_b~|~y), given that its calculation involves a lot of refitting?
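To make sure I’m chasing the right objects, here is how I currently read the definitions (writing p_{\text{true}} for the data-generating distribution; please correct me if I garble anything):

$$
\begin{aligned}
\text{elpd}(\text{M}_k~|~y) &= \sum_{i=1}^{n} \int p_{\text{true}}(\tilde{y}_i) \log p(\tilde{y}_i~|~y, \text{M}_k)\, \mathrm{d}\tilde{y}_i, \\
\text{elpd}(\text{M}_a, \text{M}_b~|~y) &= \text{elpd}(\text{M}_a~|~y) - \text{elpd}(\text{M}_b~|~y), \\
\text{e-elpd}(\text{M}_a, \text{M}_b) &= \operatorname{E}_{y}\!\big[\text{elpd}(\text{M}_a, \text{M}_b~|~y)\big], \\
\widehat{\text{elpd}}_{\text{LOO}}(\text{M}_a, \text{M}_b~|~y) &= \sum_{i=1}^{n} \big[\log p(y_i~|~y_{-i}, \text{M}_a) - \log p(y_i~|~y_{-i}, \text{M}_b)\big].
\end{aligned}
$$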
As the word “sample” is overloaded, can you clarify what you mean, e.g. a data sample or a posterior sample?
Specific to two particular models and one particular data set. It’s not clear what you mean by “fit”, but in the paper we assume that the computation is exact, or close enough that we are not considering the extra variation from the potential use of stochastic inference.
It’s not clear what you mean by these terms, but in LOO the training data sets in each fold are as close as possible to y and to each other, and thus it simulates conditioning on y. There are also approaches where the folds are independent of each other, which would simulate conditioning on random data, but then each fold is also very different from y (at the very least, it has to be much smaller).
Sorry about being unclear. I’ve now looked at relevant sections of the article some more, and here’s what I gather (trying to be clear this time):
1. \text{elpd}(\text{M}_a, \text{M}_b~|~y) compares the future utility of \text{M}_a’s and \text{M}_b’s fits to the sample at hand. That is: after being estimated from this particular dataset, how do the two models compare in their usefulness for predicting future observations from the same DGM?
2. \text{e-elpd}(\text{M}_a, \text{M}_b) compares the present and future utility of \text{M}_a and \text{M}_b in general, with limited interest in the usefulness of their posteriors as estimated from this particular dataset (which is only one of innumerable potential realizations of y).
3. \widehat{\text{elpd}}_{\text{LOO}}(\text{M}_a, \text{M}_b~|~y) is an unbiased estimator of both quantities, but it seems more natural as an estimator of \text{e-elpd}(\text{M}_a, \text{M}_b) because it involves repeated re-evaluations of the posterior “as if we were fitting the model to a new sample”.
With 1. and 2. I agree, but I don’t agree with 3., because the y_{-i} are close to y and to each other. If you would like repeated re-evaluations of the posterior “as if we were fitting the model to a new sample”, you would want to condition each repetition on data that are as independent of each other as possible (minimum overlap). You could do this, for example, by dividing the data into K folds and, unlike in the usual K-fold-CV, using only the kth fold for fitting; then you would have K independent “training” data sets. Of course, the test sets would then have high overlap, leading to other complications.
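If it helps, here is a tiny index-level sketch of the two fold constructions being contrasted in this thread (illustration only, not from the paper or the loo package):

```python
import numpy as np

n, K = 12, 4
idx = np.arange(n)

# LOO: each "training" set y_{-i} drops a single observation, so any two
# training sets share n - 2 of their n - 1 points. They are all close to y
# and to each other, i.e. they simulate conditioning on (nearly) this y.
loo_train_sets = [np.delete(idx, i) for i in range(n)]

# The alternative described above: split the data into K disjoint folds and
# fit on *only* the kth fold. The K training sets share no observations, so
# they mimic conditioning on independent (but much smaller) data sets.
independent_train_sets = np.array_split(idx, K)

print(loo_train_sets[0], loo_train_sets[1])                  # overlap in n - 2 indices
print(independent_train_sets[0], independent_train_sets[1])  # no overlap
```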