What does it mean that the "model reduces the variance" relative to CV?

In this video at 1:04:08 or so, the lecturer says “if we can trust our model assumptions, we can detect smaller effects because we get the additional benefit from the model (reduced variance) while CV has the assumption that we don’t trust our model and we are predicting the future data distribution kind of non-parametrically”

Could someone help me understand the gist of that? What does it mean to “trust our model assumptions”? Does it mean assuming that the model is an accurate representation of the data generating process, i.e. that future data will be like the sample data, so we can trust quantities that are derived directly from the non-crossvalidated posterior? And if that’s not what it means, then could someone please try and explain what it does mean, preferably with the help of some simple illustrative scenario? The point being made in the vid seems to be an important one.

I’m that lecturer. You can see that difference in action in the Beta blocker cross-validation demo. Models contain additional information beyond the data, such as assumptions about smoothness and distribution shapes. That additional information reduces variance. Cross-validation uses minimal non-parametric assumptions about the potential future data distribution and thus doesn’t get the benefit of the additional model information. That case study also includes the reference predictive approach, which estimates the same predictive performance as cross-validation but does use the model itself to model the conditional part of the future data distribution; this reduces the variance, but works only if the model is well calibrated. If the model has made it through model checking, and that checking indicates good calibration, we can trust it. The same calibration issue holds when looking at the posterior of the parameters directly.
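A quick toy illustration of that variance reduction (a made-up example, not from the Beta blocker demo): estimate a tail probability from a small sample either non-parametrically or by using a correct distributional assumption, and compare how much the two estimates vary over repeated datasets.

# Hypothetical R sketch: estimate P(y > 2) from n = 30 standard-normal draws, either
# as the empirical fraction (non-parametric) or by plugging the sample mean and sd
# into the (correct) normal model. The distributional assumption reduces the variance.
set.seed(1)
R <- 5000; n <- 30
emp <- mod <- numeric(R)
for (r in seq_len(R)) {
  y <- rnorm(n)
  emp[r] <- mean(y > 2)                               # non-parametric estimate
  mod[r] <- 1 - pnorm(2, mean = mean(y), sd = sd(y))  # model-based estimate
}
c(sd_nonparametric = sd(emp), sd_model_based = sd(mod))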


Conditional on what, sir? The X values observed in the sample plus the distributional assumptions of the family parameter, perhaps? (Other people than the lecturer are allowed to answer too)

Whoa! Does this mean that as long as my model comfortably passes a goodness-of-fit test (such as the Hosmer-Lemeshow test in the case of binomial data), I can skip all those time-consuming loo calculations, and just use the raw posterior for prediction and effect interpretation? That would be a major convenience.

Take the case of simple Gaussian linear regression. If you have two models, and you know that both yield well-calibrated predictions, in the sense of yielding well-calibrated predictive error everywhere, then obviously you would prefer to make predictions from the one that yields narrower uncertainty intervals. The question is: how can you know this? Underpowered goodness-of-fit checking applied to point estimates is not a good way to know this, and absent strong theoretical frameworks you’re almost always going to be underpowered, due to the possibility that poor calibration manifests only over a limited range of the explanatory variables.

In the case of fixed-effects binomial regression, it’s not formally possible to have two models that are both well calibrated everywhere but are different from one another. To have two different everywhere-well-calibrated models requires that one include predictors capable of soaking up variance that the other models as random, but fixed-effects binomial regression contains no such variance term.


See Section 4.3 in A survey of Bayesian predictive methods for model assessment, selection and comparison

I also recommend using loo (or other CV) for model checking. If you are then certain that the model is well calibrated, then

  • you will get reduced variance for the predictive performance estimate with the self-predictive approach, but also a slight optimistic bias. This bias might be negligible, especially in the case of a single model.
  • if the quantity of interest is one of the parameters, and that parameter has no posterior dependencies with other parameters, you can look at the marginal directly (this is what happens in that Beta blocker demo)
  • if the quantity of interest is one of the parameters, and that parameter has posterior dependencies with other parameters, the marginal is less useful. This happens, for example, in the case of collinear covariates, and then looking at the predictive performance is again useful (see, e.g., Bayesian version of Does model averaging make sense?, and Bayesian Logistic Regression with rstanarm); a small numerical illustration follows this list.
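
A minimal numerical sketch of the collinearity point in the last bullet, using made-up posterior draws rather than a fitted model: each marginal is wide and overlaps zero, while a joint summary (here the sum of the two coefficients) is precise.

# Hypothetical illustration: two strongly negatively correlated coefficients whose
# marginals are individually uninformative even though their sum is well determined.
library(MASS)  # for mvrnorm()
set.seed(2)
Sigma <- matrix(c(1, -0.95, -0.95, 1), 2, 2)   # posterior correlation -0.95
draws <- mvrnorm(4000, mu = c(0.5, 0.5), Sigma = Sigma)
quantile(draws[, 1], c(0.05, 0.95))      # marginal of beta_1: wide, overlaps zero
quantile(rowSums(draws), c(0.05, 0.95))  # beta_1 + beta_2: narrow, clearly positive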

In the case of two or more models, the optimistic bias in the self-predictive approach might be non-negligible, which is why we recommend using the reference predictive approach instead, with a well-checked reference model (which is then optimistic only towards the reference model).

To summarise

  • cross-validation is useful when we don’t trust our model(s)
  • if we trust our model, we can examine the posterior directly, but that might be difficult in the case of posterior dependencies
  • if the posterior dependencies cause difficulties in interpretation and we trust the biggest model, we can use the reference-predictive approach to compare the biggest model with the simpler models to gain additional insight
  • if we care just about the predictive performance of one model, and we trust that model, the self-predictive approach may provide lower variance than cross-validation, but with a slight optimistic bias

I’m trying to understand that part of the case study now. Having some difficulty with the formula for \text{elpd}_\text{ref} and especially with the plot below it. What exactly is being plotted on the x axis? In the first geom_density_ridges plot it’s apparently the probability that the true odds ratio < 1, and in the second one it seems to be the elpd_diff between the full and reduced model, but with this one I’m scratching my head. It doesn’t seem to say what exactly is being plotted.

It is the elpd-difference, where elpd is computed with the equation for elpd_ref shown at the beginning of that section (repeated here for convenience)
\mathrm{elpd}_{\mathrm{ref}} = \int p(\tilde{y}|D,M_*) \log p(\tilde{y}|D,M_k) d\tilde{y},
Would it be clearer if I add the subindex
\mathrm{elpd}_{\mathrm{ref},k} = \int p(\tilde{y}|D,M_*) \log p(\tilde{y}|D,M_k) d\tilde{y},
where k=1,2.
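
In case it helps to see that formula as computation: for a Bernoulli observation model the integral is just a sum over \tilde{y} \in \{0,1\}. A minimal sketch with made-up predictive probabilities (not the numbers from the case study):

# elpd_ref for one future Bernoulli observation.
# p_ref = P(y = 1 | D, M_*) from the reference model's posterior predictive,
# p_k   = P(y = 1 | D, M_k) from the model being evaluated.
elpd_ref_k <- function(p_ref, p_k) {
  p_ref * log(p_k) + (1 - p_ref) * log(1 - p_k)
}
# elpd-difference between two candidate models, judged under the reference model
elpd_ref_k(p_ref = 0.08, p_k = 0.08) - elpd_ref_k(p_ref = 0.08, p_k = 0.12)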


Oh, a difference. If so, isn’t that just a simple log score comparison between the bigger and smaller model? Much like the classical likelihood-ratio test, except with log scores averaged over posteriors rather than calculated from single point estimates?

I suppose that would explain why the bigger model wins even when True OR = 1 (top row of plot). With one extra parameter it can fit more noise and thereby win the self-prediction contest even when there’s no true effect underlying the realized data.


No. The log score is computed using the predictive distribution, so the averaging over the posterior happens before taking the log. It is the log score averaged over the estimated future data distribution. See more details in Section 3.3.1 of A survey of Bayesian predictive methods for model assessment, selection and comparison

In this two-model case I did use the bigger one as the reference model, and yes, this leads to favoring it. It is more common to use the reference model when there are many models or when the goal is to project the reference model to a smaller one (not included in the case study, but see more in Section 3.3.2 of A survey of Bayesian predictive methods for model assessment, selection and comparison).

I included the reference model example just to illustrate the effect of modeling on the variance, but didn’t go into the details of the possible bias due to the modeling. In the example of the case study, there is no real need to do model comparison, as the posterior of the bigger model provides the most useful information anyway. The intro of the case study says “In this case, cross-validation is not needed, and we can get better accuracy using the explicit model.” I could add that none of the other methods are needed, and that we get the best results using the original model itself.

Tough read. But if I got it right, Equation 11 of Piironen and Vehtari 2017 is an even more precise description. I guess the idea would be: simulate many new response vectors from the posterior of M_*, then for each model under comparison (here just M_* and M_{\text{0}}), calculate its average lpd over the many simulated response vectors, and compare those average lpd’s between/among the models.

Is this finally correct?

I think it is such an important point that it cannot possibly be over-illustrated. At least for the mathematically limited among us, those brief remarks in the lectures about “reduced variance” were not at all enough to drive the lesson home.


Yes. When using the pointwise log score, instead of simulation we can also do the integral exactly in the finite discrete case (as was done with the Bernoulli distribution here) or to a desired accuracy using adaptive quadrature.
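
For a continuous observation model, the same pointwise integral can be evaluated with adaptive quadrature; here is a minimal sketch with made-up Gaussian predictive distributions (not the model in the case study):

# Hypothetical sketch: int p(y | D, M_*) log p(y | D, M_k) dy for Gaussian predictive
# distributions, evaluated with R's adaptive quadrature routine integrate().
elpd_ref_gaussian <- function(mu_ref, sd_ref, mu_k, sd_k) {
  integrand <- function(y) dnorm(y, mu_ref, sd_ref) * dnorm(y, mu_k, sd_k, log = TRUE)
  integrate(integrand, lower = -Inf, upper = Inf)$value
}
elpd_ref_gaussian(mu_ref = 0, sd_ref = 1, mu_k = 0.2, sd_k = 1.1)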

It’s a difficult question whether I should make such brief remarks in such a long lecture or not. I’m glad you found the case study interesting and have been asking questions. I’m now adding more explanations to that case study, and will later write more to clarify some of the points.


Many thanks for clarifying these things.

It’s a deep lesson that I’m only beginning to understand now. Weird that I haven’t seen it discussed elsewhere.

Looking forward to it.

Looking forward to learning more about this too. Probably a topic for a new thread though.


Though the thread is solved, I’m trying to clarify my thoughts some more, and it seems better to continue here rather than starting a new thread.

I’d like to formulate the “models reduce variance” idea as succinctly as possible without becoming vague.

Could we truthfully state the following: LOO-CV has high variance because we are comparing N “point responses” y_i to N “point predictions” p_{\text{M}_k}(y_i|y_{-i}) and having to estimate the variance of the resulting \widehat{\text{elpd}}_{\text{LOO},i} scores empirically, whereas with a trustworthy reference model we compare N distributions with a known shape (and hence variance) to N other distributions with a known shape (and hence variance), thereby not having to estimate a separate variance empirically from the “point” quantities?

Also: For Bernoulli models, is \text{log}~p_{\text{M}_k}(y_i|y_{-i}) the same thing as

dbinom(D$y[i,], 1,
       prob = brms::fitted(update(Mk, newdata = D[-i,]), newdata = D[i,])[, 1],
       log = TRUE)

??
I tested both a manual LOO script and kfold(Mk, folds = "loo") on fake Bernoulli data and the results were almost but not exactly identical, so I’m wondering whether it’s because of simulation variability or because I’m computing the wrong thing.

BIG EDIT: Maybe it’s as simple as this: when there’s no distributional assumption, ANY variability in \text{log}~p_{\text{M}_k}(y_i|y_{-i}) across observations will increase the “uncertainty” of \widehat{\text{elpd}}_{\text{LOO}}. The reason being: without a distributional assumption, \widehat{\text{elpd}}_{\text{LOO}} doesn’t know what degree of variability is “expected” and what degree is “unexpected”, so it simply adds all the pointwise variances together to form an aggregate measure. OTOH, if the analyst happens to know what the conditional distribution of y is, then (s)he knows better than loo() whether se_elpd_loo or se_diff truly is “large” or is merely what you’d normally expect from this distribution.
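
For reference, here is how I understand that aggregation (a sketch of the loo-style standard error, with made-up pointwise values):

# The reported uncertainty is just the empirical sd of the pointwise elpd values,
# scaled by sqrt(n); no knowledge of what variability would be "expected" under a
# distributional assumption enters the calculation.
pointwise <- c(-0.61, -0.70, -0.25, -1.20, -0.45)  # made-up log p(y_i | y_-i) values
elpd_loo <- sum(pointwise)
se_elpd_loo <- sqrt(length(pointwise)) * sd(pointwise)
c(elpd_loo = elpd_loo, se_elpd_loo = se_elpd_loo)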

(everyone’s input is welcome, not just avehtari’s)

LOO-CV has higher variance because the integral

\int p(\tilde{y})\log p(\tilde{y}|D,M_k)d \tilde{y}

is approximated with a finite number of terms in a Monte Carlo approximation

\frac{1}{N}\sum_{i=1}^N \log p(y_i|D_{-i},M_k)

This approximation also includes a small bias due to the use of D_{-i} instead of D.

The reference model approach has lower variance, as it approximates that integral with the integral

\int p(\tilde{y}|D,M_*)\log p(\tilde{y}|D,M_k)d \tilde{y}

which we can often compute with arbitrary accuracy, but now there can be a big bias if p(\tilde{y}|D,M_*) is very different from p(\tilde{y}).

If the number of observations went to infinity, then both of these would have small variance.

Looking at the posterior directly is another thing. It has the additional benefit that the model often divides the total uncertainty into the epistemic part (in the posterior) and the aleatoric part (in the observation model), and the epistemic uncertainty is often smaller. All the predictive performance methods use the predictive distribution, which combines both epistemic and aleatoric uncertainty. The downside of looking at the posterior directly is that we are usually not able to observe the true parameter later, and thus we need to trust the model and understand the role of the parameter well.

Looks correct. For that case study, I used analytic equations. If using MCMC and Stan, I would use PSIS-LOO in the loo package, so that refitting is not needed.
