Reporting the need for a parameter using loo?

Hi Everyone,

I am looking for a way to more clearly report the evidence in favor of omitting a parameter from a model (any model, but let's say a slope in a regression for simplicity).

  • I personally prefer avoiding Bayes factors (BF) due to their sensitivity to the priors.
  • Stacking seems to always get me to 100% / 0%.
  • loo_compare / k-fold CV is great, but the elpd_diff numbers seem mostly arbitrary, which makes them less appealing for reporting “as is”, I think. It's like after all this accurate Bayesian sampling, we end up using an arbitrary cutoff (e.g., > 2 SD on the elpd_diff) for deciding which model is more appropriate given the data. Do you share this thought, or am I missing something basic here?

What I thought would be nice to report is something like “There is X% probability that the reduced model (without the parameter) predicts future data better than the full model that includes this parameter.”

So my main question is how to get that. For example, I thought we could get the pointwise elpd for the models with and without the parameter of interest, and then bootstrap the pointwise elpd_diff vector to get this probability.

# Extract pointwise elpd_loo values for each model
loo_full <- loo(full_model, save_psis = TRUE)
loo_reduced <- loo(reduced_model, save_psis = TRUE)
elpd_diff_pointwise <- loo_reduced$pointwise[, "elpd_loo"] - loo_full$pointwise[, "elpd_loo"]


# Bootstrap probability
prob_reduced_better <- mean(replicate(4000, sum(sample(elpd_diff_pointwise, replace = TRUE))) > 0)

# Report
cat(sprintf("Probability reduced model predicts better: %.1f%%", prob_reduced_better * 100))

Would that make sense? Am I missing a simpler way to get this? Would really love to have your thoughts here.

Best,
Nitzan

No, they are not mostly arbitrary. See CV-FAQ 12.
You can also use other cost and utility functions that you assume are more familiar to the target audience. See CV-FAQ 11.

My recommendation is not to use an arbitrary cutoff. You should report elpd_diff and diff_se. You may also report the probability that one model is better than another. There is no need for cutoffs, and in particular you should not hide elpd_diff, diff_se, and the probability behind any arbitrary cutoff.

Yes! This would be fine. I would prefer “The predictive performance of the reduced model is not worse than the predictive performance of the full model with probability X%”. I would state it this way, as with sensible priors the bigger model is usually better.

No need for the bootstrap, as the normal approximation works as well. Since you are using R and the loo package, you can get the normal approximation with

loo_compare(loo_full, loo_reduced)

See also CV-FAQ 16, which also briefly mentions the conditions under which the normal approximation is well calibrated (the same holds for the bootstrap). For more details about the normal approximation for the elpd difference, see [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.

Given elpd_diff and diff_se, you can use pnorm() to compute the probability. See the Nabiximols case study, Section 15, for a complete code example. That case study also shows that, instead of the predictive performance, it can be better to look at the posterior of the quantity of interest using the full model.


This is really great, thanks. I am working with logistic regression; is there any problem with taking the mean/sd for pnorm directly from the elpd_diff? I'm attaching the code in case it helps.

m_reduced <-
  brm(
    formula = stay_key ~ 0 + Intercept + (1 + reward_oneback | subject),
    data = df,
    family = bernoulli(link = "logit"),
    warmup = 2000,
    iter = 4000,
    chains = 4,
    cores = 4,
    prior = myprior,
    backend = "cmdstanr"
  )

m_full <-
  brm(
    formula = stay_key ~ 0 + Intercept + reward_oneback + (1 + reward_oneback | subject),
    data = df,
    family = bernoulli(link = "logit"),
    warmup = 2000,
    iter = 4000,
    chains = 4,
    cores = 4,
    prior = myprior,
    backend = "cmdstanr"
  )

loo_reduced <- loo(m_reduced, save_psis = TRUE)
loo_full    <- loo(m_full, save_psis = TRUE)

comp <- loo_compare(loo_full, loo_reduced)
# loo_compare ranks models by elpd; row 2's elpd_diff is (second-ranked minus
# first-ranked), so pnorm(0, ...) gives the probability that the first-ranked
# model has the better expected predictive performance
elpd_diff <- comp[2, "elpd_diff"]
se_diff   <- comp[2, "se_diff"]

prob_reduced_better <- pnorm(0, mean = elpd_diff, sd = se_diff)
prob_reduced_better * 100  # percent chance

Yes, you can do it like this

This is really interesting Aki, thanks!

Am I correct in my understanding that here:

the “probability” you’re referring to is not a posterior probability (or a “Bayesian probability”) but a frequentist one: the expected proportion of the sampling distribution of elpd_diff that would indicate the reduced model being not worse than the full model?

If so, is there a way to get a posterior probability of one model performing better than another model that doesn’t use (fractional, intrinsic, or other) Bayes factors?

It’s a Bayesian probability, as the uncertainty is about the unknown future conditional on the fixed observed data set. We discuss this in [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison. Let me know if that paper is not clear enough about this.

So se_diff is a summary of the uncertainty in the difference in predictive performance, carried over by the (epistemic) uncertainty of a model’s estimated PPD - am I getting this right? Which means that we can expect stronger priors on the parameters of the model(s) to lead to smaller se_diff?

And then, if we had the actual posterior distribution p(\Delta elpd_{M1-M2}|Y), we could find the posterior probability p(\Delta elpd_{M1-M2}<0|Y) by integrating over said posterior.

But instead, assuming a normal approximation \Delta elpd_{M1-M2} \sim N(\mu, \sigma^2), we can take \mu=elpd_diff and \sigma=se_diff and find \Phi_{\Delta elpd_{M1-M2}}(0).

No. The relevant part in [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison is the first paragraph of Section 2.2. I’ll try with slightly different and more words. elpd_loo is an estimate of the predictive performance for an infinite amount of future data. If we observed an infinite amount of future data, there would not be any uncertainty. If we observed N future data observations and used those to estimate the predictive performance for infinite future data, we would still have uncertainty. We are using cross-validation because we don’t trust any model. Thus, when evaluating the model performance, we don’t use any parametric model for the future data distribution. We can consider that we’re using a non-parametric model that has a point mass at each of the N future observations. We can then compute the posterior for the mean of the non-parametric model. We could do this exactly by using a Dirichlet model and the Bayesian bootstrap, see Bayesian model assessment and comparison using cross-validation predictive densities. But instead we approximate that posterior with a normal distribution. Now in practice, we don’t observe those N future observations, and we instead use the already observed N observations as a proxy for future data (without weighting if the observed and future data are assumed to come from the same distribution, and with weighting if the distributions are assumed to be different). Cross-validation is then used to avoid the bias from double use of the data.

Note that we could also use parametric models for the future data. If we use the same model we are assessing, it would be the self-predictive approach. If we use some other model than the one we are assessing, it would be the reference-predictive approach. See more in A survey of Bayesian predictive methods for model assessment, selection and comparison

Yes. In case of Dirichlet model for the future data, Bayesian bootstrap does the computation with the actual posterior.

Yes.

Thanks for the clarification. I found it very useful and insightful.

However, I would like to go back to the uncertainty: you say it is Bayesian, about the relative fit conditional on the fixed observed dataset. Yet in Section 2.2, page 7, you write:

We consider also the distribution of \widehat{elpd}_{LOO}(M_a, M_b | y) as a statistic over possible data sets y, and call it the sampling distribution. This follows the common definition used in frequentist statistics.

(emphasis added)

To further my confusion, later on the same page you write:

It can be seen that [\widehat{SE}_{LOO} equation] does not consider the variability caused by the term elpd(M_a, M_b | \tilde{y}) over possible datasets.

(emphasis added)

Even though it seems to be directly computing the standard error of \widehat{elpd}_{LOO}(M_a, M_b | y).

What am I missing or misunderstanding in the above quotes, and how do they relate to the probability p(\mathrm{elpd}(\mathrm{M}_a, \mathrm{M}_b \mid y) < 0) being computed?

I really appreciate you taking the time here.

Thank you Aki for this. I think this is a real game changer, at least for me, to report the model comparison results in such a manner.

I am a bit worried still, and wanted to please double check. I got the following loo_compare results for three nested models (Model 1 is the baseline, Model 2 has one more parameter, and Model 3 has two more parameters compared to the baseline). The elpd_diff and se_diff between Models 3 and 2 look unimpressive (thus Model 2 should be better, since it is more parsimonious). Yet, it seems there is a 78.5% probability that the full model (Model 3) would be preferred in terms of predicting new data. Do the numbers make sense? I think it's time I pass this on to my students, but I really wanted to double check.

 elpd_diff se_diff
model3   0.0       0.0  
model2  -3.4       4.3  
model1 -57.5      12.1  
loo (reduced):  -9644.14 
loo (partially_reduced):  -9590.01 
loo (full):  -9586.631 

# Model 3 vs 2
> pnorm(0, -3.4, 4.3)
[1] 0.7854398
# Model 3 vs 1
> pnorm(0, -57.5, 12.1)
[1] 0.999999

Possible report: We found a >99% probability that the full model (Model 3) will outperform the reduced model (Model 1) in predicting new data. However, we found only a 78% chance that the full model (Model 3) will outperform the partially reduced model (Model 2) in such future predictions. We therefore conclude that the predictive performance of Model 2 is not substantially worse than the predictive performance of the full model (Model 3).

Thank you!
Nitzan


\widehat{\mathrm{elpd}}_\mathrm{LOO}(\mathrm{M}_a, \mathrm{M}_b | y) conditional on y is considered to be a point estimate. It is perfectly fine to examine the frequency properties of Bayesian point estimates. Frequency properties are analyzed by considering what would happen if we had observed something else from the same data generating process that generated the observed y. That sentence says that we examine how \widehat{\mathrm{elpd}}_\mathrm{LOO}(\mathrm{M}_a, \mathrm{M}_b | y) varies when, instead of the observed y, we consider all possible data sets that we might have observed.

That is correct (although that \tilde{ } over y should be dropped for clarity). That equation corresponds to the uncertainty about the future expected predictive performance conditional on the observed data y. If we additionally considered the variation from asking what if we had observed something other than y, the uncertainty would be bigger. We discuss this latter option more in Appendix A.

Maybe the use of the term “standard error” is confusing? Maybe we should have called it standard deviation of the predictive distribution for the future predictive performance?

This is not different from generic probability calculus (without need to even consider what is the definition of probability).

If we have p(\theta | y)=\mathrm{normal}(\theta | \mu, \sigma), we can compute p(\theta < 0 | y) = \int_{-\infty}^0 \mathrm{normal}(\theta | \mu, \sigma)d\theta. Now instead of p(\theta | y)=\mathrm{normal}(\theta | \mu, \sigma) we have \hat{p}(\mathrm{elpd}({\mathrm{M}_a, \mathrm{M}_b}|{y})) = \mathrm{normal}(\widehat{\mathrm{elpd}}(\mathrm{M}_a, \mathrm{M}_b|{y}), \, \widehat{\mathrm{SE}}({\mathrm{M}_a, \mathrm{M}_b}|{y})) .
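
Written out with the normal CDF (just a restatement of the above, using the elpd_diff and se_diff that loo_compare reports):

\hat{p}(\mathrm{elpd}(\mathrm{M}_a, \mathrm{M}_b \mid y) < 0) \approx \Phi\!\left(\frac{0 - \widehat{\mathrm{elpd}}(\mathrm{M}_a, \mathrm{M}_b \mid y)}{\widehat{\mathrm{SE}}(\mathrm{M}_a, \mathrm{M}_b \mid y)}\right),

which is exactly what the pnorm(0, elpd_diff, se_diff) call earlier in the thread evaluates.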

Good that you ask, so I can think about how we can improve the explanations (right now the problem is that the journal has a strict page limit, so we can’t add anything to the main text).


More parsimonious is not necessarily better. If you are interested in the uncertainty related to the additional parameters in Model 3, then Model 3 is superior, as Model 2 ignores that uncertainty completely. Even if Model 3’s predictive performance is not better than Model 2’s, it is still possible that the posterior for those additional parameters is narrower than the prior, which means the data did have some information about them.

The paper [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison mentions that if the difference is small (e.g., |elpd_diff| < 4), then se_diff might be underestimated and the actual distribution is skewed and not well approximated with a normal, so you can’t trust that probability. However, as the difference is small, it doesn’t matter which model you choose, as they have similar predictive performance!

I would change this to “There is no statistically or practically significant difference between the predictive performances of models 2 and 3.”


Yes, I think that is part of what’s confusing me. When I hear “standard error”, my mind shifts gears into frequentist uncertainty.

Absolutely, except that that (what the definition of probability is) is exactly what I’m trying to understand (to clarify my motivation: I want to be able to explain this to some extent to my undergraduate students).


Okay, final shot: se_diff represents our (Bayesian) uncertainty in our model’s performance on a set of new data, conditional on the observed data that we used to “estimate” p(\hat{y}). That is, the uncertainty is related not to the models’ predictions (the PPD, which is fixed for a given model) but to the distribution of future data, which is what is here conditioned on the observed data (which I think would be denoted as p(\hat{y}| y)).

How’s that?

Damn frequentists ruining so many good words. We are revising the paper, and I’ll check with the co-authors whether it would make sense to change the term. Table 1, which introduces the notation, already has
\widehat{\mathrm{SE}}_\mathrm{LOO} = estimator for the standard deviation of \widehat{\mathrm{elpd}}_\mathrm{LOO}(\cdots \mid y).

Would you think it would be better to have
\widehat{\mathrm{SD}}_\mathrm{LOO} or \widehat{\mathrm{sd}}_\mathrm{LOO}?
We could change this also in loo package version 3, so that we would have sd_diff.

On the other hand, there is still a connection to the error distribution, even if the interpretation is Bayesian, so in that way SE would make sense.

Oh, please don’t stop if you have still something to say! This discussion is very useful for me, too!

For the first part only a slight modification:

se_diff represents our (Bayesian) uncertainty in our model’s performance on a set of new data. The model posterior predictive distribution for the future data p(\tilde{y}\mid y) is conditioned on the observed data y. That is, the uncertainty is related not to the models’ predictions (which are fixed for a given model) but to the unknown distribution of future data.

Then a new, longer explanation (my thinking has been based on this, but I had assumed bringing up the Dirichlet process was not necessary; let’s try).

The expected future log score predictive performance for one unknown future observation is \int \log p(\tilde{y}\mid y) p(\tilde{y}) d\tilde{y}. We use a minimal-assumption Dirichlet process model for the future data distribution p(\tilde{y}). We set the probability of the base distribution to 0, and have point probability mass w_i at \tilde{y}_i, with w \sim \mathrm{Dirichlet}(n, 1, \ldots, 1). We re-use the observed data y as a proxy for the not yet seen future data observations. Conditioning on y we get \tilde{y}_i=y_i. The distribution of w is not influenced by y. The future data distribution is then modelled as DP(\tilde{y}) = \sum_{i=1}^n \delta(y_i) w_i. The posterior distribution
p\left(\int \log p(\tilde{y}\mid y) DP(\tilde{y}) d\tilde{y}\right) \approx p\left(\sum_{i=1}^n \log p(y_i \mid y_{-i}) w_i\right),
where we have replaced p(y_i \mid y) with p(y_i \mid y_{-i}) to avoid the double use of the data. The uncertainty here comes from the unknown point probability masses w_i; that is, we have epistemic uncertainty about the w_i. Asymptotically, p(\int z\, DP(z)\, dz) approaches a normal distribution, and thus the normal approximation is justified as an approximation for the posterior of the mean functional given a Dirichlet process prior. However, pre-asymptotically, and especially due to the dependency between the p(y_i \mid y_{-i}) terms, the normal approximation may fail, as discussed in the paper.

When the true future data distribution is continuous, it may seem strange to model the future data distribution with point masses, but it has been shown that the posterior of the mean functional is good already pre-asymptotically. This approach of estimating the posterior of the mean functional using Dirichlet process models is also known as the Bayesian bootstrap.
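
A minimal R sketch of this Bayesian bootstrap (reusing the pointwise elpd_loo differences, elpd_diff_pointwise, from the code earlier in the thread; the Dirichlet(1, …, 1) weights are obtained by normalizing Exponential(1) draws):

# Bayesian bootstrap over the pointwise elpd_loo differences
n <- length(elpd_diff_pointwise)
bb_mean_diff <- replicate(4000, {
  w <- rexp(n)                  # Exponential(1) draws ...
  w <- w / sum(w)               # ... normalized give Dirichlet(1, ..., 1) weights
  sum(w * elpd_diff_pointwise)  # one posterior draw of the mean elpd difference
})

# Probability that the reduced model predicts future data better
mean(bb_mean_diff > 0)

Multiplying each draw by n gives draws on the total elpd_diff scale reported by loo_compare; the sign, and hence this probability, is unchanged.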

Let me know if you think this would be more clear than the current explanation in the paper.

Further comment: the Dirichlet process model for the future data connects different methods for assessing predictive performance. Using probability 0 for the base distribution corresponds to in-sample and cross-validation predictive approaches, depending on whether p(y_i \mid y) or p(y_i \mid y_{-i}) is used. Using probability 1 for the base distribution corresponds to self- or reference-predictive approaches, depending on whether the predictive distribution of the assessed model itself or a reference model is used as the base distribution. It would be possible to use a base distribution probability between 0 and 1, but there doesn’t seem to be a clear benefit from this.

Too true…

Maybe \widehat{\text{SDE}}_{\text{LOO}}? Standard Deviation of the Error distribution?


Okay, I think I get it now! (Though the part with Dirichlet Process is outside my scope - but I understand that to be the Bayesian bootstrap, so it’s hard for me to evaluate how well that would fit in the paper).

I do think the paper would benefit from clearly spelling out that the uncertainty being quantified is over p(\tilde{y}|y), not p(\Theta|y). I understand that this information is in the paper, but I think even a small aside or something would be very helpful to the less statistically fluent.


We could change this also in loo package version 3, so that we would have sd_diff.

I think the loo_compare function is just fantastic and so useful. For that reason I also think that it might be good to clear up the column titles a bit… I personally got really confused by se_diff, thinking at first that it is the difference between the SE of each vector, and that SE is referring to some t-test… :)

I think you could maybe consider saying “elpd_diff” and then “se_of_elpd_diff”. Since the loo_compare output is not very wide, I think using a few more words can pay off. If you are willing to consider a further change, maybe also go with “var” instead of “se”? (I assume var = se^2?) Then maybe also use “gain” instead of “diff”, since it points in the direction of the better model. So how about:
elpd_gain | var_of_elpd_gain ?
I guess nothing is perfect in the end, but I share the feeling both that SE is confusing (frequentism-wise) and that “diff” can be misinterpreted :)

Take care,
Nitzan