I am conducting a forward iterative model comparison analysis of longitudinal data, similar to that discussed by Singer and Willett in Applied Longitudinal Data Analysis, but using LOO-CV methods in Stan/brms. Excuse the mixed-effects model nomenclature, but I start with a null model (with no fixed predictors and only a random intercept for each participant) and then add fixed predictors to the model. There are not a large number of models (six including the null model) and I have more than 100 participants, the majority with 5 observations each.

In one of his comments on this thread (Clarifying interpretation of loo_compare output) @torkar implies that an |elpd_{diff}/se_{diff}| value of 2 or more might be “noteworthy” (I am studiously avoiding the term “significant”, and “noteworthy” is the best synonym I could come up with).

In one of his comments on the same thread @avehtari linked me to the excellent cross-validation FAQ resource (Cross-validation FAQ) where, in sections 5 and 15, he writes

When the difference (elpd_diff) is larger than 4, the number of observations is larger than 100 and the model is not badly misspecified then normal approximation and SE are quite reliable description of the uncertainty in the difference. Differences smaller than 4 are small and then the models have very similar predictive performance and it doesn’t matter if the normal approximation fails or SE is underestimated.

However, this specifies a threshold value for elpd_{diff} rather than for |elpd_{diff}/se_{diff}|.

So, my questions are:

1. Where can I find a publication or website with some threshold guideline values for making decisions about model comparison using elpd_{diff} and se_{diff}?

2. Should I use elpd_{diff} or elpd_{diff}/se_{diff} as the metric to guide my decisions?

Good question. I think @avehtari could help in clarifying this.

Nowadays, I personally feel comfortable if I have a large N and \mathrm{elpd}_{\mathrm{diff}} / \mathrm{SE}_{\mathrm{diff}} \geq 4. It would be wonderful if Aki could add a reference to his FAQ which one can cite in papers to solidify this claim. In short, Aki’s post here:

Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison. arXiv preprint arXiv:2008.10296.

We do mention when we assume the elpd_diff and se_diff based normal approximation to be well calibrated (|elpd_diff| > 4, N > 100, and no bad model misspecification), and when the error is likely to be small even if the normal approximation is not well calibrated (|elpd_diff| < 4 and no bad model misspecification). Here the threshold 4 is related to when the normal approximation can be assumed to be well calibrated, and not to the probability of one model being better than the other.

For the case when the normal approximation is assumed to be well calibrated, we intentionally don’t provide any threshold for “noteworthy” or “significant”, as it is better to report the continuous values of elpd_diff, se_diff and other information available. When the normal approximation is assumed to be well calibrated you can use that approximated distribution to make inference and decisions as with any well calibrated distribution describing uncertainty. It is common to use an arbitrary 2SE for reporting and illustration, but there is nothing magical about 2SE that would make us recommend it over other choices. Also, instead of x SE, you can think in terms of probabilities and choose a probability you like to report.

We don’t recommend looking at |elpd_diff/se_diff| as it loses the important information about when the normal approximation is well calibrated and when the difference is small anyway. It’s not that useful to say one model is likely to provide better predictions than the other if the predictions are almost the same anyway. That’s why it’s better to report both elpd_diff and se_diff.

Use both elpd_diff and se_diff to form the normal approximation.
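As a concrete sketch of what “use the normal approximation” can look like in practice (the elpd_diff = -19.4 and se_diff = 6.7 here are taken from the comparison discussed later in this thread; any tail probability or interval level is your own choice):

```r
# Values from a loo_compare() output (elpd_diff and se_diff
# for the comparison discussed later in this thread)
elpd_diff <- -19.4
se_diff   <- 6.7

# Tail probability under the normal approximation: the probability
# that the true difference is positive, i.e. that the lower-ranked
# model is actually the better one
pnorm(0, mean = elpd_diff, sd = se_diff, lower.tail = FALSE)
#> approx 0.0019

# An interval at whatever probability level you choose, e.g. 90%
elpd_diff + qnorm(c(0.05, 0.95)) * se_diff
```

The point is that once you treat N(elpd_diff, se_diff) as a calibrated description of the uncertainty, you can report whichever summary of it suits your problem, rather than a fixed 2SE cutoff.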

Thanks @avehtari and @torkar, much appreciated. Aki, is the “SE” in the “2SE” quoted above not the same as “se_diff”? And, if they are not the same, what “SE” are you referring to?

SE is the standard error in any normal approximation. Normal approximations and SEs are common in contexts other than elpd, too. se_diff is the SE for elpd_diff. Sorry if it’s confusing that sometimes the abbreviation is written SE and sometimes se.

Thank you @andrjohns. I’m still confused about how to apply Aki’s rules.

He says above

“When the normal approximation is assumed to be well calibrated you can use that approximated distribution to make inference and decisions as with any well calibrated distribution describing uncertainty”

But how? Using the example you provided, I applied the 2SE “rule” (while acknowledging all of Aki’s caveats):

-5352.2 + c(-1,1)*2*709.2
[1] -6770.6 -3933.8

Is this what he means by “using the approximated normal distribution”, some sort of credibility interval for the comparison?

Here I’m still using se_diff for the SE, though, so I assume it’s wrong, but I don’t know what other SE to use. Is it the SE you extracted from the loo object (your t above)? And, if so, which SE do you use: the SE from the first model being compared or from the second?

How should I apply the rule to this model comparison? Clearly the |elpd_diff| is > 4. There are more than 100 observations and the second model is well-calibrated. Is there enough information here to make a decision (assuming a 2SE rule as a guide) or do I need to do something more? If there is enough information here could someone walk me through the steps and if there isn’t tell me where to look and what to extract?

I’ll take a step back from loo specifically. The standard error (SE) of a parameter estimate is the standard deviation of its sampling distribution, i.e. a measure of the uncertainty in the estimate. We can use the parameter estimate and its standard error to approximate the distribution of the parameter as a normal distribution. When Aki says “When the normal approximation is assumed to be well calibrated”, this refers to the assumption that this approximation of the parameter’s distribution is accurate/faithful. If the normal approximation is accurate, then we can use it to make inferences about the range of values that the parameter can take.

Every parameter can have its own standard error, and this can be used to make inferences about the parameter itself. Using the original loo example:

Estimate SE
elpd_loo -6247.8 728.0

If we use the 2SE rule:

-6247.8 \pm 2*728.0 = [-7703.8, -4791.8]

This means that a normal approximation to the distribution of the ELPD-LOO (for this model) implies that values within 2 standard deviations of the parameter estimate (~95% of the distribution) fall in the range [-7703.8, -4791.8]. Note that this is pretty much how confidence intervals are derived.

To use this to make inferences about whether two models differ (according to LOO), you look at the difference in their LOO values (elpd_diff) and the standard error of that difference (se_diff). With these two estimates, you can approximate the range of possible values for the difference in LOO.

Given the elpd_diff and se_diff you posted above, you would then run:

> -19.4 + c(-2,2) * 6.7
[1] -32.8 -6.0

Which gives a ~95% confidence interval for the difference in LOO values between the two models. Because this interval does not contain 0 (which would imply equal LOO), this gives support for the two models being different.

@avehtari Any corrections there? It’s been a while since I’ve dealt with much of this.

Pretty much (as confidence intervals, to be specific). Both approaches provide similar information about the magnitude of the difference relative to its standard error. The elpd_diff/se_diff approach is analogous to the p-value approach (which would compare elpd_diff/se_diff to the quantiles of a standard normal), and the approach above is analogous to the confidence interval approach.
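To make the analogy concrete (a sketch in R, reusing the elpd_diff = -19.4 and se_diff = 6.7 from the comparison above):

```r
elpd_diff <- -19.4
se_diff   <- 6.7

# "p-value" view: standardize the difference and compare it
# to the quantiles of a standard normal
z <- elpd_diff / se_diff       # about -2.9
p <- 2 * pnorm(-abs(z))        # two-sided tail probability, ~0.004

# "confidence interval" view: ~95% interval for the difference
ci <- elpd_diff + qnorm(c(0.025, 0.975)) * se_diff

# The two views agree: |z| > 1.96 exactly when the 95% interval excludes 0
```

Both summaries are derived from the same normal approximation N(elpd_diff, se_diff), which is why they lead to the same binary conclusion at matching levels.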

Brilliant, thank you! I’ve always wondered why journals these days champion confidence intervals and discourage p-values when, generally speaking, if the p-value is < .05 the confidence interval will exclude 0. But I suppose that even though both can be used to make binary decisions about the amount of evidence needed to conclude the presence or absence of an effect, or the superiority of a model, confidence intervals encourage people to take into account the uncertainty in the estimate, whereas p-values are a little opaque.

It’s a big topic, but there are issues with both. While confidence intervals can help quantify the uncertainty in an estimate, they rely on the parameter (approximately) following a normal distribution. That’s where bootstrapped confidence intervals (or Bayesian methods) get recommended, since they don’t depend on the normality of the parameter’s distribution (speaking *very* generally).
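As an illustration of the bootstrap idea (a sketch only: `diff_i` below is simulated as a stand-in for the per-observation elpd differences, which in practice you could take as `loo1$pointwise[, "elpd_loo"] - loo2$pointwise[, "elpd_loo"]` from the two loo objects):

```r
set.seed(1)
# Simulated stand-in for the pointwise elpd differences between two models
diff_i <- rnorm(500, mean = -0.04, sd = 0.3)

# elpd_diff is just the sum of the pointwise differences
sum(diff_i)

# Bootstrap: resample observations and recompute the total difference
boot_diff <- replicate(4000, sum(sample(diff_i, replace = TRUE)))

# Percentile interval that does not lean on the normal approximation
quantile(boot_diff, c(0.025, 0.975))
```

If the normal approximation is good, this percentile interval will be close to elpd_diff ± 1.96·se_diff; when it isn’t, the bootstrap interval can be asymmetric.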

Your answer was excellent. As |elpd_diff| > 4, the normal approximation is likely to be good.

And my caveats were

don’t report just 2.895522, or that some not-shown value was bigger than 2; report explicitly both elpd_diff and se_diff, because then others can see how big the absolute value of elpd_diff is (which is a useful diagnostic), and others can decide which tail probabilities to use

we don’t want to encourage hypothesis testing with some arbitrary threshold, as different cases may require different thresholds. Thus we don’t say “use 2SE”; instead we say to use the normal approximation to make your decision, but you need to decide yourself what tail probability is meaningful in your specific case. This means I can give you a reference which discusses the normal approximation and its diagnostics, but I don’t recommend any specific threshold.

Thank you @avehtari. I have already reported elpd_diff and se_diff for each comparison; however, I was also reporting the elpd_diff/se_diff value as well, like this

Backing this up a bit, I’d first ask myself “for which purpose am I selecting a model?” Which then leads to “for which purpose am I using loo?”

To answer these questions, I’d need to think of what inference I hope to make. If I am looking to interpret the parameters, then I’d think about what my research question is and let this be my guide; e.g., if I am interested in the individual effects, then I’d just fit the full model with varying/random slopes. Then I’d describe the results. Or do I want to interpret certain parameters, given that others are controlled for? Etc.

Further, if loo did not favor a model, it’s likely that would be reflected in the posterior distributions.

Now if I was going to take the model and make some predictions, then I would be really interested in loo and what difference is notable for that goal. My lab does intensive longitudinal data analysis and typically we just about never use the model for forecasting, but rather to interpret the parameters, in light of our question.

Sorry, this is perhaps a non-answer answer, but it seems one alternative, which avoids all of these issues, is to just fit the full model :-)

Yes @donny, I did wonder if that would have been a better approach. It would certainly have been much simpler and would have allowed me to quantify the absence of effects. In their book Applied Longitudinal Data Analysis, Singer and Willett advise the forward iterative model comparison approach to test the overall effects of predictors, over simply running the full model. I am just emulating their approach using an analogous Bayesian method. How valid it is to transpose the NHST approach to the Bayesian context I don’t know. But I definitely take your point, and I wavered between the two approaches.