One or two-tailed normal approximation for elpd_loo?

ADMIN EDIT: Moved to a new thread from "If elpd_diff/se_diff > |2|, is this noteworthy?" as it is at least partially a new question.

Reviving this thread since my question is so similar.

N = 2319 and the model is as well specified as I can manage.

I have three covariates whose removal decreases elpd by more than 4 but with a large se_diff so that 1.96 > |elpd_diff/se_diff| > 1.64. In other words, |elpd_diff/se_diff| is large enough to be statistically significant in a one-tailed z-test but not in a two-tailed one.

Is it appropriate to use the one-tailed approximation here? It seems to me that it is, because the degree to which elpd can improve by removing covariates seems to be capped. In expectation, it will improve by 0.5 units per parameter whenever useless covariates are removed, but I don't see how it could improve much more than that. I am hence inclined to declare the covariates "significant" as per the one-tailed z-test.
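
For concreteness, this is the normal-approximation z-test I have in mind (a minimal sketch; the elpd_diff and se_diff values below are placeholders, not my actual results):

# Placeholder values in the range described: 1.64 < |elpd_diff/se_diff| < 1.96
elpd_diff <- -4.2  # hypothetical change in elpd_loo when the covariate is removed
se_diff   <- 2.4   # hypothetical standard error of that difference

z <- elpd_diff / se_diff
p_one_tailed <- pnorm(z)            # tests only whether removal makes elpd worse
p_two_tailed <- 2 * pnorm(-abs(z))  # tests a difference in either direction
c(z = z, one_tailed = p_one_tailed, two_tailed = p_two_tailed)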

Any input?

bump

Bump. Btw, I'd appreciate it if the title were worded more accurately. A better title is 'One or two-tailed normal approximation for elpd_loo?'

I've changed the title per your suggestion.


Bump. It may also be worth mentioning that I've detected an interaction which poses the same dilemma. The elpd_diff is 8.43 and the se_diff is 4.8, which yields a z-value of about 1.75. That's significant in a one-tailed z-test but not in a two-tailed one.

I don't recommend dichotomizing to non-significant vs significant with some arbitrary threshold. It's better to report the actual values.

As I would not report non-significant vs significant, I don't have an opinion on one-tailed vs two-tailed. I'm sorry I'm not able to help with this.

It's difficult to say much without seeing the loo diagnostics and knowing more about the model, but if both the difference and se_diff are large, it indicates that the models are making very different predictions, and it would be useful to look at how their predictions differ. If you want, you can post the loo diagnostics and more info about the model, and I can then comment more.
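
For example, one way to see where the predictions differ is to compare the pointwise elpd contributions stored in the loo objects (a minimal sketch; loo_a and loo_b stand for the loo objects of the two models being compared):

# Each loo object stores per-observation elpd values in its $pointwise matrix
elpd_diff_pointwise <- loo_a$pointwise[, "elpd_loo"] - loo_b$pointwise[, "elpd_loo"]
summary(elpd_diff_pointwise)
# Observations with large absolute differences are where the models disagree most;
# plotting them against covariates or outcomes shows where that happens.
plot(elpd_diff_pointwise)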

You could also use an application-specific loss or utility (instead of the log score), so that you could calibrate the difference as practically non-significant vs practically significant based on how the model will be used.

(btw, bumping doesn't increase my answer speed as I tend to visit here only once per week, but very frequent bumping feels impolite)


Hi,

EDIT: Looks like Aki posted while I was writing; he's definitely more knowledgeable about the topic than I am, so his advice should take precedence over mine.

This would IMHO depend a lot on what your priors for the "useless" parameters look like. I did a quick check, and it seems that empirically this is not completely accurate:

library(rstanarm)
set.seed(32156855)
# Pure noise: y is unrelated to both x1 and x2
dd <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
fit1 <- stan_glm(y ~ 1, data = dd)           # intercept only
fit2 <- stan_glm(y ~ 1 + x1, data = dd)      # one useless covariate
fit3 <- stan_glm(y ~ 1 + x1 + x2, data = dd) # two useless covariates

loo1 <- loo(fit1, cores = 1)
loo2 <- loo(fit2, cores = 1)
loo3 <- loo(fit3, cores = 1)

loo_compare(loo1, loo2, loo3)

#     elpd_diff se_diff
# fit1  0.0       0.0   
# fit2 -0.7       1.1   
# fit3 -1.5       1.6   

The values above are pretty typical across multiple seeds, so loo definitely can get worse by more than 0.5 per parameter. The difference can be made smaller by using tight priors centered on 0 for the coefficients (sketched after the next example). We can also make the difference arbitrarily worse by badly misspecifying priors for the parameters:

set.seed(235488)
dd <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
prior <- normal(location = 1, scale = 0.1) # deliberately misspecified: tight and centered away from 0
fit1 <- stan_glm(y ~ 1, data = dd, prior = prior)
fit2 <- stan_glm(y ~ 1 + x1, data = dd, prior = prior)
fit3 <- stan_glm(y ~ 1 + x1 + x2, data = dd, prior = prior)

loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo3 <- loo(fit3)

loo_compare(loo1, loo2, loo3)
#      elpd_diff se_diff
# fit1  0.0       0.0   
# fit2 -3.2       1.2   
# fit3 -8.8       1.9   
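
And for comparison, a sketch of the tight-priors variant mentioned above, reusing the same simulated data (I have not fixed the numbers here, so treat the output as seed-dependent):

# Tight priors centered on 0 shrink the useless coefficients towards zero,
# so the elpd_loo penalty for including them is typically smaller.
prior_tight <- normal(location = 0, scale = 0.1)
fit1_t <- stan_glm(y ~ 1, data = dd)
fit2_t <- stan_glm(y ~ 1 + x1, data = dd, prior = prior_tight)
fit3_t <- stan_glm(y ~ 1 + x1 + x2, data = dd, prior = prior_tight)

loo_compare(loo(fit1_t), loo(fit2_t), loo(fit3_t))
# The elpd_diff values should usually be closer to 0 than with the default priors,
# though the exact output varies by seed.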

So I suspect that, especially if your models are not very simple and you are not using tight priors, the guarantees against a large increase might be weaker than you expect.

I would repeat what was said in the original thread and generally caution against making binary decisions based on thresholds. If using a one-tailed or two-tailed normal approximation makes a difference, a simple interpretation is that you don't have enough data to make a clear decision. Remember that it is almost certain that all your models are at least slightly misspecified, so if your decision is sensitive to minor changes in the values of elpd_diff / se_diff, you are at very high risk of being misled. Looking at the result of loo_model_weights might also be informative about whether a clear decision is warranted (e.g. whether one model has weight close to 1).
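
For instance, a minimal sketch of that loo_model_weights check, reusing loo objects like loo1, loo2, and loo3 from the earlier example:

# Stacking weights for the candidate models: a single weight close to 1 suggests
# a clear preference, while diffuse weights suggest no clear winner.
loo_model_weights(list(loo1, loo2, loo3))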

Also note that loo captures how good we would expect the model to be at predicting new data. That's a different goal than verifying whether "there is an effect/interaction", which would in many contexts IMHO be an ill-posed question; examining the posterior for the coefficient in the larger model might be more relevant to many scientific questions. My current thinking on the topic and some possible alternatives are at Hypothesis testing, model selection, model comparison - some thoughts

I'll also add that I don't think frequent bumping of unanswered topics is productive: we unfortunately usually have a backlog of questions that go unanswered for several days, and activity may even take a question off the radar for people who specifically look for unanswered questions. Today I answered several questions that were left without a reaction for longer than yours. (If a topic has been abandoned for roughly a week or longer, then bumping may be sensible.)


Many thanks to both of you for the sound advice.

I've had to start the whole model-fitting ordeal over due to a couple of annotation errors in the dataset, so I don't (yet) have concrete test statistics to report.

But I ran an experiment similar to Max's. I used exact LOO in a frequentist setting because it's faster to compute with a simple dataset (no model compilation delay) and because, AFAIK, it is what loo() approximates.

myseed <- 1
N <- 100
set.seed(myseed)
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- rbinom(N, size = 1, prob = plogis(x1)) # x1 is the only true predictor
dd <- data.frame(x1, x2, y)
fit1 <- glm(y ~ x1, family = binomial, data = dd) # model WITHOUT the useless covariate
fit12 <- glm(y ~ x1 + x2, family = binomial, data = dd) # model WITH the useless covariate
loo1 <- ell_loo(fit1, dd, FALSE) # a simple home-made function for exact frequentist loo-cv
loo12 <- ell_loo(fit12, dd, FALSE)
difference <- loo12$Stats["loo.ll"] - loo1$Stats["loo.ll"] # larger model minus simpler model

Then, starting from 1, I had a while loop increase the seed by 1 and repeat this experiment until the loo score difference favored the simpler (true) model by more than 2. Differences of more than 1 were frequent, and seed 7394 yielded a difference greater than 2. So, as Max suggests, there is apparently no guarantee that the removal of a strictly meaningless covariate won't increase the loo log score by much more than 0.5.
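
For reference, here is roughly what that loop looked like; exact_loo_ll() is a stand-in written for this sketch rather than my actual ell_loo() helper, so the details differ:

# Exact leave-one-out log-likelihood for a logistic-regression formula
exact_loo_ll <- function(formula, data) {
  ll <- vapply(seq_len(nrow(data)), function(i) {
    fit_i <- glm(formula, family = binomial, data = data[-i, ])
    p_i <- unname(predict(fit_i, newdata = data[i, , drop = FALSE], type = "response"))
    dbinom(data$y[i], size = 1, prob = p_i, log = TRUE)
  }, numeric(1))
  sum(ll)
}

myseed <- 0
repeat {
  myseed <- myseed + 1
  set.seed(myseed)
  x1 <- rnorm(100)
  x2 <- rnorm(100)
  y <- rbinom(100, size = 1, prob = plogis(x1)) # x1 is the only true predictor
  dd <- data.frame(x1, x2, y)
  # Positive difference means the simpler (true) model has the better loo log score
  difference <- exact_loo_ll(y ~ x1, dd) - exact_loo_ll(y ~ x1 + x2, dd)
  if (difference > 2) break
}
c(seed = myseed, difference = difference)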

This would seem to suggest that my assumption was wrong. Even though removing meaningless covariates normally improves the loo fit, it is not guaranteed to do so, and it is entirely possible for removal to even degrade the loo fit, sometimes by quite a bit.

I guess this could be taken as evidence that you need the two-tailed z-test after all…

To clarify, loo() approximates Bayesian cross-validation, so there's a difference; since Bayesian inference reduces the variance, the results would likely differ if you repeated the experiment with Stan and loo(), though they should be similar within an order of magnitude. See also the results in Uncertainty in Bayesian leave-one-out cross-validation based model comparison, which suggest that the normal approximation is not well calibrated when the difference is less than 4. Thus, in a case similar to your simulation, it is likely that a two-tailed z-test would not be calibrated and using it would be inappropriate.
