Hi,
EDIT: Looks like Aki posted while I was writing - he’s definitely more knowledgeable about the topic than I am, so his advice should take precedence over mine.
This would IMHO depend a lot on what your priors for the “useless” parameters look like. I did a quick check and it seems that empirically this is not completely accurate:
library(rstanarm)
set.seed(32156855)
# 10 observations of pure noise - y is unrelated to x1 and x2
dd <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
fit1 <- stan_glm(y ~ 1, data = dd)            # intercept only
fit2 <- stan_glm(y ~ 1 + x1, data = dd)       # adds one irrelevant predictor
fit3 <- stan_glm(y ~ 1 + x1 + x2, data = dd)  # adds two irrelevant predictors
loo1 <- loo(fit1, cores = 1)
loo2 <- loo(fit2, cores = 1)
loo3 <- loo(fit3, cores = 1)
loo_compare(loo1, loo2, loo3)
# elpd_diff se_diff
# fit1 0.0 0.0
# fit2 -0.7 1.1
# fit3 -1.5 1.6
The values above are pretty typical across multiple seeds, so loo can definitely get worse by more than 0.5 per parameter. The difference can be made smaller by using tight priors centered on 0 for the coefficients (a sketch of this is shown after the second example below). We can also make the difference arbitrarily worse by badly misspecifying the priors for the parameters:
set.seed(235488)
dd <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))
# badly misspecified prior: coefficients are pushed towards 1, while the true effects are 0
prior <- normal(location = 1, scale = 0.1)
fit1 <- stan_glm(y ~ 1, data = dd, prior = prior)
fit2 <- stan_glm(y ~ 1 + x1, data = dd, prior = prior)
fit3 <- stan_glm(y ~ 1 + x1 + x2, data = dd, prior = prior)
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo3 <- loo(fit3)
loo_compare(loo1, loo2, loo3)
# elpd_diff se_diff
# fit1 0.0 0.0
# fit2 -3.2 1.2
# fit3 -8.8 1.9
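And to sketch the opposite direction (tight priors centered on 0): with the same simulated data, something like the code below should pull the elpd differences much closer to 0. I haven’t included its output here, and the object names (prior_tight, fit2_tight, fit3_tight) are just for illustration.

# strong shrinkage towards the true coefficient value of 0
prior_tight <- normal(location = 0, scale = 0.1)
fit2_tight <- stan_glm(y ~ 1 + x1, data = dd, prior = prior_tight)
fit3_tight <- stan_glm(y ~ 1 + x1 + x2, data = dd, prior = prior_tight)
# loo1 (intercept-only model) can be reused, as the coefficient prior does not affect it
loo_compare(loo1, loo(fit2_tight), loo(fit3_tight))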
So I suspect that, especially if your models are not very simple and you are not using tight priors, the guarantees against a large increase might be weaker than you expect.
I would repeat what was said in the original thread and generally caution against making binary decisions based on thresholds. If using a one-tailed versus a two-tailed normal approximation makes a difference, a simple interpretation is that you don’t have enough data to make a clear decision. Remember that it is almost certain that all your models are at least slightly misspecified. So if your decision is sensitive to minor changes in the values of elpd_diff / se_diff, you are at very high risk of being misled. Looking at the result of loo_model_weights might also be informative as to whether a clear decision is warranted (e.g. whether one model has weight close to 1).
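E.g. something like the minimal sketch below, which takes a list of the loo objects computed above and computes stacking weights by default:

# stacking weights across the three candidate models
loo_model_weights(list(loo1, loo2, loo3))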
Also note that loo captures how good we would expect the model to be at predicting new data. That’s a different goal than verifying whether “there is an effect/interaction”, which would in many contexts IMHO be an ill-posed question; examining the posterior for the coefficient in the larger model might be more relevant to many scientific questions. My current thinking on the topic and some possible alternatives are at Hypothesis testing, model selection, model comparison - some thoughts
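For example, with the rstanarm fits above, one quick way to look at the coefficient itself would be something like the sketch below (the choice of x2 and of a 95% interval is arbitrary):

# central posterior interval for the coefficient of x2 in the largest model
posterior_interval(fit3, prob = 0.95, pars = "x2")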
I’ll also add that I don’t think frequent bumping of unanswered topics is productive - we unfortunately usually have a backlog of questions that go unanswered for several days, and activity may even take the question off the radar for some people who specifically look for unanswered questions. Today I answered several questions that had been left without a reaction for longer than yours. (If a topic has been left without a response for roughly longer than a week, then bumping may be sensible.)