Thanks so much for the detailed response!
Can you tell us more about your model and modeling task, so I can recommend easier-to-interpret utility or cost functions?
Sure. I am running a cognitive psychology experiment with a 3x2 between-subjects design. The goal is to find out which matters more for decisions to accept or reject a deal: the initial time period spent waiting for it (“Floor”) or the percent reduction in time from an initially longer period (“Discount”).
Since I have a theory I want to test (that it’s a combination of the two, not just the Floor value), I want to find the model that provides the best theoretical account of my data. I’m less concerned with overall predictive utility than with the model’s ability to explain my data. At the same time, I recognize that I need some way to critically evaluate the model’s performance to make sure it’s not garbage, which is why I’m now trying to learn about LOOIC.
My two models (so far) are below. I’m still trying to fix the code for a third one.
Main_EffectsModel <- stan_glm(
  Accept_Reject ~ Discount + Floor,
  family = binomial(link = "logit"),
  data = sonadata_clean,
  prior = student_t(df = 5, location = 0, scale = NULL, autoscale = TRUE),
  # prior_intercept = normal(),
  # prior_PD = TRUE,
  algorithm = "sampling",
  mean_PPD = TRUE,
  adapt_delta = 0.95,
  # QR = FALSE,
  # sparse = FALSE,
  chains = 3, iter = 50000, cores = 3,
  diagnostic_file = file.path(tempdir(), "df.csv"))
Interaction_Model <- stan_glm(
  Accept_Reject ~ Discount * Floor,  # expands to Discount + Floor + Discount:Floor
  family = binomial(link = "logit"),
  data = sonadata_clean,
  prior = student_t(df = 5, location = 0, scale = NULL, autoscale = TRUE),
  # prior_intercept = normal(),
  # prior_PD = TRUE,
  algorithm = "sampling",
  mean_PPD = TRUE,
  adapt_delta = 0.95,
  # QR = FALSE,
  # sparse = FALSE,
  chains = 3, iter = 50000, cores = 3,
  diagnostic_file = file.path(tempdir(), "df.csv"))
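If I understand the LOOIC workflow correctly, comparing the two fits would look something like this (a sketch using the loo package that rstanarm wraps; the object names are the models above):

```r
library(loo)

# PSIS-LOO for each fitted rstanarm model
loo_main <- loo(Main_EffectsModel)
loo_int  <- loo(Interaction_Model)

# elpd_diff and its SE indicate which model has better
# estimated out-of-sample predictive performance
loo_compare(loo_main, loo_int)
```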
From the loo output I can see that you are using 75,000 posterior draws, which is probably about 71,000 more than you need, given that you seem to have quite a simple model (p_loo around 4-6) and plenty of observations, so the posterior is likely to be very easy.
The reason I am using so many iterations is bayestestR::bayesfactor_models(). I was initially using 5,000 draws, but when I ran that command to compare my models I got this message in the console…
Bayes factors might not be precise.
For precise Bayes factors, it is recommended sampling at least 40,000 posterior samples.
Computation of Bayes factors: estimating marginal likelihood, please wait…
…so I added another zero and made it 50,000 iterations to be safe.
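For reference, the comparison call that produced that message was along these lines (a sketch; bayestestR estimates marginal likelihoods via bridge sampling, which is why it asks for so many draws):

```r
library(bayestestR)

# Bayes factor comparing the two rstanarm fits
# (bridge sampling needs far more posterior draws than loo does)
bayesfactor_models(Interaction_Model, denominator = Main_EffectsModel)
```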