I have two nested models estimated in Stan, and currently I am comparing them by LOO and WAIC, as this seems to be the “best” way. However, I was told that the people I am sharing results with may want to know about the actual “significance” of the comparison. I assume this is referring to a chi-squared test? How can I determine whether one model is statistically significantly better than the other, since WAIC and LOO do not seem to produce a test statistic in that sense? If a chi-squared test is possible, how would I go about checking whether the additional parameter is needed? Or the likelihood ratio test? I was not sure how to go about this in the context of Bayesian inference.
In general we don’t really do formal statistical significance testing or recommend it, but somewhat along those lines we do have the loo_compare function in the loo package, which gives an approximate standard error for the difference in expected log predictive density (i.e., you can check how many standard errors away from zero the difference is). There is information on how to interpret it in the Details section of Model comparison — loo_compare • loo and also in the FAQ: Cross-validation FAQ • loo.
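If it helps to see the mechanics, the quantity loo_compare reports can be sketched by hand. Below is a minimal Python sketch with made-up pointwise elpd values (in practice these would come from the pointwise results of loo() for each model), showing how the elpd difference and its standard error are formed:

```python
import numpy as np

# Hypothetical pointwise elpd values for two models, one value per
# observation (e.g. the pointwise elpd_loo column from loo() in R).
elpd_model_a = np.array([-1.2, -0.8, -1.5, -0.9, -1.1, -1.3, -0.7, -1.0])
elpd_model_b = np.array([-1.4, -1.0, -1.6, -1.2, -1.2, -1.5, -0.9, -1.1])

diff = elpd_model_a - elpd_model_b               # pointwise differences
elpd_diff = diff.sum()                           # total elpd difference
se_diff = np.sqrt(len(diff) * diff.var(ddof=1))  # SE of the difference

# Rough rule of thumb: how many standard errors from zero?
z = elpd_diff / se_diff
```

Note that this is a rough heuristic, not a calibrated hypothesis test; the FAQ linked above discusses when the normal approximation for the difference is reasonable.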
If your collaborator or colleague has a specific “significance” test in mind, it would be better to directly ask them for clarification.
In general, we need to think about the goal of evaluation. By what standard do we define a good model: predictive performance, or fidelity to the true model? And do we also want parsimony? Sometimes we know the data were generated from a true model, and the goal of model evaluation is then to find out which candidate model best reflects the true one; sometimes we do not know whether a true model exists, and the goal is to find out which model makes the best predictions for future data arising from the same data-generating process. See Wasserman 2000. Further discussion of different model-evaluation scenarios is available in the Introduction section of Yao et al. 2018.
LOO and WAIC are both predictive criteria, in the sense that they compare the predictive performance of candidate models. The traditional AIC also evaluates predictive performance. See Vehtari et al. 2017.
Parsimony is often built into evaluation criteria to prevent overfitting, by penalizing the model according to its (effective) number of parameters. Methods of penalization are covered in all the papers above.
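As one concrete illustration of such a penalty, WAIC's effective number of parameters can be computed from the pointwise log-likelihood matrix. This is a minimal Python sketch with a simulated (made-up) log-likelihood matrix standing in for one extracted from a fitted Stan model:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical S x N matrix of pointwise log-likelihoods:
# S posterior draws, N observations (in practice, extracted from the fit).
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(4000, 50))

# WAIC's complexity penalty: the effective number of parameters is the
# sum over observations of the posterior variance of the log-likelihood.
p_waic = log_lik.var(axis=0, ddof=1).sum()

# Log pointwise predictive density, and the penalized criterion.
lppd = np.log(np.exp(log_lik).mean(axis=0)).sum()
elpd_waic = lppd - p_waic
```

The penalty grows with model flexibility, which is how the criterion trades off fit against parsimony; see Vehtari et al. 2017 for the derivation.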
Thank you. To clarify, this is a simulation study in which I generated data according to a covariate model and then estimated parameters in both the covariate and baseline models using those data. This is what I am currently doing:
LOO/WAIC to tell me predictive performance, i.e., which model makes the best predictions for future data. I assume this can also provide some information on whether the covariate model is overfitting.
I am also calculating RMSE and bias for the parameters estimated in Stan against the true parameter values, to gauge estimation accuracy.
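For concreteness, the RMSE and bias computation I mean looks like this (a minimal Python sketch with made-up numbers; est_theta stands in for posterior means from the Stan fit across simulation replicates):

```python
import numpy as np

# Hypothetical true parameter value (repeated per replicate) and the
# corresponding posterior-mean estimates from each simulated dataset.
true_theta = np.array([0.5, 0.5, 0.5, 0.5])
est_theta = np.array([0.45, 0.58, 0.52, 0.47])

err = est_theta - true_theta
bias = err.mean()                  # average signed error
rmse = np.sqrt((err ** 2).mean())  # root mean squared error
```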
I do not necessarily have a technical goal beyond showing that the covariate model is useful and should be used over the baseline model when the covariate effect is present.
For overfitting, compare the performance of a simple model and its complex form with additional parameters. If the performance of the complex one is not better than that of the simple one, then the added parameters are overfitting. Again, an integrated solution is to build the penalty for extra parameters into the evaluation criterion.
PSIS-LOO estimates a KL-divergence-based measure of predictive performance and implicitly penalizes extra parameters. Follow its user guide and make sure all of its assumptions are met (e.g., check the Pareto-k diagnostics).
The unpenalized RMSE, as its name suggests, does not penalize overfitting: RMSE approaches zero as the model overfits more and more, so that the predictions become almost identical to the observed data. RMSE also carries an underlying assumption of normality. See Hodson 2022.
“Significance” test is a very frequentist term. Your colleague or collaborator may therefore mean the traditional frequentist AIC or BIC. Both penalize overfitting. Note that AIC evaluates predictive performance, while BIC evaluates fidelity to the true model, assuming a true model exists. See Wasserman 2000.