Calculating RMSE/MAE for Bayesian models

Hi there,

I recently raised a feature request on the brms github ( about calculating RMSE (root mean square error) and MAE (mean absolute error) for Bayesian models given the k-fold IC implemented within the package.

Commonly, these metrics are calculated for each test set, using the model fitted to the corresponding training set, and then one takes the mean or median (+SE) across the k sets. However, @avehtari stated this may not always be the best approach. Given these metrics are easy to interpret, as they are in the units of the response variable, I believe they have wide utility, but I’d like to know more about how best to calculate them in a Bayesian framework.

Thanks in advance


That is common only in certain subgroups, which makes it not so common in general.

Calculating a statistic for a test set of size m is OK if you plan to use your model to make future predictions for data sets of size m. However, when these statistics are calculated for each test set, the test set size is usually chosen arbitrarily, with no known connection to how the model is going to be used in the future. The arbitrary test set size makes the uncertainty for the future predictive performance arbitrary. Yes, you can still compare models with the same test set division, but, for example, k-fold-cv with a smaller k increases the uncertainty (which is then often reduced by making several random divisions for k-fold-cv). If the model is going to be used to predict future observations one at a time, then getting the relevant uncertainty from the statistics computed for test sets is complicated.

Currently kfold in the loo package (also used by brms) assumes single predictions at a time in the future. kfold makes one random division into k sets, computes cross-validation predictions for all observations, and then the statistics are computed for test sets of size 1, as in leave-one-out cross-validation. Thus, the default behavior of kfold can be considered to approximate loo-cv. The loo package also has helper functions to divide the data respecting the group structure in hierarchical models, leaving whole groups out at the same time. Currently the SE is still computed as in loo-cv, which is OK for some cases. We will eventually add SE computation as in leave-one-group-out computation. Leave-one-group-out SE is a bit more complicated if the groups have different sizes.
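
To make the two kinds of division concrete, here is a minimal Python sketch of a random split and a group-respecting split. It is only an illustration of the idea, not the loo package's implementation; the function names `random_folds` and `grouped_folds` and the toy group labels are made up for this example.

```python
import random

def random_folds(n, k, seed=0):
    """Assign each of n observations to one of k folds at random.

    Fold sizes differ by at most one observation.
    """
    labels = [i % k for i in range(n)]
    random.Random(seed).shuffle(labels)
    return labels

def grouped_folds(groups, k, seed=0):
    """Assign whole groups to folds, so that no group is split across folds."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of_group = {g: j % k for j, g in enumerate(unique)}
    return [fold_of_group[g] for g in groups]

# Hypothetical example: 12 observations in 4 folds, and 8 observations
# belonging to 4 groups of unequal size split into 2 folds.
labels = random_folds(12, 4)
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
group_labels = grouped_folds(groups, 2)
print(labels)
print(group_labels)
```

With the grouped split, fold sizes can differ when the groups have different sizes, which is one reason the leave-one-group-out SE is harder to compute.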

They are useful, but you need to specify in more detail what you want to predict in order to know the best way to calculate them, in a Bayesian or in any other framework.

As you see, there are different options, and that’s why we’ll first make it easier to get cv predictions, so that users can more easily compute what they need, and then add documentation and case studies to illustrate the different variations. If you have an interesting example for RMSE/MAE computations, I’m happy to help make a small case study for the computations (after we have kfold_predict functions…).


Thank you for your detailed response @avehtari. I agree with you entirely regarding the estimate of performance/error being arbitrary, and I look forward to seeing how the cv predictions are implemented.

In my case, I am comparing new models with pre-existing models for which only the regression coefficients are available. Comparing the RMSE values of the new models without training/test sets or K-folds against the RMSE values from pre-existing equations seems unfairly biased toward the new models, even if the training/test set performance is arbitrary. That said, the results with or without training/test or k-fold splits (I’ve tried both) seem equivalent to one another, so maybe it’s OK. I would love to hear your thoughts on this, as RMSE/MAE seem the best option for comparing model performance in this scenario.

Sorry, I don’t understand what you mean by “without training/test sets or K-folds” and “with or without training/test or k-fold splits”. The size of the test sets depends on k, and if you compute k statistics and look at the SE of these, then the result depends on k, which is usually selected without considering the prediction task. To make it less arbitrary, I suggested computing n statistics even if using k-fold splits (with no overlapping test sets), unless you have a prediction structure that fixes k and the future prediction data set sizes.

Apologies, I will attempt to make it clearer. If the model is fitted sequentially to each training set and then assessed on its prediction of each test set in K-fold CV (in my case K=10), you have a proxy for assessing your model on ‘untested’ data and can derive the RMSE from this. That is what I meant by “with training/test sets”. Where I say “without training/test sets”, I simply derived the RMSE from the actual minus predicted values for the entire dataset used in model formulation.
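
The two calculations described here can be sketched in Python as follows. This is only an illustration under loud assumptions: a plain least-squares line stands in for the Bayesian model, leave-one-out stands in for K=10, and the data are invented for the example. The helper names `fit_line`, `rmse` and `mae` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

# Invented illustrative data (not from this thread).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 8.4, 8.8, 10.1, 11.2]

# "Without training/test sets": fit and evaluate on the same data.
a, b = fit_line(xs, ys)
in_sample_errors = [y - (a + b * x) for x, y in zip(xs, ys)]

# "With training/test sets": every observation is predicted by a model
# that was fitted without it (leave-one-out here, for simplicity).
cv_errors = []
for i in range(len(xs)):
    a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    cv_errors.append(ys[i] - (a_i + b_i * xs[i]))

print("in-sample RMSE/MAE:", rmse(in_sample_errors), mae(in_sample_errors))
print("cross-validated RMSE/MAE:", rmse(cv_errors), mae(cv_errors))
```

For a least-squares fit the leave-one-out errors are never smaller in magnitude than the in-sample residuals, so the in-sample RMSE is the optimistic one of the two; how large the gap is depends on the data and the model.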

What do you mean by n statistics? Excuse my ignorance.

Sorry for not being more clear in my previous messages.

Say you have 12 observations,

you can divide this as
k=n ((1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12))
k=6 ((1,2),(3,4),(5,6),(7,8),(9,10),(11,12))
k=4 ((1,2,3),(4,5,6),(7,8,9),(10,11,12))
k=3 ((1,2,3,4),(5,6,7,8),(9,10,11,12))
k=2 ((1,2,3,4,5,6),(7,8,9,10,11,12))

and naturally for k=6,4,3,2 other permutations are possible. In
k-fold-cv we leave out one group at a time, train with the others and
predict for the left-out set. From your earlier posts, I understood
that you would compute RMSE for k sets, e.g. if k=4, you will compute
RMSE for 4 sets which each have 3 observations. You will then have 4
statistics (RMSEs) and you would compute the SE of these 4 statistics
(RMSEs). Now you will get different results depending on which k you
use, and if that is not connected to your actual prediction task then
it is an arbitrary value. You may have a justification for choosing a
specific k, say you would in the future always predict for groups of
3, or say you have a hierarchical model with 4 groups, and you want to
know the predictive performance for new groups. If you don’t have a
specific reason to choose a specific k, then I suggest computing the
cross-validated squared error for each of the n individual observations.
Then you have n statistics (squared errors) from which you can compute
the RMSE and its SE. This way the result is less sensitive to the
specific k value. With k<n there are different possible permutations,
and if you repeat the data division with different permutations, you
can average the predictions over the different permutations for each
observation and then compute n squared errors, and then continue as in
the case of one permutation for folds.
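
A minimal Python sketch of the two options above, under stated assumptions: a least-squares line stands in for the Bayesian model, the 12 observations are simulated, and the names `fit_line`, `kfold_predict`, `summarize` and `repeated_kfold_rmse` are hypothetical. The SE of the RMSE uses a delta-method approximation, which is my assumption and only one possible way to obtain it.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def kfold_predict(xs, ys, k, seed):
    """One random division into k folds; returns (predictions, folds).

    Each observation gets exactly one prediction, made by a model that
    was fitted without that observation's fold.
    """
    n = len(xs)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[j::k] for j in range(k)]
    preds = [0.0] * n
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a + b * xs[i]
    return preds, folds

# Simulated data for illustration only.
rng = random.Random(1)
n = 12
xs = [float(i) for i in range(1, n + 1)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

def summarize(k, seed=0):
    preds, folds = kfold_predict(xs, ys, k, seed)
    sq_err = [(y - p) ** 2 for y, p in zip(ys, preds)]
    # Option A: k fold-level RMSEs and the SE of their mean (depends on k).
    fold_rmse = [math.sqrt(statistics.fmean([sq_err[i] for i in f])) for f in folds]
    se_folds = statistics.stdev(fold_rmse) / math.sqrt(k)
    # Option B: n squared errors, one per observation.
    mse = statistics.fmean(sq_err)
    se_mse = statistics.stdev(sq_err) / math.sqrt(n)
    rmse = math.sqrt(mse)
    se_rmse = se_mse / (2.0 * rmse)  # delta-method approximation (assumption)
    return {"fold_rmse": fold_rmse, "se_folds": se_folds,
            "sq_err": sq_err, "rmse": rmse, "se_rmse": se_rmse}

def repeated_kfold_rmse(k, seeds):
    """Average each observation's prediction over several random divisions,
    then compute n squared errors from the averaged predictions."""
    all_preds = [kfold_predict(xs, ys, k, s)[0] for s in seeds]
    avg = [statistics.fmean([p[i] for p in all_preds]) for i in range(n)]
    return math.sqrt(statistics.fmean([(y - a) ** 2 for y, a in zip(ys, avg)]))

for k in (12, 6, 4, 3, 2):
    s = summarize(k)
    print(k, round(s["rmse"], 3), round(s["se_rmse"], 3), round(s["se_folds"], 3))
```

In option A the number of statistics, and therefore their SE, changes with k, while in option B there are always n squared errors regardless of how the folds were formed.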


Thank you for the explanation @avehtari - I agree with you entirely and appreciate you taking the time to explain it so comprehensively. I will now apply your advice in my work :)
