Hi,
This is kind of a general question, not very Stan specific. I’m trying (together with @ducoveen) to find papers with best practices or not even best, but any practice, for comparing predictions from Bayesian models with predictions from Machine Learning models such as k-nearest neighbors, support vector machine, etc.

I guess I could use the mean of the posterior, get the fitted value and then do RMSE. But I was wondering if people looked at this before. (I have to guess that someone did, but I’m struggling to find any paper at all!)

(I know that if I have a utility function this would be straightforward, but I’m wondering about cases where there isn’t one.)

There are plenty of ways to compare predictions, although you might find your options a bit limited if some (or most) of these predictions are point predictions. The keyword to look for is (strictly) proper scoring rules. Here’s a nice introductory tutorial aimed at epidemic models: https://arxiv.org/abs/2205.07090 .
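To make that concrete, here is a minimal sketch (my own illustration, not taken from the tutorial) of the log score, one of the simplest strictly proper scoring rules: it penalizes a forecast by the negative log of the probability it assigned to the outcome that actually occurred.

```python
import math

def log_score(prob_of_outcome: float) -> float:
    """Negative log predictive probability of the realized outcome.
    Lower is better; a forecast that gave probability 1.0 to the
    true outcome scores exactly 0."""
    return -math.log(prob_of_outcome)

# A confident, correct forecast scores better (lower) than a hedged one:
confident = log_score(0.9)  # forecast gave 0.9 to what happened
hedged = log_score(0.5)     # forecast gave 0.5 to what happened
assert confident < hedged
```

Strict propriety is what makes this an honest metric: reporting your true predictive probability minimizes your expected score, so there is no incentive to hedge or exaggerate.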

I think Max is getting at an interesting idea here, namely that there is a fundamental tension between ML-style predictions and Bayesian inference: ML is (usually) concerned with minimizing some loss function, whereas Bayesian inference cares about things like p(\theta|y) and p(\tilde{y}|\theta,y). Of course there is certainly some overlap – many ML techniques can be seen as special cases of (nonparametric) statistical models, and you can probably get a low evaluation-set MSE by taking the posterior predictive mean. But the fundamental disconnect hits because you would really only care about things like strictly proper scoring rules if you already subscribe to a need for a probabilistic model of reality. Which, I think, is not a universally agreed upon premise in the classification and prediction world.

This was extremely useful. For continuous outcomes, what I wanted is called the “continuous ranked probability score” (CRPS). It can be calculated for probabilistic models, and for deterministic models it reduces to the mean absolute error. Also, according to Gneiting, T., & Raftery, A. E. (2007), it is a strictly proper scoring rule.

The best part is that it’s quite easy to calculate and it’s actually kind of implemented in the dev version of loo:
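For reference, here is a minimal sketch of the sample-based estimator CRPS(F, y) = E|X − y| − ½ E|X − X′|, with X, X′ independent draws from the predictive distribution F. The `crps_sample` name and the pairwise approximation are my own; the interface in loo’s implementation may differ.

```python
import numpy as np

def crps_sample(draws: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate for one observation.
    CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F independent.
    Both expectations are approximated from posterior predictive draws."""
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - y))
    # Pairwise mean absolute difference approximates E|X - X'|
    term2 = np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - 0.5 * term2

# For a degenerate (point) forecast, CRPS reduces to the absolute error:
point = np.array([3.0, 3.0, 3.0])
assert abs(crps_sample(point, 5.0) - 2.0) < 1e-12
```

The last assertion illustrates the point above: when all draws coincide (a deterministic prediction), the second term vanishes and the score is just |prediction − y|, i.e. averaging over observations gives the MAE.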

Continuing with this, and pinging the people I see contributed to CRPS: @avehtari @yuling @jonah.
I know it says “continuous”, but why? Do you think that this rule won’t make sense with discrete predictions?

And maybe a second question about the Brier score. I see that it’s used for probabilistic predictions of discrete outcomes (but it could also work for non-probabilistic predictions, right?). For two possible outcomes it is:

BS = \frac{1}{N}\sum_{t=1}^{N}(f_{t} - o_{t})^{2}

where f_{t} is the probability that was forecast, o_{t} is the actual outcome of the event at instance t (0 if it does not happen and 1 if it does happen), and N is the number of forecasting instances.

However, it assumes that for each discrete outcome there is one probability mass value f_t, but MCMC gives you many draws of this probability. I saw that in some papers people use the mean or median. But how would you implement it taking into account this uncertainty?
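One way to see what is at stake: given posterior draws of f_t, you can either plug in the posterior mean and score it, or score every draw and average; the two differ exactly by the posterior variance of f_t. A small sketch with synthetic draws (the Beta(8, 2) posterior is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior draws of the event probability for one instance
p_draws = rng.beta(8, 2, size=4000)
outcome = 1  # the event happened

# Option A: plug the posterior mean probability into the Brier score
brier_mean = (p_draws.mean() - outcome) ** 2

# Option B: score every draw, then average (expected Brier under the posterior)
brier_per_draw = np.mean((p_draws - outcome) ** 2)

# The two differ exactly by the posterior variance of p:
# E[(p - o)^2] = (E[p] - o)^2 + Var[p]
assert abs(brier_per_draw - (brier_mean + p_draws.var())) < 1e-12
```

So taking the mean of the draws first (option A) always yields a lower or equal score; averaging per-draw scores (option B) additionally penalizes posterior uncertainty in the forecast probability itself.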

At the risk of sounding like it’s just semantics, I’d say Bayesian and Machine Learning are orthogonal concepts, special cases of inference and models respectively – for any given model you can use different inference approaches. In practice, specific problems lend themselves to different model+inference solutions, and communities cluster around certain combinations, i.e. practices, for all kinds of reasons (justified or not).

Machine Learning seems to put a lot more emphasis on the models and relies on whatever optimization algorithms will get the job done, while Bayesian inference is model-agnostic. Since they are not mutually exclusive, you could do Bayesian inference on an ML model; you’d get the same kind of result by using the MAP or MLE point estimate. The fair comparison would then probably be the fit of a model using MCMC-based inference versus whatever optimization algorithm is used for ML (RMSProp, Adam, etc.). My guess is MCMC would reveal all kinds of issues with many of these models, in addition to being a lot more costly, which is probably the reason it’s not the way it’s done.

Ok, true, but that’s also orthogonal to my question. I guess my title was misleading: what I really wanted to know is how to compare the predictions of models with probabilistic vs. deterministic predictions.

Right. So, more to the point (and following on from part of my reply above), there are probably two main aspects here. First, the theoretical rationale behind the method – e.g. HMC uses information about the curvature of the surface and an empirical metric to sample points in high-probability areas, while something like Adam uses a weighted-average tuning of the “momentum”/“learning rate” that is only empirically justified.

The second aspect is the practical results, and here it depends only on the actual optimum found by whatever method. That MCMC produces a set or a distribution of parameters is only because the method keeps the iterations based on an accept/reject mechanism, while common ML “gradient descent” methods mostly don’t care about intermediate steps. Assuming all methods “work”, there wouldn’t be much to compare, but again that’s a strong assumption, and relaxing it may mean going down the rabbit hole of multimodality, computational cost, the goal of ML models like “prediction”, and even more philosophical questions like point estimates vs. distributions of estimates.

I understand that there’s a very practical aspect of your question, that requires technically matching aspects of Machine Learning and Bayesian approaches, and I think that can be done, but maybe the reason I am insisting to some extent on orthogonal themes is because doing a simple comparison probably means willfully ignoring a host of complex decisions.

I think a point I was trying to make (and maybe caesoma as well) is that in order to compare probabilistic and deterministic predictions you have to get your audience to care about probabilistic predictions in the first place. If they do, then I think it is trivial, with proper scoring rules, that probabilistic predictions will outperform deterministic ones. Perhaps you don’t have this problem in your field, but speaking from personal experience in mine, phrases like “sampling variability” and “epistemic uncertainty” tend to be met with blank stares at times.

Then it’s called just the “ranked probability score”. It is used; the special case for a categorical target was proposed by Brier, and the binary case is known as the Brier score.
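For what it’s worth, here is a minimal sketch of the ranked probability score computed on cumulative probabilities, showing that the binary case recovers the Brier score. The function name is my own, and conventions differ (some definitions divide by K − 1 for K categories):

```python
import numpy as np

def ranked_probability_score(forecast_probs, outcome_index):
    """Ranked probability score for a single ordinal forecast.
    Compares cumulative forecast probabilities with the cumulative
    step-function distribution of the observed category.
    Lower is better; for two categories it equals the Brier score."""
    forecast_probs = np.asarray(forecast_probs, dtype=float)
    cum_forecast = np.cumsum(forecast_probs)
    cum_outcome = (np.arange(len(forecast_probs)) >= outcome_index).astype(float)
    return np.sum((cum_forecast - cum_outcome) ** 2)

# Binary case: forecast (0.7, 0.3), and category 1 (the event) occurs.
# cum_forecast = (0.7, 1.0), cum_outcome = (0, 1) -> 0.7^2 + 0 = 0.49,
# matching the Brier score (f_t - o_t)^2 = (0.3 - 1)^2 = 0.49.
assert abs(ranked_probability_score([0.7, 0.3], 1) - 0.49) < 1e-12
```

Because it works on cumulative probabilities, RPS penalizes probability mass placed on categories *far* from the observed one more heavily, which is exactly what makes it suited to ordinal targets.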

Yes, I know. But it’s based on the idea of having a single probability for each outcome, and in a Bayesian setting one has many draws of the probability of each outcome. I wondered if there was a scoring rule that accounted for that, or whether I should just take the mean of the draws.

How is this different from having a single predictive density (obtained by integrating over the posterior or LOO-posterior) for each possible outcome in the continuous case?

There are interesting examples in Kaggle competitions where it was not just about point predictions, e.g. the M5 uncertainty and the pulmonary fibrosis progression challenges, which featured some reasonable metrics for assessing the uncertainty provided alongside point predictions. If I remember correctly, in the second one a Bayesian random-effects model on the tabular data (ignoring the images) was good enough that you’d question whether in practice you’d bother with the imaging at all.