Posterior distribution of model performance metrics

When fitting a regression model there are a variety of performance metrics such as R^2, pseudo-R^2, mean squared error, and the rank correlation between X\beta and Y. One rank correlation measure is Somers' D_{xy}, which is a simple translation of the c-index / concordance probability / area under the ROC curve: D_{xy} = 2(c - 0.5).
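For concreteness, the c-to-D_{xy} translation can be checked on a toy binary example (a hypothetical sketch with made-up numbers, not the fast algorithm from the survival package):

```python
import numpy as np

def concordance(lp, y):
    """c-index: fraction of (event, non-event) pairs whose linear
    predictors are ordered correctly, counting ties as 1/2."""
    pos, neg = lp[y == 1], lp[y == 0]
    diff = pos[:, None] - neg[None, :]          # all pairwise comparisons
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# four observations, one discordant (event, non-event) pair
lp = np.array([0.2, 0.9, 0.6, 0.4])   # linear predictor X @ beta
y  = np.array([0,   1,   0,   1])
c = concordance(lp, y)                # 3 of 4 pairs concordant -> c = 0.75
dxy = 2 * (c - 0.5)                   # Somers' D_xy = 0.5
```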

To get credible intervals for such model performance metrics I'm taking a random sample of 500 of the posterior draws of \beta, computing 500 values of, say, D_{xy}, and computing quantiles from those 500. I am getting reasonable-looking credible intervals for the Brier score (mean squared prediction error in the binary Y case), but the credible interval for D_{xy} is about \frac{1}{5} as wide as a frequentist sampling-error-based confidence interval.
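A minimal sketch of that procedure, with a fabricated "posterior" standing in for real MCMC draws (all sizes and data here are illustrative, not from any real fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data and a fake 500-draw "posterior" for beta (illustration only)
n, p, n_draws = 200, 3, 500
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
y = (X @ beta_true + rng.logistic(size=n) > 0).astype(int)
beta_draws = beta_true + 0.1 * rng.normal(size=(n_draws, p))

def somers_dxy(lp, y):
    pos, neg = lp[y == 1], lp[y == 0]
    diff = pos[:, None] - neg[None, :]
    return 2 * ((diff > 0).mean() + 0.5 * (diff == 0).mean() - 0.5)

# one D_xy value per posterior draw, then equal-tailed 95% quantiles
dxy_draws = np.array([somers_dxy(X @ b, y) for b in beta_draws])
lo, hi = np.quantile(dxy_draws, [0.025, 0.975])
```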

In the limit in which there is only one predictor, every posterior draw for \beta will yield the same D_{xy} because every linear predictor vector will have a rank correlation of 1.0 with every other linear predictor draw.
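That invariance is easy to verify numerically: with a single predictor, any positive draw of the coefficient rescales x\beta monotonically, so the pairwise ordering, and hence D_{xy}, never changes (a toy check with simulated data):

```python
import numpy as np

def somers_dxy(lp, y):
    pos, neg = lp[y == 1], lp[y == 0]
    diff = pos[:, None] - neg[None, :]
    return 2 * ((diff > 0).mean() + 0.5 * (diff == 0).mean() - 0.5)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (x + rng.logistic(size=100) > 0).astype(int)

# any positive "posterior draw" of the single coefficient preserves the
# ranking of x * beta, so every draw yields the identical D_xy
vals = {somers_dxy(b * x, y) for b in (0.1, 1.0, 7.3)}
# the set collapses to a single value: a point-mass posterior for D_xy
```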

What are some better ways to think about this? Does my approach even work for non-rank-based metrics such as Brier score?


Sorry, it seems your question fell through. I am by far not an expert on this, but will try to provide at least my opinion. Maybe I am also just misunderstanding your question.

That is generally a reasonable approach, although I wonder why you wouldn't use all the posterior draws. Performance concerns are IMHO the only good reason to do that; if autocorrelation of the samples is a concern (which I think it should not be for posterior quantiles), then thinning is AFAIK preferable to taking a random subset of the samples.

Just to check: here c = P(Y_2 > Y_1 \mid X_2 \beta > X_1 \beta), where (X_1, Y_1) and (X_2, Y_2) are randomly chosen (predictor, response) pairs from the data set, right?

There are some technical limitations in that MCMC will only correctly (for some meaning of “correctly”) estimate posterior expectations of values that are square integrable (and potentially something more, @betanalpha would be more qualified to speak about that), but my guess is that D_{xy} should satisfy all those conditions easily.

Could you share more details about how you compute the frequentist case and the model you use? I can for example imagine the posterior for D_{xy} being very non-normal, which could result in the Bayesian estimate having very different quantiles than a normal approximation following a frequentist approach. Can you rule out that the priors are influencing this quantity? I can imagine the frequentist estimate being driven by extreme \beta values that were cut off by the prior.

I am not sure I can follow this - in the limit of what? In the limit of infinite data with fixed number of predictors, and for an identifiable model, I think the posterior of basically any quantity should concentrate on a single value, D_{xy} not being an exception. But I don’t see how that could be a problem…

Hope that helps at least to understand where I’ve misunderstood the question :-)

There should be no problems with MCMC estimators here, provided that the chains are exploring sufficiently well.

I can’t tell if the question is about some posterior retroactive metric based only on the fit and the observed data, or some calibration based on a simulated model configuration, simulated data, and comparison of the fit to the simulated model configuration. If the latter, there’s no reason to expect that a Bayesian calibration (testing against only model configurations in the scope of the prior model) would be close to a frequentist calibration (testing worst case over the entire model configuration space).


I am very sorry not to have checked the forum for a while, and didn’t realize that questions and useful ideas have come in.

I’m taking random samples only because one of the model performance metrics is Somers’ D_{xy} rank correlation (a simple rescaling of the c-index or concordance probability). Even with Terry Therneau’s very fast algorithm for this in the R survival package, execution time can mount up when you have censored data and need to repeat the calculation more than 500 times, especially when there are more than 5,000 observations.


You can get frequentist standard errors of such U-statistics using the general Hoeffding U-statistic variance formula (which needs all possible pairs), or using the unconditional bootstrap by sampling with replacement from the original data. Therneau has a fast approximation to the U-statistic method. So the goal is estimating sampling uncertainty in a concordance probability (or other measure such as Brier score or pseudo R^2).
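The unconditional bootstrap mentioned here can be sketched as follows; this is a toy illustration with simulated data and a hypothetical fixed linear predictor, not Therneau's fast U-statistic approximation:

```python
import numpy as np

def cindex(lp, y):
    pos, neg = lp[y == 1], lp[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = (x + rng.logistic(size=n) > 0).astype(int)
lp = 0.9 * x                      # stand-in for a fitted linear predictor

# unconditional bootstrap: resample (predictor, response) pairs with
# replacement from the original data and recompute the concordance
boot = np.empty(1000)
for b in range(1000):
    idx = rng.integers(0, n, size=n)
    boot[b] = cindex(lp[idx], y[idx])
lo, hi = np.quantile(boot, [0.025, 0.975])   # percentile bootstrap CI
```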

The Bayesian posterior is incredibly skewed. I’m using highest posterior density intervals, which seem to work; it’s just that they are narrower than the sampling-based (frequentist) intervals one sees.
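For a skewed, bounded metric like D_{xy}, the HPD interval can be computed as the shortest interval covering the desired posterior mass. A simple sort-based sketch, assuming a unimodal posterior and using a fabricated skewed sample in place of real draws:

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the draws (assumes unimodality)."""
    d = np.sort(np.asarray(draws))
    n = len(d)
    k = int(np.ceil(prob * n))            # draws inside the interval
    widths = d[k - 1:] - d[:n - k + 1]    # width of every candidate interval
    i = int(np.argmin(widths))
    return d[i], d[i + k - 1]

rng = np.random.default_rng(3)
# strongly skewed toy "posterior" for a metric bounded above by 1
draws = 1 - rng.beta(2, 20, size=4000)
lo, hi = hpd_interval(draws)
```

Because the mass is piled up near the upper bound, the HPD interval sits asymmetrically around the posterior mode, unlike an equal-tailed interval.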

I meant this to be something simpler. I’m not really talking about limits but instead the special case where there is one predictor, which means that uncertainty about \beta is irrelevant because it doesn’t affect a rank correlation measure (unless you flip the sign).

The process I’m using is to take the dataset as fixed and to sample the posterior draws of the regression coefficients \beta. For each posterior draw I compute e.g. a concordance probability (AUROC in the binary Y case). I then get the posterior distribution over all the draws. I would have thought that the deep issue here is something related to sampling uncertainty vs. uncertainty about a single unknown data generating process.


I guess I’m not sure why you have any expectation that the posterior credible interval will have any similarity to frequentist confidence intervals given that they are fundamentally different things.

Even if a frequentist confidence interval has proper coverage (necessary assumptions hold, or approximation error has been shown to be negligible, etc.), there are no guarantees that the confidence intervals realized from a single observation will have any particular behavior without knowing the full sampling distribution. Similarly, even when the model captures the true data generating process, there are no guarantees for what any particular realized Bayesian posterior distribution, and hence quantities like credible intervals, will look like.

Certainly for narrow, normally-shaped likelihood functions the realizations of many frequentist confidence intervals and Bayesian credible intervals (namely intervals of the parameter functions) will be similar but one still has to verify that the realized likelihood function is sufficiently nice in the direction of each of the parameters.

To flip things around a bit – why did you expect there to be any similarity in the first place?

That’s a key question, and thanks for the other thoughts. I had no expectations, just a wish that the posterior intervals would be useful to the researcher in describing the uncertainty of model performance. I want to have an exact interpretation to be able to explain to others, if the calculations are useful. From what you wrote here is my attempt:

The highest posterior density interval for the model’s predictive discrimination as measured by Somers’ D_{xy} captures our ability/inability to measure the true performance of the model on this dataset due to our lack of knowledge about the true regression coefficients generating this single set of observed data. Ordinary frequentist confidence intervals for D_{xy} (often mistakenly computed as if D_{xy} had a symmetric sampling distribution) capture the uncertainty in estimating D_{xy} due to not measuring the performance of the modeling process on an infinitely large training sample. In other words, the confidence interval focuses on sampling error whereas the posterior interval focuses on estimation error in estimating the performance in a single, never-growing dataset.

Suggestions welcomed.

Unfortunately the topics being discussed here are pretty subtle and I’m not sure if you’ll be able to explain what’s going on here without getting into at least some technical detail. In particular we have

Inference: quantifications of model configurations that are consistent with the observed data (at least with the scope of the model).

Calibration: quantification of how inferences behave as the observation is varied.

Within each of these we have frequentist and Bayesian variants (for more detail see, for example, Sections 2 and 3 of,

Frequentist Inference: Deterministic functions of the observed data that identify one or more model configurations. When well-engineered these model configurations may be related to model-consistency with the observed data. Otherwise known as estimators.

Frequentist Calibration: The worst case performance of a frequentist estimator over possible observations within the context of a given model. More formally the worst case expected loss of a given model-based loss function.

Bayesian Inference: A probability distribution, as summarized through posterior expectation values, that weights the model configurations by how consistent they are with both the observed data, as defined by the observational model, and domain expertise, as defined in part by the prior model. Otherwise known as a posterior distribution and posterior expectation values.

Bayesian Calibration: A distribution of possible posterior distribution behaviors over possible observations within the context of a given model. If a model-based utility function is defined then this often takes the form of the distribution of possible utility outcomes or summaries thereof such as average utilities.

Asymptotics, and calibration defined only in the asymptotic limit, only confuses the matter further!

The problem with comparing a confidence interval and a credible interval is that the former is defined with respect to its calibration (frequentist coverage is worst case loss of interval estimators under an inclusion loss function) whereas the latter is a pure inference. We can’t compare calibration to calibration because the credible interval often hasn’t been calibrated, and even if it had been, frequentist and Bayesian calibrations are fundamentally different (one is worst case over the model configurations, the other distributional over the model configurations).

I think one can communicate the different goals – calibration versus inference – in a relatively terse, straightforward way, but to go any further you either have to get superficial and abstract away important details or spend some additional time laying out those details in preparation.


This is deep — and appreciated. In the interest of having a simpler explanation for non-statisticians do you think my summary is far off?

In my opinion the challenge is that whether or not your summary is close depends on the particular background of the person with whom you are talking. To have anything relatively robust (dare I say – high worst-case coverage?!) I think that you have to step back and go through many of the details to ensure that everyone is on the same page. In any particular instance, however, much shorter abstractions can be successful once you work out the particular background of the person.

In other words I have to hedge and give the Betancourt™ answer of “it depends”.
