 # Shortest posterior intervals

By default we use central posterior intervals. For example, the central 95% interval is the (2.5%, 97.5%) quantiles.

But sometimes the central interval doesn’t seem right. This came up recently with a coronavirus testing example, where the posterior distribution for the parameter of interest was asymmetric so that the central interval is not such a reasonable summary. See Figure 1 here: http://www.stat.columbia.edu/~gelman/research/unpublished/specificity.pdf

Instead, sometimes it can make sense to use a shortest probability interval (similar to the highest posterior density interval), as discussed in the paper with Ying Liu and Tian Zheng: http://www.stat.columbia.edu/~gelman/research/published/spin.pdf

The brute force approach to computing a shortest probability interval is to compute all the intervals of specified coverage and take the shortest. For example, if you have 1000 simulation draws and you want the 95% interval for a particular quantity of interest, z, you sort the 1000 simulations of z, compute 50 intervals, (z_1, z_951), (z_2, z_952), …, (z_50, z_1000), and you take the interval that’s the shortest.

There’s one additional twist: if the quantity of interest is bounded, you want to allow the possibility of the interval to go all the way to the boundary. So if there’s a lower bound a and an upper bound b, you also consider the intervals, (a, z_950) and (z_51, b) as candidates.

Here’s R code:

spin <- function(x, lower=NULL, upper=NULL, conf=0.95){
x <- sort(as.vector(x))
if (!is.null(lower)) {
if (lower > min(x)) stop("lower bound is not lower than all the data")
else x <- c(lower, x)
}
if (!is.null(upper)) {
if (upper < max(x)) stop("upper bound is not higher than all the data")
else x <- c(x, upper)
}
n <- length(x)
gap <- round(conf*n)
width <- x[(gap+1):n] - x[1:(n-gap)]
index <- min(which(width==min(width)))
x[c(index, index + gap)]
}


This particular code is kinda hacky in that it does not account for fencepost errors, but it conveys the general idea.

It works! But there is a concern that the procedure has unnecessary Monte Carlo error, in that we’re computing the minimum width of these empirical intervals. It seems like some smoothing should improve things.

In the above-linked Liu et al. paper, we present a pretty complicated procedure to compute a shortest probability interval efficiently. It works well in the paper, but I don’t really trust it in general. I think we should have something more transparent and reliable.

The short version of this post is that it would be good to have the shortest posterior interval as an option, not for Stan’s automatic output–I think by default the central interval is fine–but as an available option when users want it because it does come up.

The longer version is that I think there’s a research opportunity to come up with a more efficient shortest posterior interval computation that’s also clean and robust.

7 Likes

I know how to do this. It would be a nice student project of putting together couple existing pieces of code and making the experiments to show it’s robust and has smaller variability than the naive implementation. I can help if someone has time to work on this.

4 Likes

Sounds good!

This discussion should be in the algorithms category.

I still just don’t see this. I think @andrewgelman wants SPIn summaries because the hacked versions he suggests might include zero. But what makes that a better summary than any other interval? Here, it just seems to be a desire to include zero.

But requiring zero the way Andrew suggests by just tacking on a lower and upper bound seems to present two problems. One, not every distribution is consistent with the boundary. Secondly, adding two more elements biases the results since now our interval has two additional non-sampled points.

I wrote back to Andrew personally about this before I knew it was a Discourse post (I don’t usually read the General category).

I don’t see how this algorithm could be made more efficient than \mathcal{O}(N \cdot \log N), which you get from sorting up front and is something Stan’s already doing to compute quantiles. In some cases, the intervals might be wide enough that we can reduce sorting time by only extracting the top (1 - p) and bottom (1 - p) for a p interval, the algorithm’s still going to be \mathcal{O}(N \cdot \log N). But if you need multiples, it’s still going to be more practical to sort. And a lot easier. Finding the widest interval after the sort is linear, so that doesn’t even play into the overall complexity.

1 Like

Done.

Is this the same as the highest density interval? Isn’t the HDI equivalent (assuming unimodality) to a shortest-length interval containing X% of the mass?

If so, R already has a package for it (HDInterval), and it’s trivial to use with Stan already. It supports a ton of sample formats, and a method for stan-fits would be dead-simple to implement (e.g., I made a method to support logspline output, because it has a function generic; it would be equally easy to just make a hdi.stanfit method).

1 Like

If the distribution is unimodal, HPD and SPI are the same. If there are multiple modes, HPD can be more than one interval. I looked up the HDInterval function that you mention, but it is not quite what we want because it does not allow for user-specified boundaries of the distribution. In the cases where I want to do this interval, the parameters are bounded, and a key reason for doing the SPI is because the interval might go all the way to the boundary.

Bob:

1. I only want the interval to go to 0 if that is the shortest posterior interval.

2. You write, “not every distribution is consistent with the boundary.” I would use these boundaries only for random variables that are bounded. Fortunately, for Stan, we know the boundaries of all the parameters. We can’t automatically get boundaries for arbitrary qois, but we can get boundaries for model parameters.

3. Yes, the SPI is parametierization dependent. This is related to a point that came up elsewhere on this thread (or maybe in the blog or in some email), which is that there is no Bayesian justification for interval estimation, period. Nonetheless, we often find interval estimates to be helpful. There just happen to be cases where I think the SPI is more helpful than the central interval.

4. When I sopke about making the algorithm “more efficient,” I did not mean “computationally efficient.” I meant “statistically efficient,” i.e. getting a less noisy estimate by using the simulation draws more efficiently. It’s funny–as a statistician, when I say “efficient,” I’m by default thinking of statistical efficiency. But you’re more in the computing world, so you think of computational efficiency!

I’ll take a look at this in detail soon. Good to know we’re looking for statistical efficiency and not computational efficiency.

I’m saying there are boundaries that are not consistent with zero. For instance,

parameters {
real<lower = 0> sigma;
model {
sigma ~ lognormal(0, 1);  // lognormal is not consistent with 0
sigma ~ normal(0, 1);  // half normal is consistent with 0


This means that your approach can return intervals the ends of which are not in support. Is the idea that they’re all open intervals? I get confused because everything’s essentially a closed interval with floating-point rounding and the finite number of possible values.

How can you know? You never sample 0 so you don’t know how much bias gets introduced if you throw it into the mix. So let’s say I have a beta(2, 500) distribution to take an extreme example. Let’s take 20 draws and sort:

> y <- rbeta(20, 2, 500)
> sort(y)
 0.00049 0.00078 0.00081 0.00107 0.00263 0.00328 0.00337 0.00340 0.00345
 0.00426 0.00441 0.00458 0.00493 0.00527 0.00549 0.00671 0.00776 0.00782
 0.01163 0.01317


If you add 0 and 1, then your function will return (0, something). But 0 isn’t in the support of this distribution and the true shortest posterior interval here does not include 0.

So why not just round? (0.00049, 0.00776) rounds to (0.000, 0.008)), which while introducing a lot of relative error in the size of the interval (which is smaller than 0.008), is more conventional.

This seems wholly dependent on wanting to say the result is consistent with zero. Can’t we just get at this directly without indirectly going through SPIn.

If we think about intervals probabilistically, then we have an expectation of a variable falling in an interval, which should be OK.

That’s why adjectives are so important and other people’s fields are so hard to break into. (I don’t have a solution—learning a field is in large part learning the lingo.)

1 Like

If sigma has a lognormal prior, then I don’t think it’s shortest 95% posterior interval would include zero. But if it did, from computation, that would be ok at the resolution of whatever simulations we have.

@andrewgelman If this is the aim, is it possible that specification of a Region Of Practical Equivalence (ROPE) might achieve the same end without opening the can of worms that is density interval estimation? That is, define a maximum value greater than zero that you still consider practically equivalent to zero, express this as the end of a range that is bounded on the other side by zero, and compute the % of samples that fall in this range. Decision rules for shorthand expression can then be applied such that if that % is very small (say <5%), then zero-or-practically-zero would seem incredible, and if that % is very large (say >95%) , then values that are practically-greater-than-zero would seem incredible, and finally if somewhere in between we can simply say that the data do not yet inform on the relative credibility of practically-zero and greater-than-practically-zero.

No. I just want to summarize the posterior distribution. From a Bayesian standpoint, there is no justification for any interval estimate. The central probability interval has certain properties that the shortest probability interval doesn’t have, and vice-versa, but neither is the answer to any decision problem.

In this case, I am not concerned with practical equivalence. I just want to summarize the posterior in a way that makes some sense. We have more discussion of the general point in this paper: http://www.stat.columbia.edu/~gelman/research/published/spin.pdf
I don’t like the computational method in this paper–it’s so complicated that I don’t trust it–but we lay out the issues.

More generally, your comment is getting at the idea that there are multiple purposes of interval estimation. One goal is an uncertainty interval: a range where we think the true parameter might be (conditional on the model). Another goal is a compatibility interval: a range of parameter values that are consistent with the data and model. These two things are not necessarily the same! There’s some discussion of these issues here: http://www.stat.columbia.edu/~gelman/research/published/uncertainty_intervals.pdf

So I think the whole area of interval estimation is murky. But, in the meantime, I think there are sometimes good reasons for computing shortest posterior intervals, just as there are soimetimes good reasons for computing central probability intervals. So when we do this, I’d like it to make efficient use of the simulations that we have.

1 Like

Would you care to expand a bit more on this?

You can take the entire posterior distribution, or you can use the posterior distribution to make decisions, integrating a utility function over the posterior. But there’s no decision question for which the posterior 90% interval (of any sort) is the answer. This is not to say that posterior intervals are useless. It’s just to say that they are summaries to aid understanding, in the same way that graphs are summaries to aid understanding. Some graphs can be better than others, but it will depend on context. Similarly, there is no right or wrong way to define a posterior interval; what to do will depend on your communication goals.

2 Likes

I think the practical justification of including 0 is so that the researcher can say the model doesn’t exclude 0.

Another way of saying this is that they’re all the same in a well calibrated model. Every 90% interval has roughly 90% of the probability mass. Which you choose to use as a single summary is a matter of aesthetics, part of which I would argue is convention. Had Andrew suggested the leftmost interval, everyone would’ve chuckled, but by choosing SPIn and defining it by tacking on a zero, he now has something that reports an interval with 0 as a lower boundary. I think everything’s working backwards from that goal of reporting (0, x) as an interval.

But I think there is a justification for excluding a 0 density endpoint.

For example, here are 100 draws from a standard lognormal:

> print(sort(exp(rnorm(1e2, 0, 1))), digits = 1)
 0.08 0.09 0.12 0.12 0.14 0.18 0.19 0.19 0.20 0.23 0.24 0.24 0.24 0.25 0.25
 0.27 0.29 0.30 0.30 0.31 0.31 0.31 0.32 0.34 0.34 0.37 0.37 0.38 0.38 0.39
 0.44 0.44 0.45 0.49 0.50 0.51 0.52 0.59 0.59 0.61 0.65 0.68 0.70 0.72 0.73
 0.73 0.75 0.76 0.77 0.78 0.79 0.80 0.87 0.93 0.94 1.11 1.13 1.15 1.17 1.20
 1.23 1.32 1.36 1.39 1.40 1.40 1.44 1.45 1.46 1.48 1.58 1.63 1.69 1.75 1.78
 1.87 1.97 2.12 2.16 2.19 2.22 2.23 2.29 2.36 2.38 2.79 2.86 3.21 3.24 3.33
 3.57 3.86 3.94 4.14 4.81 5.05 6.04 6.19 7.80 9.28


It’s clear this will include 0 if you calculate SPIn by adding 0 to the above sequence and taking shortest intervals. The central interval will be (0.12, 6.19) and the SPIn will be (0, 5.05). Given the strong right skew of the lognormal, neither seem like a good summary.

@andrewgelman—If you’re OK with rounding, why not just round the central interval?

My original motivation was the tau parameter in the 8 schools model. There, the central interval excludes zero–it has to!–but I wanted zero in the interval because zero is well supported by the data and model.

This is related to Sander Greenland’s idea of “compatability intervals” rather than “uncertainty intervals”: http://www.stat.columbia.edu/~gelman/research/published/uncertainty_intervals.pdf
In the 8 schools example, zero is fully compatible with the data and model, so I want to include it in the interval.

This is very much related to HPD reasoning. But I still think it’s hard to formalize the reasoning. It’s also hard to come up with a formal justification for a central 95% interval. Yes, the central interval is invariant to parameterization, but that’s not really a justification for the interval. It’s a pleasant property of the interval, once we’ve decided to use it, but it’s not a reason to use it in the first place.

2 Likes

Ah, yes, I’ve thought of this in precisely the same context; in meta-analysis it can be useful to ask to what degree precisely-zero is a credible hypothesis for the magnitude of between-study heterogeneity (i.e. variance in means is explainable by measurement noise alone). I’ve had some data sets where it clearly is not and the posterior shifts dramatically away from zero, while in others the posterior shrinks toward zero but of course the standard posterior intervals don’t capture zero itself. So far I’ve just presented plots of the posterior and in text argued that zero-or-nearly-zero seem relatively credible. This is the one time where I was tempted to look at a Bayes Factor (though fleetingly).

1 Like

At this point, why not just show the posterior histogram?

I don’t want to launder default intervals through a SPIn cycle to report an interval with zero in it because I want to make the point that the data are consistent with zero.

1 Like

Sure, post histograms are fine too. I think these come for free in bayesplot.

The point is that we (and others) are already displaying posterior intervals. 30 years ago, if you were a Bayesian and you wanted to report a posterior interval, it would be HPD. That was the standard. Then, under the influence of BDA among other things, the default changed. Now if you are a Bayesian and you want to report a posteiror intervals, by default it’s the central interval. In general, neither the central nor the HPD is “correct.” Both are “putting default intervals through a spin cycle.” If we want to report intervals at all, we have a choice of how to report them.

1 Like