In general, neither the central nor the HPD is “correct.” Both are “putting default intervals through a spin cycle.” If we want to report intervals at all, we have a choice of how to report them.
What I’m taking away from this discussion is that the proposal isn’t a different way to report posterior samples, the proposal is that a change be made to the samples before we calculate an existing summary statistic. In which case you’re not comparing like things here. What I understand your argument to be is closer to implying something like ‘we have a choice of what data points to add to the posterior samples.’
It seems like the examples that have come up are why we have model comparison methods. If this is a method for testing something, then there must be some formal justification. The initial post presents this as a software development question; wouldn’t it be better posed as a research question first?
The posterior draws are the same. The question is how to summarize them. Posterior mean, posterior median, posterior sd, posterior quantiles, central posterior interval, shortest posterior interval . . . all of these are possible. For unimodal symmetric bell-shaped distributions, it doesn’t really matter much what you report. When there are long or short tails or skewness, even then it usually doesn’t make much of a difference. But sometimes it can make a little bit of a difference.
What summaries are useful is a statistics research question. But it’s also a software development question of how to implement these summaries in a way that’s efficient in the sense of minimizing Monte Carlo error.
I don’t think there will ever be a clear formal justification for reporting one sort of posterior interval or another. I don’t think any posterior interval is the answer to any decision problem. Here I’m not thinking about model comparison or testing, I’m thinking about communication, about summarizing models that we have fit. It’s hard to formalize communication goals in a mathematical way, but communication is important.
Similarly, when to use Bayesian inference and what priors and likelihoods to use is a statistics research question. But how to get posterior draws in Stan is a software question. It can be hard to separate these.
I have been thinking about some of these issues for a while and here are some additional thoughts.
First I agree that the full posterior is definitely the best and do not suggest replacing it with a single interval. That said, I am not as good at interpreting areas under distribution curves as others I have seen who can tell where the median is from a glance. I would assume that there are others out there like me. Even for people better than me at estimating the areas, consider comparing histograms of draws from gamma(1,1) and gamma(1.1, 1) distributions. Could even the best visual estimators tell the difference based on the leftmost bar in the histogram? Yet it the first case the value of 0 is the mode and in the second place the height of the distribution is 0 at 0.
I find having some intervals along with the posterior distribution helps my understanding. There are also some cases where we may need to convey some quick information (with the plan of further exploring the full posterior later) and an interval may be appropriate to start when a full graph cannot be used (text only medium, explaining over the phone, summaries for someone who is blind, etc.).
The central interval will always include the median of the posterior. The HPD interval will always include the mode. The only interval that I can think of that would always include the mean would be a Wald style based on estimating the standard error from the posterior, but if the posterior is skewed enough that the mean is not in the central interval, then this interval would probably be the worst, so I will not consider it further.
For a central interval, the probability that the parameter is outside the interval on the left is the same as it being outside the interval on the right. For the HPD interval the posterior probability of the parameter being equal to the leftmost bound (or in a small region around that bound) is equal to it being equal to the rightmost bound (or almost the same being in the small region around it).
Both have nice properties, do we really need to decide which is “Better”? or can we just use both?
One place that I do see a difference is if we want to look at a confidence region for 2 or more parameters simultaneously, e.g. could these 2 regression parameters both be negative or 0 at the same time? Here an HPD region (or better regions) makes sense. I am not sure what a meaningful central interval would be in this case.
But only one’s putting it through the SPIn cycle, buh-dump-dump. Maybe the joke wasn’t obvious enough.
I get that intervals have the same kind of arbitrariness as point estimates. The only difference is that I don’t recall ever seeing a decision-theoretic criterion used to choose them.
It’s not a method for testing things. It’s just two different ways to define an interval that’s expected to have 95% coverage—the shortest interval covering 95% of the probability mass or the ones defined by the (2.5%,. 97.5%) quantiles. Both should have 95% coverage.
They’re different summaries of the posterior, much like the mean or median. The posterior mean and median of parameters have the nice properties of being the point estimates that minimize expected square error and absolute error respectively. I don’t know of any decision-theoretic justifications for different posterior intervals.
Actually, it is. The issue is whether we should report shortest posterior intervals (and if so by means of what estimator/algorithm) or if we should stick to reporting arbitrary quantiles. As is, our interface is very simple and users can specify quantiles. If they choose 5% and 95%, they get the bounds of the central 90% interval. There’s no way for users to get the shortest posterior 95% interval (the SPIn). For SPIn, they’d have to specify coverages. And sometimes it wouldn’t work, as in uniform distributions for which the SPIn is not defined.
We’ve been having the same discussions around reporting posterior means vs. medians. There’s no right answer, just different properties as summaries.
I think there’s some miscommunication on my part—I didn’t intend to respond to the spin method at all, just this specific recommendation which I took to be the only point of contention here:
if the quantity of interest is bounded, you want to allow the possibility of the interval to go all the way to the boundary. So if there’s a lower bound a and an upper bound b, you also consider the intervals, (a, z_950) and (z_51, b) as candidates.
@Bob_Carpenter I thought you raised some interesting objections to this idea, and I chimed in because I still hadn’t seen any response to them. So the point by @andrewgelman that I quoted, as far as I could tell, was a response to your criticisms, but one that I thought didn’t add up. Maybe I’d missed the boat already…
I was confused because SPIn’s the method @andrewgelman was using to get 0 in the interval.
I was mainly objecting to the slight bias and arbitrariness of locating an interval of a given coverage (same problem with central intervals).
I was not actually using the Spin method. The Spin method was in that earlier paper of mine, but I don’t really trust the Spin method as it’s so complicated so I just used the empirical shortest interval. I still think there should be a good stable alternative.
The cran published SPIn() function is quite fragile and easy to be broken. I am comparing it with path sampling estimate today and use SPIn() as a benchmark-- but it is often broken and returns NAN.
# add static points to mimic metropolis hastings
x= c(x, x[1:30])
For example, Bernardo (2005ab) has proposed selecting credible regions as a lowest posterior loss (LPL) regions, where all points in the region have smaller posterior expected loss than all points outside. I bet someone else has also written about that. (I also wrote a paper on extending ideas of Bernardo, but it was before arxiv, and after it was panned by reviewers I never got enough energy to revise it. Everytime I look it, I think I should put it in arxiv. Maybe this time I’ll do it.)
Bernardo, J. M. (2005a). Intrinsic credible regions: An objective bayesian approach to interval estimation. Test, 14(2). 317–384.
Bernardo, J. M. (2005b). Reference analysis. In Dey, D. and Rao, C. R., editors, Handbook of Statistics, volume 25. Elsevier. 17–90.
Sure, but I don’t consider those to be serious decision-theoretic criteria. To me, the Bernardo papers seem like retroactive reasoning: people use interval estimation so let’s come up with a justification for it. I think the missing part here is that there is a goal of communication, which isn’t really part of decision theory.
That’s exactly what I don’t like about the whole business of applying SPIn when we want to include zero in an interval.
I agree, which is why I think using SPIN is confusing. It may have been less confusing when SPIN was standard, but now it just needs all the extra explanation of “we tacked endpoints onto empirical draws and used empirical SPIN so our summary includes 0”.
Rather than computing SPIN with SPIn, you used an empirical method for computing SPIN. Naming a method “SPIn” that computes SPIN was perhaps not the clearest choice. I’ll stick to “SPIN” for the theoretical notion of shortest posterior interval rather than its estimator.
I thought SPIn just bootstrapped the sample and averaged the empirical interval boundaries (what the ML folks would call a “bagged” estimator for boostrap aggregation). If that’s the intended estimator, it should only be as broken as empirical SPIN.
SPIN itself isn’t well defined for uniform distributions. And they crop up inadvertently at the software level when doing what @yuling did to generate data. If we generate a sequence as @yuling proposes, by duplicating normal draws, we run into the uniformity problem whenever taking odd-sized SPINs
Suppose we get the following after sorting.
-0.60 -0.60 -0.53 -0.53 -0.47 -0.47 -0.46 -0.46 0.35 0.35
The empirical 30% SPIN here is (-0.47, -0.46), but it can be derived from subsequence (-0.47, -0.47, -0.46) or (-0.47, -0.46, -0.46). So the software needs to be careful.
I recommend we not talk about “SPIn” anymore, as this is a particular algorithm that appeared in one paper, and I don’t think the algorithm is so good, which is why I didn’t use it in our recent paper! And I say that even though I’m a coauthor of that SPIn paper.