Using (autocorrelated?) posterior samples in calculations

I’ve fit a nonlinear model to data that measure how several species are affected by competition with one another. The model converges, and the diagnostics—\hat{R}, ESS, posterior predictive checks—all look good. I now want to calculate two quantities of interest for each pair of species using the parameters estimated by the model. Say, for the sake of simplicity, that my model has two parameters, \alpha_j and \beta_j, estimated for each species j. The quantities are calculated on each species pair j,k using simple arithmetic operations such as \beta_j - \sqrt{\frac{\beta_k - \alpha_k}{\alpha_j}}.

The obvious first pass is to use the estimated population level effects (i.e., the central tendency of each parameter’s posterior distribution), but of course the value of having an estimate of the full posterior distribution is that I can propagate the uncertainty associated with each of the parameters to understand its impact on the quantities of interest.
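A minimal sketch of that propagation in Python (using numpy; the draws, means, and species pair here are simulated stand-ins, not output from the actual model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins for posterior draws of the two parameters for a
# hypothetical species pair (j, k); one value per posterior draw.
n_draws = 4000
alpha_j = rng.normal(2.0, 0.1, n_draws)
alpha_k = rng.normal(1.5, 0.1, n_draws)
beta_j = rng.normal(3.0, 0.2, n_draws)
beta_k = rng.normal(2.5, 0.2, n_draws)

# Evaluate the quantity of interest draw by draw, so the joint posterior
# uncertainty in the parameters propagates into the derived quantity.
q = beta_j - np.sqrt((beta_k - alpha_k) / alpha_j)

# Summaries then come from the full collection of per-draw values.
q_mean = q.mean()
q_lo, q_hi = np.quantile(q, [0.05, 0.95])
```

The key point is that the arithmetic is applied per draw, not to posterior means, so correlations between the parameters are preserved in the derived quantity.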

I understand that Stan’s posterior samples are autocorrelated, and I’m concerned about how autocorrelation in the posterior samples might bias the calculation of the quantities of interest. Suppose I want to calculate these quantities using a subset, i, of the total number of posterior samples, N.

I can imagine several different approaches for selecting this subset, i:

  1. Use all posterior samples (i = N). The disadvantage here is that any visualization of the quantities of interest would have substantial overplotting, compounded by the need to expand the axis ranges to encompass rare values far from the central tendency. The upshot is a figure that’s difficult to parse.
  2. Use sequential samples, N_1,N_2,...,N_i. Depending on how many post-warm-up iterations were done on each chain, it would be possible that all samples come from the same chain.
  3. Draw i samples from N with equal probability across all chains.
  4. Stratify sampling by drawing every n-th sample. Effectively this would be a thinning approach. n could be chosen to space samples as far apart from one another as possible, or could be drawn from a uniform random distribution with lower & upper bounds chosen to prevent sequential or near sequential samples from being drawn.
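For concreteness, here is how the four subsetting schemes might look on a flattened draws array (shapes and values are dummies, assuming draws are stored as chains × iterations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy draws stored as (chains, iterations); flattening puts all of
# chain 0 first, then chain 1, and so on.
n_chains, n_iter = 4, 1000
draws = rng.normal(size=(n_chains, n_iter))
flat = draws.reshape(-1)
i = 200  # size of the subset

subset_1 = flat                                     # 1. all N draws
subset_2 = flat[:i]                                 # 2. sequential (here all from chain 0)
subset_3 = rng.choice(flat, size=i, replace=False)  # 3. uniform random across chains
subset_4 = flat[:: flat.size // i]                  # 4. every n-th draw (thinning)
```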

From a conceptual perspective, is there any reason to suspect that these different approaches to subsetting the posterior draws would give different answers when the draws were used in further calculations? If yes, which (if any) would be the most robust strategy? Is there a principled way to use model diagnostics (e.g., ESS vs. N) to guide the decision?

Under ideal conditions MCMC reports not just an estimate of the posterior expectation of a given quantity of interest but also an estimate of its standard error, which you can use to judge whether you have enough precision to say anything interesting about that expectation. Incidentally, the diagnostics check for violations of those ideal conditions (although I didn’t see anything about divergences), so if you don’t see any failures then you can have some mild confidence that the MCMC estimators are unbiased and the error is quantified by the MCMC standard error.

The differences between 1, 2, 3, and 4 don’t make much difference to the MCMC estimator itself, but they can change how the autocorrelations, and hence the effective sample size, and hence the MCMC standard error, are calculated. 1, 2, and 4 will yield self-consistent error estimates, while 3 will not. 2 and 4 needlessly throw away information and hence shouldn’t be used unless you have memory issues, and even then there are better alternatives.
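To make the ESS/standard-error link concrete, here is a toy sketch: an AR(1) series stands in for an autocorrelated chain, and a crude initial-positive-sequence estimator gives an effective sample size (this is a simplification of what Stan actually computes):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) series as a stand-in for an autocorrelated MCMC chain.
rho, n = 0.7, 5000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

# Crude ESS: sum lagged autocorrelations until they first go non-positive.
xc = x - x.mean()
acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
acf = acov / acov[0]
tau = 1.0
for k in range(1, n):
    if acf[k] <= 0:
        break
    tau += 2 * acf[k]
ess = n / tau

# The MCMC standard error of the mean scales with ESS, not with n.
mcse = x.std() / np.sqrt(ess)
```

With rho = 0.7 the integrated autocorrelation time is about (1 + rho)/(1 - rho) ≈ 5.7, so 5000 autocorrelated draws carry roughly the information of 900 independent ones; the standard error reflects that smaller number.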

The disadvantage listed for 1 isn’t really a disadvantage; it just means that you’re visualizing the wrong thing. If you don’t want to show those tails then apply a visualization technique that suppresses them (histograms with bounded bins, inner quantiles, log scales, etc.) while also clearly communicating that the tails are being suppressed. Moreover, subsampling your chains doesn’t make the tails go away; it just ignores them with higher probability, which is a fragile basis for visualization.
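One way to suppress tails without discarding draws is to clamp values only for display and report how many points were clamped (a sketch; the heavy-tailed draws here are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)

# Heavy-tailed dummy draws standing in for a pushforward quantity.
q = rng.standard_t(df=2, size=4000)

# Axis limits from inner quantiles; every draw is kept.
lo, hi = np.quantile(q, [0.005, 0.995])
n_clamped = int(np.sum((q < lo) | (q > hi)))

# Clamp only for display, and report n_clamped in the figure caption so
# the suppression of the tails is communicated explicitly.
q_display = np.clip(q, lo, hi)
```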


Thanks @betanalpha, this is very helpful. (P.S., There were no divergences.)

To summarize (and check my understanding), you’re advocating using all posterior samples (approach #1). The rationale is that the standard error of the MCMC estimator reported by the full model is only valid when using all posterior samples. Showing only a subset of the data would be misleading, in that any kind of subsetting will necessarily change the standard error. By crude analogy, one wouldn’t plot the mean & SE of a 1,000-observation dataset while only showing the first 100 observations in the background. Although the reasoning for why the subset is insufficient differs, the common upshot is that estimators & SEs only hold for the whole dataset.

I’ll use all posterior samples in my visualization. Ultimately it will be a scatterplot showing the quantities of interest in a phase space. The bounds of one quantity are (0,\infty), and it behaves well with a log transformation to compress extreme values. The other is bounded by (-\infty, 1], and I don’t know of a transformation that would work well to bring in the rare extreme negative values. I think a marginal plot that captures all of these points as a categorical x < -n, where n is chosen to maximize data visibility, should do fine.
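The lumping of extreme negatives into a marginal category might look like this (the threshold and the simulated draws are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Dummy quantity bounded above by 1, with occasional extreme negatives.
q = 1 - rng.lognormal(mean=-1.0, sigma=1.5, size=4000)

# Everything below the display threshold goes into one "x < -n" category;
# the threshold here is chosen for visibility, not taken from real data.
n_thresh = -5.0
in_range = q[q >= n_thresh]
n_marginal = int(np.sum(q < n_thresh))
```

Every draw is still accounted for: the in-range points land in the scatterplot and the count below the threshold lands in the marginal category.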

To be more formal, you can compute an expectation estimate and standard error (assuming the ideal conditions, yadda yadda yadda) for any proper subset of the Markov chains (your 2 and 4). The standard errors for any subset will just be worse than the standard errors for the full Markov chains. There’s no reason not to use all of the information that your Markov chain provides unless there are computational limitations (namely limited memory), which rarely hold these days.

The visualization issue is somewhat similar – you don’t want to ignore any information inadvertently. The formal setting for the visualization is somewhat different, however. Here you’re no longer just computing an estimate of the mean of the quantity of interest; you’re trying to visualize the entire pushforward distribution of that quantity, which is a more complex object. There’s no sense of standard error for that entire distribution, so it’s hard to be as formal, but again the basic intuition holds – use everything the Markov chain generously gives to you.

What you describe is consistent with really long tails in the distribution of that quantity of interest which is typically something very important to communicate. If for some reason you don’t care about tails then you should consider another visualization entirely — maybe ribbon plots of nested central quantiles, for example.

Yep—a few species pairs have long, slender tails. The interesting question is whether these tails spread out across multiple regions of the phase space, or if they point parallel to/away from the phase boundaries such that the qualitative outcome of biological interest won’t change regardless of how long they extend. A scatterplot with truncation marginal plots seems like the best way forward to accomplish this goal, but I’ll keep your tips for alternative visualizations in mind. Thanks for clarifying the mechanics under the Stan hood for me, and for challenging me to think more carefully/creatively about visualizing the gifts from my Markov chain.

In cases like this I’d recommend breaking your phase space up into boxes and computing probabilities. For example, if your phase space was positive and T_{x} demarcated the centrality from the tails in the x direction and T_{y} demarcated the centrality from the tails in the y direction, then you could consider four boxes:

  • central (x \le T_{x} and y \le T_{y})
  • x tail (x > T_{x} and y \le T_{y})
  • y tail (x \le T_{x} and y > T_{y})
  • xy tail (x > T_{x} and y > T_{y})
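Those four box probabilities fall out of a few masks over the joint draws. A sketch with correlated dummy draws (the 90th-percentile thresholds are an assumed choice, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(4)

# Dummy positively correlated draws in a positive phase space.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=4000)
x, y = np.exp(z[:, 0]), np.exp(z[:, 1])

# Thresholds separating centrality from the tails in each direction.
T_x, T_y = np.quantile(x, 0.9), np.quantile(y, 0.9)

# Empirical probability of each box is the fraction of draws it contains.
boxes = {
    "central": np.mean((x <= T_x) & (y <= T_y)),
    "x tail": np.mean((x > T_x) & (y <= T_y)),
    "y tail": np.mean((x <= T_x) & (y > T_y)),
    "xy tail": np.mean((x > T_x) & (y > T_y)),
}
```

Because the draws are positively correlated here, the xy-tail box ends up holding noticeably more probability than the product of the two marginal tail probabilities would suggest.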

If the xy tail box has high probability then you know you’re getting the physically important correlation.

The challenge with estimating tail probabilities like these is that you need sufficient events out in the tails, which typically means lots and lots of events in the bulk. You’d want at least 10-100 effective samples in those tail boxes to get decent estimates. Not having enough samples is also why you would see the variable visualizations mentioned above.
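A back-of-the-envelope check for the “enough events in the tails” point: a tail probability p estimated from n_eff effective draws has a Monte Carlo standard error of roughly sqrt(p(1-p)/n_eff) (the numbers below are illustrative only):

```python
import math

def tail_prob_mcse(p, n_eff):
    """Binomial-style MC standard error for an estimated tail probability."""
    return math.sqrt(p * (1.0 - p) / n_eff)

# With p = 0.02, the relative error only drops to about 30% once roughly
# 10 effective draws land in the box (n_eff * p = 10 at n_eff = 500).
p = 0.02
rel_errs = {n_eff: tail_prob_mcse(p, n_eff) / p for n_eff in (500, 5000, 50000)}
```

So a box holding 2% of the probability needs hundreds of effective draws just to pin its probability down to within a third of its value, consistent with the 10-100 effective-samples rule of thumb above.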
