In the case study, he uses "tail k_hat" diagnostics to verify that the effective sample sizes for the Cauchy parameters aren't well-defined. The only time I've ever seen k_hat mentioned in this context is in Pareto smoothed importance sampling (PSIS), as in this paper: https://arxiv.org/pdf/1507.02646.pdf

As far as I can tell there's no importance sampling of any kind in the case study, but I assume the principle behind the use of k_hat is roughly the same: we use Generalized Pareto distributions to approximate (the tails of) the posterior, and the value of the shape parameter provides an estimate of how many posterior moments exist. Is that correct?

If so, wouldn't we expect the k_hats to be above 1 for the Cauchy distribution? Presumably (by analogy with the discussion in the PSIS paper), k_hats between 0.5 and 1 imply infinite variances but finite means. Similarly, I tried fitting a student_t(2, 0, 1) and got no k_hat warnings, even though the second moment of this distribution is certainly infinite. Am I missing something in the interpretation here?
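To make the question concrete, here's roughly what I understand the diagnostic to do, sketched with scipy's generic GPD maximum-likelihood fit (not the case study's actual code, and the 99% cutpoint is just my guess at a "deep enough" threshold):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
draws = rng.standard_cauchy(100_000)  # stand-in for posterior draws

# Keep only the exceedances over a high quantile as the "tail".
threshold = np.quantile(draws, 0.99)
tail = draws[draws > threshold] - threshold

# Fit a Generalized Pareto distribution to the exceedances;
# the shape parameter c plays the role of khat, and 1/c is
# roughly the number of finite moments.
khat, _, _ = stats.genpareto.fit(tail, floc=0)
print(f"khat ~ {khat:.2f}")  # for a Cauchy tail, theory says khat should be near 1
```

With a cutpoint this deep, the fitted shape does come out near 1 for a Cauchy, which is what made the case study's values surprising to me.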

Yes, this use case of estimating the number of existing moments was inspired by that paper.

Yes.

Mike is using quite a large proportion of the draws, and thus including a considerable part of the bulk, so the distribution is not GPD and the estimate of the number of existing moments is biased. Theory says that if the cutpoint is far enough in the tail, then the tail can be approximated with a Generalized Pareto distribution. The problem is that if we push the threshold far into the tail, we have fewer draws left to fit the GPD and the variance of the estimate increases. Mike decided to favor lower variance and higher bias. If you move the threshold further out, you will get the results you would expect in expectation, but with high variance. Further experiments by Mike and me indicate that this can be useful as an additional diagnostic when focusing on some parameter distributions (as then you can spend more effort finding where the bulk ends and the GPD-like tail starts), but it's unlikely to be useful to compute it automatically for every parameter in the model. This is a bit different from PSIS, as the behavior there is clearer near khat 0.7, and the decision task and possible actions are different.
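The variance side of this trade-off is easy to see in a small simulation (a sketch using scipy's generic GPD MLE, not the fit used in the case study; the cutpoints and replication count are just illustrative):

```python
import numpy as np
from scipy import stats

def khat(draws, q):
    """Fit a GPD to the exceedances above the q-quantile; return the shape."""
    u = np.quantile(draws, q)
    tail = draws[draws > u] - u
    c, _, _ = stats.genpareto.fit(tail, floc=0)
    return c

rng = np.random.default_rng(0)
shallow, deep = [], []
for _ in range(20):
    draws = rng.standard_cauchy(50_000)
    shallow.append(khat(draws, 0.99))   # 500 tail draws: more bias, less variance
    deep.append(khat(draws, 0.999))     # 50 tail draws: less bias, more variance

print("sd of khat, 99%   cutpoint:", np.std(shallow))
print("sd of khat, 99.9% cutpoint:", np.std(deep))
```

The deeper cutpoint leaves an order of magnitude fewer exceedances, so the spread of khat across replications is correspondingly larger.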

What @avehtari said. The exact thresholds, and the corresponding trade-offs, for identifying random variables with ill-defined means and/or variances are still being researched.

Thanks guys. So the lower cutpoint in the case study biases the k_hats down, then?

To what extent would you recommend using the case study method as an overall diagnostic? I’m doing some smoothing spline stuff right now and the n_eff for many parameters appears quite sensitive to priors and/or parameterizations, but I want to be sure I can trust the comparisons I make between formulations.

If the random variables you're considering (i.e. the variables defined in the parameters, transformed parameters, or generated quantities blocks) have distributions where khat is close to one, then I would be suspicious regardless. A small khat indicates large variances, so even if the variance is technically finite, the MCMC estimators will be extremely noisy even with a reasonable effective sample size (remember that the MCMC standard error is sqrt(variance / ess)). At the very least you'd want to consider rescaling your model or identifying stronger priors that regularize the diffuseness of the posterior.
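For instance, plugging some made-up numbers into that standard-error formula (the posterior sds and ESS here are hypothetical, just to show the scaling):

```python
import math

def mcse(posterior_sd, ess):
    """Monte Carlo standard error of a posterior-mean estimate:
    sqrt(variance / ess) = sd / sqrt(ess)."""
    return posterior_sd / math.sqrt(ess)

# Same effective sample size, very different estimator precision
# once the posterior is diffuse:
print(mcse(1.0, 4000))    # about 0.016
print(mcse(100.0, 4000))  # about 1.58
```

A "reasonable" ESS of 4000 buys you very little when the posterior standard deviation is itself enormous, which is why heavy or diffuse tails are worrying even before the variance becomes formally infinite.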

Yes, sorry. Large \hat{k} <-> small degrees of freedom for a Student-t density. I put the degrees of freedom on the x-axis of my plots that attempt to calibrate the threshold, which is why I keep thinking "small is bad"!