K_hat diagnostic in "Fitting the Cauchy distribution" case study

ShaunMcD · February 7, 2019, 1:45am

I was looking through the excellent “Fitting the Cauchy distribution” case study by @betanalpha (https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html), and I had a conceptual question I was hoping somebody could clear up for me.

In the case study, he uses “tail k_hat” diagnostics to verify that the effective sample sizes for the Cauchy parameters aren’t well-defined. The only time I’ve ever seen k_hat mentioned in this context is in Pareto-smoothed Importance Sampling, as in this paper: https://arxiv.org/pdf/1507.02646.pdf

As far as I can tell there’s no importance sampling of any kind in the case study, but I assume the principle behind the use of k_hat is roughly the same: we use Generalized Pareto distributions to approximate (the tails of) the posterior, and the values of the shape parameters provide estimates on the number of posterior moments that exist. Is that correct?

If so, wouldn’t we expect the k_hat’s to be above 1 for the Cauchy distribution? Presumably (by analogy with the discussion in the PSIS paper), k_hats between 0.5 and 1 imply infinite variances but finite means. Similarly, I tried fitting a student_t(2, 0, 1) and got no k_hat warnings - even though the second moment of this distribution is certainly infinite. Am I missing something in the interpretation here?

avehtari · February 8, 2019, 12:17pm

Yes, this use case of estimating the number of existing moments was inspired by that paper.

Yes.

Mike is using quite a large proportion of the draws and thus including a considerable part of the bulk, so the distribution is not GPD and the estimate for the number of existing moments is biased. Theory says that if the cutpoint is far enough in the tail, then the tail can be approximated with a Generalized Pareto distribution (GPD). The problem is that if we put the threshold far in the tail, we get fewer draws to use to fit the GPD and the variance of the estimate increases. Mike decided to favor lower variance and higher bias. If you move the threshold further out you will get the results you would expect in expectation, but with high variance. Further experiments by Mike and me indicate that this can be useful as an additional diagnostic when focusing on some parameter distributions (as then you put more effort into finding where the bulk ends and a tail similar to a GPD starts), but it’s unlikely to be useful to compute it automatically for every parameter in the model. This is a bit different in PSIS, as the behavior is clearer near khat 0.7, and the decision task and possible actions are different.
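The threshold trade-off described above can be explored with a small experiment. The following is a minimal sketch, not the case study's code: it draws from a standard Cauchy, takes the upper tail at several fractions, and fits a GPD shape to the exceedances with a Zhang and Stephens style profile estimator (similar in spirit to the one used in PSIS, but without any prior regularization of k). The function names `gpd_khat` and `upper_tail_khat`, the seed, the number of draws, and the tail fractions are all assumptions chosen for illustration.

```python
import math
import random


def gpd_khat(exceedances):
    """Profile estimate of the GPD shape k from positive exceedances.

    Zhang and Stephens style: average theta = -k / sigma over a grid,
    weighted by the profile likelihood, then plug in. k > 0 means a heavy tail.
    """
    y = sorted(exceedances)
    n = len(y)
    if n < 5 or y[0] <= 0.0:
        raise ValueError("need at least 5 strictly positive exceedances")

    m = 30 + int(math.sqrt(n))          # number of grid points
    y_star = y[int(n / 4 + 0.5) - 1]    # first quartile of the exceedances
    thetas = [
        1.0 / y[-1] + (1.0 - math.sqrt(m / (j - 0.5))) / (3.0 * y_star)
        for j in range(1, m + 1)
    ]

    def shape(theta):
        # Conditional MLE of k for a fixed theta.
        return sum(math.log1p(-theta * v) for v in y) / n

    loglik = []
    for theta in thetas:
        k = shape(theta)
        loglik.append(n * (math.log(-theta / k) - k - 1.0))

    top = max(loglik)                   # subtract the max to avoid overflow
    weights = [math.exp(l - top) for l in loglik]
    theta_hat = sum(t * w for t, w in zip(thetas, weights)) / sum(weights)
    return shape(theta_hat)


def upper_tail_khat(draws, tail_fraction):
    """Fit the GPD shape to the largest `tail_fraction` of the draws."""
    x = sorted(draws)
    cut = int(round(len(x) * (1.0 - tail_fraction)))
    threshold = x[cut - 1]
    return gpd_khat([v - threshold for v in x[cut:] if v > threshold])


rng = random.Random(2019)               # illustrative seed
draws = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20000)]

for frac in (0.4, 0.1, 0.02):
    k = upper_tail_khat(draws, frac)
    print(f"tail fraction {frac:.2f}: khat = {k:.2f}")
```

Rerunning with different seeds gives a feel for how much the far-tail estimate moves around compared with the estimates that include more of the bulk.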

betanalpha · February 8, 2019, 5:32pm

What @avehtari said. The exact thresholds, and corresponding trade-offs, for identifying random variables with ill-defined means and/or variances are still being researched.

ShaunMcD · February 8, 2019, 6:44pm

Thanks guys. So the lower cutpoint in the case study biases the k_hat’s down then?

To what extent would you recommend using the case study method as an overall diagnostic? I’m doing some smoothing spline stuff right now and the n_eff for many parameters appears quite sensitive to priors and/or parameterizations, but I want to be sure I can trust the comparisons I make between formulations.

betanalpha · February 8, 2019, 8:46pm

If the random variables you’re considering (i.e. the variables defined in the parameters, transformed parameters or generated quantities blocks) have distributions where khat is close to one then I would be suspicious regardless. A small khat indicates large variances so even if the variance is technically finite the MCMC estimators will be extremely noisy even with reasonable effective sample size (remember that the MCMC standard error is sqrt{ variance / ess }). At the very least you’d want to consider rescaling your model or identifying stronger priors that regularize the posterior diffusiveness.
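To make the standard error point concrete, here is a minimal sketch that holds the effective sample size fixed and lets the variance grow. It uses a unit-scale Student-t purely as an illustration, relying on its variance being nu / (nu - 2) for nu > 2; the helper names `student_t_variance` and `mcse` and the value ess = 1000 are assumptions, not anything taken from the case study.

```python
import math


def student_t_variance(nu):
    """Variance of a unit-scale Student-t with nu degrees of freedom."""
    if nu <= 1.0:
        return math.nan      # the mean itself is undefined
    if nu <= 2.0:
        return math.inf      # mean exists, variance does not
    return nu / (nu - 2.0)


def mcse(variance, ess):
    """MCMC standard error of a mean estimate: sqrt(variance / ess)."""
    return math.sqrt(variance / ess)


ess = 1000                   # illustrative effective sample size
for nu in (10.0, 3.0, 2.1, 2.01, 2.0):
    var = student_t_variance(nu)
    print(f"nu = {nu:5.2f}  variance = {var:8.1f}  mcse = {mcse(var, ess):.3f}")
```

Even with the same effective sample size, the standard error grows without bound as the degrees of freedom approach 2, which is why a technically finite variance can still give very noisy estimators.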

ShaunMcD · February 8, 2019, 9:13pm

Thanks Michael. To clarify:

A small khat indicates large variances

I thought large k_hat values were indicative of problems?

betanalpha · February 8, 2019, 9:37pm

Yes, sorry. Large \hat{k} <-> small degrees of freedom for a Student-t density. I put the degrees of freedom on the x-axis of my plots that attempt to calibrate the threshold, which is why I keep thinking “small is bad”!
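For reference, the correspondence can be written out. This is a sketch of the idealized relationship, assuming the threshold is far enough in the tail for the GPD approximation to hold; the symbols $k$ (GPD shape), $\nu$ (Student-t degrees of freedom) and $m$ (moment order) are notation introduced here. A finite-sample \hat{k} computed with a threshold that includes part of the bulk need not match these values.

$$
k = \frac{1}{\nu}, \qquad \mathbb{E}\,|X|^{m} < \infty \iff m < \frac{1}{k} = \nu .
$$

So a Cauchy ($\nu = 1$) has $k = 1$ and no finite mean, while student_t(2, 0, 1) has $k = 0.5$, which sits right at the boundary where the variance stops being finite.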

ShaunMcD · February 8, 2019, 10:44pm

Right, thanks for the clarification!
