Difficulties assessing convergence of Markov chains

Consider a simple normal-normal hierarchical model for meta-analysis (i.e. a “random-effects meta-analysis”). This is basically the model from the “8 schools” example, but I will use it in its marginal form (i.e. with the true study-specific effects integrated out) and with different (proper) priors:

data {
  int<lower=1> K;  // Number of studies
  vector[K] hat_theta_k;  // Study-specific effect estimates
  vector<lower=0>[K] hat_sesq_k;  // Study-specific squared standard error estimates (treated as known, i.e. constants)
}
parameters {
  real theta;  // Overall treatment effect
  real<lower=0> tau;  // Between-study standard deviation
}
model {
  // Prior:
  theta ~ cauchy(0, 5);
  tau ~ cauchy(0, 0.5);
  // Likelihood:
  hat_theta_k ~ normal(theta, sqrt(square(tau) + hat_sesq_k));
}

Now I first sampled from the posterior of this model in RStan without thinning, i.e. using the defaults:

MA_fit <- stan(
  file = path_to_stancode_above,
  data = MA_dat,
  seed = 46996L
)

with MA_dat as follows:

$K
[1] 2

$hat_theta_k
[1] 0.5099547 0.7532247

$hat_sesq_k
[1] 0.01134995 0.10782436

I then tried to assess if the posterior draws may be used for inference. I don’t get divergent transitions or hits of the maximum treedepth. \hat{R} and the new \hat{R} (implemented in rstan::Rhat()) look fine, too (maximum \hat{R}: 1.007, maximum new \hat{R}: 1.006). However, the trace plots look suspicious (for theta, tau, as well as lp__):

[trace plots for theta, tau, and lp__]

For me, those peaks indicate that the parameters occasionally drift out into the tails, where the draws get highly correlated. This is also reflected by the autocorrelation, which is quite high (maximum absolute value: 0.73). Of course, the heavy-tailed Cauchy priors require that the tails are explored as well, and it seems natural that in doing so, autocorrelation is high. Nevertheless, since we are unfortunately forced to work with a finite sample from the posterior, the peaks in those trace plots made me doubt that convergence to the true posterior is guaranteed. [Note: The effective sample size (ESS) ratio is always above 0.1, though (minimum ESS ratio: 0.13). The same holds for the bulk-ESS ratio (minimum: 0.18) and the tail-ESS ratio (minimum: 0.12). So perhaps – if my deductions so far are correct – the common threshold of 0.1 should be discussed. The same holds for the Monte Carlo standard error (MCSE), with a maximum value of 0.04 here (the common threshold being 0.1).]
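
For reference, the diagnostics quoted above can be computed along these lines (a minimal sketch, assuming MA_fit from the call above; rstan’s as.array() returns an iterations x chains x parameters array):

sims_theta <- as.array(MA_fit)[, , "theta"]  # iterations x chains matrix
rstan::Rhat(sims_theta)      # new Rhat
rstan::ess_bulk(sims_theta)  # bulk ESS; divide by the total number of draws for the ratio
rstan::ess_tail(sims_theta)  # tail ESS
acf(sims_theta[, 1], plot = FALSE)  # autocorrelation within the first chain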


Because of the diagnostics above, I then thinned the Markov chains with a thinning interval of 4:

MA_fit <- stan(
  file = path_to_stancode_above,
  data = MA_dat,
  seed = 46996L,
  iter = 5000, # Increased so that the total number of posterior draws remains the same.
  warmup = 1000,
  thin = 4
)

Now the diagnostic metrics improve (no divergent transitions, no hits of the maximum treedepth, maximum \hat{R}: 1.0016, maximum new \hat{R}: 1.0007, maximum absolute autocorrelation: 0.47, minimum ESS ratio: 0.46, minimum bulk-ESS ratio: 0.58, minimum tail-ESS ratio: 0.52, MCSEs at about 0.02), and so do the trace plots:

[trace plots for theta, tau, and lp__ after thinning]

Nevertheless, I am still not sure whether the remaining peaks indicate that the posterior draws should not be used for inference, keeping in mind that this is a finite sample from the posterior. Could someone with more experience tell me whether they would regard those posterior draws as valid for inference? If not, I guess I will have to switch to a less heavy-tailed prior, like a Student-t with 7 degrees of freedom?


Session info:

R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.19.2       ggplot2_3.2.1      StanHeaders_2.19.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         rstudioapi_0.10    magrittr_1.5       tidyselect_0.2.5   munsell_0.5.0      colorspace_1.4-1  
 [7] R6_2.4.1           rlang_0.4.2        fansi_0.4.0        dplyr_0.8.3        tools_3.6.2        parallel_3.6.2    
[13] pkgbuild_1.0.6     grid_3.6.2         gtable_0.3.0       loo_2.2.0          cli_2.0.0          withr_2.1.2       
[19] matrixStats_0.55.0 lazyeval_0.2.2     assertthat_0.2.1   tibble_2.1.3       lifecycle_0.1.0    crayon_1.3.4      
[25] processx_3.4.1     gridExtra_2.3      purrr_0.3.3        callr_3.4.0        ps_1.3.0           inline_0.3.15     
[31] glue_1.3.1         compiler_3.6.2     pillar_1.4.3       prettyunits_1.0.2  scales_1.1.0       stats4_3.6.2      
[37] pkgconfig_2.0.3

I guess the question is: what are the N_{eff}s?

The question of how much we trust our MCMC comes down to those, because we’re assuming our standard error for a parameter A is something like:

\frac{\sigma_A}{\sqrt{N_{eff}}}

where \sigma_A is the posterior standard deviation of the parameter.
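
As a concrete illustration of that formula with the fit above (a sketch; rstan’s summary() method exposes exactly these quantities):

s <- summary(MA_fit, pars = c("theta", "tau"))$summary
s[, "sd"] / sqrt(s[, "n_eff"])  # should roughly match the se_mean column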

So definitely there’ll be wiggle in the answer, and the question is how much. If you think the N_{eff} estimates Stan gives you are bad, presumably that means repeating the MCMC calculation gives estimates that are wildly different from what you’d expect if you trusted those standard error estimates.

If you’re concerned they’re misleading, repeat the calculation a few times and have a look.
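
A minimal sketch of that kind of check (the seeds here are arbitrary):

fits <- lapply(c(1L, 2L, 3L), function(s)
  stan(file = path_to_stancode_above, data = MA_dat, seed = s))
sapply(fits, function(f) summary(f, pars = "theta")$summary[, "mean"])
# The spread of these means should be on the order of se_mean.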

As far as thinning goes, don’t do it. I think the reasons to thin are memory constraints, and then there’s a reason to do it with SBC ([1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration), but if you’re just estimating posterior means or whatever, don’t worry about it. Just look at the N_{eff}s — you can pull them straight out of the fit, e.g.:
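
summary(MA_fit)$summary[, "n_eff"]  # per-parameter N_eff estimates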

Trace plots for the Cauchy are not very useful because of the thick tails. See the examples in https://arxiv.org/abs/1903.08008 and the online appendix https://avehtari.github.io/rhat_ess/rhat_ess.html. A rank plot would be a better visualization.
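
For example (a sketch assuming the bayesplot package, version 1.7.0 or later, which added the rank plot functions):

library(bayesplot)
draws <- as.array(MA_fit)  # iterations x chains x parameters
mcmc_rank_hist(draws, pars = c("theta", "tau"))     # per-chain rank histograms
mcmc_rank_overlay(draws, pars = c("theta", "tau"))  # overlaid version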

The old Rhat fails for the Cauchy, as discussed and demonstrated in https://arxiv.org/abs/1903.08008.
The new Rhat, ESS-bulk, and ESS-tail are reliable for the Cauchy.

If ESS-bulk and ESS-tail keep increasing proportionally to an increasing number of MCMC iterations, then it’s more likely that all is good. See figures 5, 12, 32, and 36 in https://arxiv.org/abs/1903.08008 for examples.
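
A rough sketch of that check (the iteration grid is arbitrary):

for (it in c(2000, 4000, 8000)) {
  fit <- stan(file = path_to_stancode_above, data = MA_dat,
              seed = 46996L, iter = it)
  sims_tau <- as.array(fit)[, , "tau"]
  cat(it, rstan::ess_bulk(sims_tau), rstan::ess_tail(sims_tau), "\n")
}
# ESS-bulk and ESS-tail should grow roughly linearly in the number of iterations.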

Even without the computational aspects, it might be sensible to use a more thin-tailed prior.

Distributions specified by heavy-tailed densities are counterintuitive for most people. To build up some intuition for what you should expect, I recommend playing around with ensembles of exact samples from Cauchy densities. You should see that your “peaks” are simply a consequence of having very heavy tails.
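
For example, even exact i.i.d. draws from a Cauchy, with no Markov dependence at all, show the same kind of spikes in a trace plot (a quick sketch, using the scale of the theta prior above):

set.seed(1)  # arbitrary seed
iid_draws <- rcauchy(4000, location = 0, scale = 5)  # exact, independent Cauchy samples
plot(iid_draws, type = "l", xlab = "iteration", ylab = "draw")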

That said, heavy tails are poor modeling practice for priors. They are poor quantifications of domain expertise because they constrain extreme values so little (allowing for all of those excursions). Moreover, as @avehtari noted, if the posterior inherits this heavy-tailedness, then most parameter functions will fail to be square integrable, and the MCMC estimators of those parameter functions (as well as their standard errors, effective sample sizes, and Rhats) will be ill-defined; you can compute estimators of all of those values, but they won’t correspond to anything meaningful. Finally, the default MCMC sampler in Stan uses variance estimates to tune the sampler, and if the variance doesn’t exist because your posterior density is too heavy-tailed, then you’ll get noisy adaptation.

Thanks a lot for all your answers. They really helped me understand MCMC and the diagnostics in Stan better, as well as the special considerations for heavy-tailed priors such as the Cauchy. I would have loved to mark all your answers as solutions, but I realized that’s not possible. So I simply picked the first one. I hope you’re not offended.
