In a manuscript draft, I report the number of iterations, the number of chains, and the size of the burn-in as part of the description of my data analysis methods.
In the results, I report the maximum (i.e., worst) potential scale reduction factor \hat{R} across all of my parameters. I was also planning to report the minimum (i.e., worst) effective sample size. I’ve noticed that the log-posterior tends to be the “parameter” with the smallest n_eff by quite a bit (~1000 in my case). Do other Stan users report minimum effective sample sizes like this? If so, do you include the log-posterior in the parameter set for taking the minimum?
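For concreteness, here is the kind of summary I have in mind, as a rough sketch assuming an rstan fit object named `fit` (the monitored set includes lp__ unless it is dropped explicitly):

```r
library(rstan)

# one row per monitored quantity (parameters, generated quantities, lp__)
s <- summary(fit)$summary

max_rhat <- max(s[, "Rhat"], na.rm = TRUE)    # worst (largest) Rhat
min_neff <- min(s[, "n_eff"], na.rm = TRUE)   # worst (smallest) n_eff

# the same minimum with lp__ excluded, to see how much it drives the result
keep <- rownames(s) != "lp__"
min_neff_no_lp <- min(s[keep, "n_eff"], na.rm = TRUE)
```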
Good questions. I think it would be good if the community could point to a paper that presents a good way of writing these things down. I usually write that all \hat{R} values were approximately 1.00, that n_eff/N was > 0.2, and that the chains seemed to have mixed well. But what do I know :)
Here’s what I said (still waiting on the reviewers’ feedback):
The RStan function check_hmc_diagnostics found no problems
encountered by the sampler and the RHat values for all parameters
were extremely close to 1.0, indicating that the model had
successfully converged.
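(For reference, that check is just a call along these lines, assuming a stanfit object named `fit`:)

```r
library(rstan)

# prints the divergence, treedepth, and energy (E-BFMI) diagnostics for the fit
check_hmc_diagnostics(fit)
```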
The n_eff is going to be small for hierarchical parameters; these have constrained geometries that make it difficult for the sampler to explore.
I was planning to report the raw N_eff (“minimum effective sample size across all parameters was 1,500”), not N_eff/N. Basically, I was thinking of using it to convey an approximate number of independent(ish) parameter draws.
I think the original question is still unanswered: what exactly do n_eff and Rhat mean for the log posterior (lp__)? Are these diagnostics calculated for lp__ just because it’s lumped in with all the other parameters, and are they therefore meaningless? I understand why lp__ is saved like a parameter, but does its autocorrelation have any impact on sampling?
lp__ is a random variable, and therefore n_eff and Rhat mean the same thing for lp__ as for all other parameters in the model. From the Stan Reference Manual:
A question that often arises is whether it is acceptable to monitor convergence of only a subset of the parameters or generated quantities. The short answer is “no,” but this is elaborated further in this section.
For example, consider the value lp__, which is the log posterior density (up to a constant).
It is thus a mistake to declare convergence in any practical sense if lp__ has not converged, because different chains are really in different parts of the space.
Sounds like this is the answer, then. We should treat lp__ as a parameter to monitor, and I should include it when I take the minimum effective sample size.
This discussion has made me think more about why these are the numbers I choose to report. I use \hat{R} and N_eff/N to convince myself convergence has been achieved, but I think about N_eff as a way to justify that the Monte Carlo error of my estimates is low. I believe Bayesians in the old days used to thin their very long chains because the autocorrelation was so bad that 100,000 posterior samples of a parameter might only contain the information of 1,000 or 5,000 independent posterior draws (and also data storage was more expensive). With Stan, I hardly ever thin and prefer to just report my effective sample sizes as a way to argue that I have obtained enough [post-convergence] parameter draws that the uncertainty is mostly due to real uncertainty in the posterior and not in my numerical approximation of it. Hopefully someone will chime in if they think this is a misguided way of looking at it.
Thinning is necessary for samplers that use MH, Gibbs, or similar random-walk based sampling algorithms. Stan’s HMC-NUTS sampler moves through the parameter space very effectively; in some cases it will even produce anti-correlated draws.
The current recommendation is therefore not to thin: ask for the number of samples you need.
If you’re lucky, you might even get more than that ;-)
There are two issues with reporting MCMC fits – verifying geometric ergodicity (i.e. that MCMC estimators for expectation values will be reasonably well behaved after only a finite number of iterations) and then quantifying performance (i.e. how accurate those estimators are). The former considers the behavior of the entire fit whereas the latter considers only the behavior for a specific expectation value.
Rhat near 1 for every function whose expectation value is well defined is a necessary condition for geometric ergodicity, but a relatively weak one, especially if you are running only a few chains. When communicating Rhat results, one should quote every function considered, for maximal power. When using Hamiltonian Monte Carlo, divergences and the energy fraction of missing information (E-FMI) are much more sensitive diagnostics for geometric ergodicity and consider the joint behavior directly; I strongly recommend that those be considered first and foremost.
Once geometric ergodicity has been verified (or a lack of geometric ergodicity optimistically rejected…), one can consider the performance of individual MCMC estimators, which is quantified by the effective sample size. Some functions are more sensitive to the autocorrelations in the Markov chain and some are less sensitive, so the effective sample size that manifests in their expectation estimates can vary strongly. Consequently, conditioned on geometric ergodicity, one has to consider and communicate only the effective sample sizes of the expectation values being used in the analysis. Even better, one can just report the MCMC estimator and its standard error estimate when discussing each relevant expectation value.
That said, keep in mind that we estimate the effective sample size, and that estimator need not be well behaved. The N_eff / N check provides some heuristic verification that the autocorrelations in the Markov chain are small enough that we can accurately estimate the effective sample size for each expectation estimator with a reasonable number of iterations (i.e. the defaults in Stan). In other words, one wants to first check for obstructions to geometric ergodicity, then check for problems in estimating the effective sample sizes, and then finally report the relevant effective sample sizes (or the MCMC estimator standard errors).
I typically report Rhats or N_eff / N results with a verification that they’re all within the desired regimes, but if one wants to report more then I highly recommend showing histograms of all of the individual values instead of summarizing with a minimum or maximum.
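Something along these lines would produce those histograms; this is only a sketch, assuming an rstan fit object named `fit`:

```r
library(rstan)

s <- summary(fit)$summary
n_draws <- length(extract(fit, pars = "lp__")[[1]])  # total post-warmup draws

# distribution of the diagnostics across all monitored quantities,
# rather than just the worst-case minimum or maximum
par(mfrow = c(1, 2))
hist(s[, "Rhat"], main = "Rhat", xlab = "Rhat")
hist(s[, "n_eff"] / n_draws, main = "N_eff / N", xlab = "N_eff / N")
```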
I feel fairly confident about convergence (or rather, I do not move to writing up results until I do). My \hat{R} values are < 1.01, not 1.1, and the lowest N_eff/N I am worried about is still over 0.3. I always change adapt_delta or the model itself (e.g., making it less flexible) until I get zero divergences and zero E-FMI warnings.
So if I’m understanding @betanalpha, he is suggesting that the effective sample size (ESS) that really matters is not any individual parameter’s ESS but the ESS for the “parameter of interest” (assuming you’ve already deluded yourself into believing convergence and your results are not so garbage that the ESS estimates themselves are bad). Unfortunately, my “parameter of interest” is calculated in post-processing steps in R. Is there an easy way to get an ESS for something like that?
In principle, any well-defined expectation with respect to the posterior can be estimated and hence have an associated ESS. It really depends on what your quantity of interest is. If you can provide some more detail I have a suspicion @betanalpha and others (myself included) would be able to give further advice.
If that post-processing is independent for each draw from the joint posterior of the parameters, e.g., if you have parameters a and b and want to compute c = a/b, then you can do the computation for each draw separately and you get draws from the posterior of c. If you maintain the chain structure for c, then you can use the monitor() function in R to compute Rhat and n_eff for c.
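A minimal sketch of that workflow, assuming scalar parameters named `a` and `b` in the fit (the names are placeholders):

```r
library(rstan)

# keep the chain structure: an iterations x chains x parameters array
draws <- extract(fit, pars = c("a", "b"), permuted = FALSE)

# compute the derived quantity draw by draw, preserving iterations x chains
c_draws <- draws[, , "a"] / draws[, , "b"]

# monitor() expects an iterations x chains x quantities array
dim(c_draws) <- c(dim(c_draws), 1)
monitor(c_draws)  # reports summaries, Rhat, and effective sample size for c = a/b
```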
If I tell you what my quantities of interest are, you will rightly conclude that I just made them up. But it counts as “methods development” and “motivation of new estimands” because I am writing a paper to justify them.
More seriously, the quantities are both functions of every parameter in my model + every observation’s observed covariates and outcomes because they are finite sample causal effects. I have some old but still-mostly-relevant slides about it here (if you click on “presentation” it automatically starts the download): https://ww2.amstat.org/meetings/jsm/2018/onlineprogram/AbstractDetails.cfm?abstractid=329352.
I did not think to keep the chain structure in my calculations, @avehtari, but I could probably back it out from the vector of posterior draws I end up with. I’ll look into what monitor() expects from its inputs.
Alternatively, you could move all of the post-processing calculations to the generated quantities block, and the MCMC estimation will automatically be applied to them. It will also be easier to share the intent of your model with anyone, as it would all be contained within a single Stan program.
Otherwise it’s always a good idea to write your post-processing code to accept vector inputs, in which case you can plug in the samples retrieved by extract(fit).
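A rough sketch of that pattern, with made-up parameter names (`theta`, `sigma`) and a placeholder post-processing function:

```r
library(rstan)

post <- extract(fit)  # named list of draw arrays, one element per parameter

# post-processing written to accept whole vectors of draws, so it returns
# one value of the quantity of interest per posterior draw
my_estimand <- function(theta, sigma) theta / sigma  # placeholder definition

qoi_draws <- my_estimand(post$theta, post$sigma)
mean(qoi_draws)  # MCMC estimate of the quantity of interest
sd(qoi_draws)    # posterior standard deviation estimate
```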