I’ve fitted a model to my data using 4 chains with 1000 iterations each: no warnings, and every Rhat is 1.0. I assume the resulting 2000 posterior samples are enough to estimate the central 95% posterior interval.

Suppose I would like to obtain the central 99% posterior interval instead. How many samples would be considered reasonable? Is it valid to obtain more samples simply by increasing the number of chains, for example 20 chains with 1000 iterations each for 10,000 samples?

As long as the multi-chain R-hat is good (and you don’t see other convergence issues) you can combine chains. In many ways running more chains is better, because it’s more likely to expose issues with the posterior. That said, out of curiosity, why 99% posterior intervals?
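To make "combine chains" concrete, here is a minimal sketch that pools the post-warmup draws from every chain and reads off the central 99% interval. The draws are synthetic standard-normal stand-ins, not output from a real fit, and `central_interval` is a hypothetical helper name. I am assuming 500 post-warmup draws per chain, which matches the arithmetic in the question (1000 iterations giving 500 kept draws).

```python
import random


def central_interval(chains, level=0.99):
    """Pool draws from all chains and return the central `level` interval."""
    pooled = sorted(d for chain in chains for d in chain)
    n = len(pooled)
    alpha = (1.0 - level) / 2.0

    def quantile(p):
        # Linear interpolation between neighbouring order statistics.
        pos = p * (n - 1)
        lo_idx = int(pos)
        hi_idx = min(lo_idx + 1, n - 1)
        frac = pos - lo_idx
        return pooled[lo_idx] * (1.0 - frac) + pooled[hi_idx] * frac

    return quantile(alpha), quantile(1.0 - alpha)


rng = random.Random(1)
# Assumption: synthetic stand-in for 20 chains x 500 post-warmup draws.
chains = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(20)]
lower, upper = central_interval(chains, level=0.99)
print(f"central 99% interval: ({lower:.2f}, {upper:.2f})")
```

Pooling like this only makes sense once R-hat and the other diagnostics look fine across all chains.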

The customer complained that the current results based on the central 95% interval are overwhelming, and would like to prune them to some extent. I don’t know how to address the issue, and would like to hear any suggestions. The 95% interval seems to be the custom, but isn’t such a cutoff open to the same criticism leveled at the conventional p-value threshold of 0.05: dichotomized decisions and arbitrariness?

Depending on context I would resort to ranking, especially if this is something where statistically identified leads are further confirmed. For example in GWAS, any particular lead might be followed up by a manipulative study to confirm the association. In drug development you might need a set of candidates to put through further screening. What these have in common is that you need to identify the most promising set of candidates to follow up with a limited set of resources. So think top 10 rather than a discretized yes/no.
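As a rough illustration of ranking instead of thresholding, the sketch below scores each candidate by the share of its posterior draws above zero and keeps the top few. The scoring rule, the candidate names, and the synthetic draws are all assumptions made for the example; in practice the ranking criterion should come from what the follow-up study actually cares about.

```python
import random


def top_k(draws_by_name, k=10):
    """Rank candidates by the fraction of posterior draws above zero."""
    scores = {
        name: sum(d > 0 for d in draws) / len(draws)
        for name, draws in draws_by_name.items()
    }
    # Highest score first; keep only the k most promising candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


rng = random.Random(7)
# Assumption: hypothetical candidates with synthetic posterior draws.
true_means = {"cand_a": 0.0, "cand_b": 0.5, "cand_c": 1.0, "cand_d": 2.0}
draws_by_name = {
    name: [rng.gauss(mu, 1.0) for _ in range(2000)]
    for name, mu in true_means.items()
}
ranking = top_k(draws_by_name, k=2)
for name, score in ranking:
    print(f"{name}: P(effect > 0) ~ {score:.3f}")
```

The list length `k` is then set by the available follow-up resources rather than by an interval width.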

Thanks for the suggestion! I’ll discuss this possibility with the customer.

Just be careful – you’ll need about 10,000 effective samples to pin down the 1% and 99% quantiles well enough, not just 10,000 samples.
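A back-of-the-envelope way to see why the tails need more effective samples: for independent draws, the standard error of an estimated p-quantile is roughly sqrt(p(1 - p) / n) / f(q), where f is the density at the quantile q. The sketch below plugs an effective sample size in for n and uses a standard normal as a stand-in posterior. Both are simplifying assumptions, and `quantile_mcse` is a hypothetical helper, not a function from any Stan interface.

```python
from math import sqrt
from statistics import NormalDist


def quantile_mcse(p, n_eff, dist=NormalDist()):
    """Asymptotic standard error of the estimated p-quantile of `dist`."""
    q = dist.inv_cdf(p)
    return sqrt(p * (1.0 - p) / n_eff) / dist.pdf(q)


# 0.025 is the lower end of a central 95% interval, 0.005 of a central 99% one.
for p in (0.025, 0.005):
    for n_eff in (2000, 10000):
        print(f"p={p:<6} n_eff={n_eff:<6} mcse ~ {quantile_mcse(p, n_eff):.3f}")
```

Under these assumptions, the 0.5% quantile with 10,000 effective samples is pinned down about as well as the 2.5% quantile with 2000, which is in line with the number quoted above.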

A very good point, Mike! I guess I’m fine so far with the current result: the number of effective samples was 2000 out of 2000 draws for the effect of interest. However, the number of effective samples was pretty low (250-400) for an effect I’m not interested in. Does this indicate a problem with the model or its parameterization overall?

I don’t think there’s an established custom as there’s no established “significance level”. If you choose 95% intervals, you are calibrated if 95% of the true values fall in those 95% intervals; same for 10%, 50%, or 99% intervals.
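The calibration statement can be checked by simulation. The sketch below assumes a toy conjugate setup: the true value is drawn from a N(0, 1) prior, a handful of normal observations are generated around it, and the exact posterior interval is tested for whether it contains the truth. The model, the settings, and the `coverage` helper are illustrative choices of mine, not anything from the thread's actual analysis.

```python
import random
from statistics import NormalDist


def coverage(level, n_rep=4000, n_obs=5, sigma=1.0, seed=0):
    """Fraction of central `level` posterior intervals containing the truth."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = 0
    for _ in range(n_rep):
        theta = rng.gauss(0.0, 1.0)  # true value drawn from the N(0, 1) prior
        ybar = sum(rng.gauss(theta, sigma) for _ in range(n_obs)) / n_obs
        # Conjugate normal-normal update with known sigma.
        post_prec = 1.0 + n_obs / sigma**2
        post_mean = (n_obs / sigma**2) * ybar / post_prec
        post_sd = post_prec ** -0.5
        if abs(theta - post_mean) <= z * post_sd:
            hits += 1
    return hits / n_rep


for level in (0.10, 0.50, 0.95, 0.99):
    print(f"nominal {level:.2f} -> empirical {coverage(level):.3f}")
```

When the model matches the data-generating process, each empirical coverage should land close to its nominal level, which is the sense in which no particular interval width is special.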