Reasonable success rate in MCMC sampling

Hello! While debugging a model that seems to have some ergodicity issues, we noticed that our model has about a 10% sample success rate. We require 1000 samples to be accepted, with 300 warmup samples, and we can see via debug information that about 10,000 samples are offered during MCMC sampling.

Is this success rate reasonable? If not, what success rate would be expected for a ‘good’ model?

Thanks!

Hi, @s-hoyt and welcome to the Stan forums.

I’m not sure which Stan interface you’re using, so I can’t be specific, but first you want to evaluate the R-hat convergence diagnostics: those should all be very close to 1 (1.01 is OK, 1.05 is dodgy). Then the quantity you care about is the effective sample size (ESS), which measures the strength of your sample in terms of an equivalent number of independent draws. It is mainly of interest because it controls the standard error of expectation estimates, such as parameter estimates. Stan reports these standard errors, which are just the posterior standard deviation divided by the square root of the effective sample size.
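As a concrete illustration of that last relationship (a Python sketch with made-up draws, not output from any particular Stan interface):

```python
import numpy as np

# Hypothetical posterior draws for one parameter (illustration only;
# in practice these would come from your Stan fit).
rng = np.random.default_rng(0)
draws = rng.normal(loc=1.5, scale=0.4, size=4000)

posterior_sd = draws.std(ddof=1)

# Suppose Stan's summary reports an effective sample size of 1000
# (autocorrelated chains typically have ESS well below the raw draw count).
ess = 1000.0

# Monte Carlo standard error of the posterior-mean estimate:
# posterior sd divided by the square root of the ESS.
mcse = posterior_sd / np.sqrt(ess)
print(posterior_sd, mcse)
```

The point is that halving the ESS inflates the standard error by a factor of sqrt(2), which is why low ESS matters even when the chains look superficially fine.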

If you share your model, we can help diagnose what might be going wrong.

Thanks, @Bob_Carpenter!

I don’t think I’ll be able to post the full model, as it’s something we’re developing for research purposes. I’ve shared some diagnostic information below, though, along with a description of what we’re trying to do with the model. My main question is: based on the diagnostics and the plots, we definitely have multimodality. Does this mean we necessarily also have non-identifiability? (My terms/definitions are mostly based on my reading of this: Identity Crisis.) If so, are there steps we can take to initialize the model differently or reparameterize it, or will we always have this multimodality? I realize these questions might be hard or impossible to answer without seeing the code, so any feedback you can give would be appreciated, no matter how limited!

Our model estimates the amount of imbalance between RNA-seq (gene expression) read counts on two alleles, for related individuals. We use the parameter theta to represent the fold change between the alleles. In an original version of the model, we constrained theta to be less than 1 (less than 0 in log2 space); the biological interpretation is that an ‘affected’ allele has decreased expression. The model also estimates how affected alleles are inherited between these individuals, so it matters which allele is the affected one. The original model performs well on simulated data where theta is less than 1.

We have recently been working with an updated version of the model that allows theta to be greater than 1, the biological interpretation being that an affected allele could also have increased expression. When we apply this model to the same simulated data, it does not do as well as the previous version: the model sometimes returns 1/theta instead of theta. We are confused about why this happens; we don’t understand why the sampling isn’t exploring both the space around theta and the space around 1/theta. While both are local maxima, the likelihood around theta is definitely the single global maximum.
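For reference, the 1/theta mode is the exact mirror image of the theta mode in log2 space, since log2(1/theta) = -log2(theta). A quick check (purely illustrative, using the -2.5 value from the plots discussed below):

```python
import math

# logtheta = -2.5 in log2 space corresponds to this fold change:
theta = 2 ** -2.5

# The competing mode the sampler finds is 1/theta, which in log2 space
# is the mirror image: log2(1/theta) = -log2(theta) = 2.5
mirror = math.log2(1 / theta)
print(mirror)
```

This symmetry is why relaxing the theta < 1 constraint introduces a second mode at the reflected location rather than somewhere arbitrary.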

I am working with CmdStan, so here is the output of the diagnose command on 4 chains. logtheta is the only ‘real’ parameter in our model; p = theta/(1+theta), and the numerators and denominator all come from the generated quantities block. For this model, theta is in log2 space, constrained to lie between -10 and 10, and initialized to 0. I see the E-BFMI suggestion to reparameterize the model, but I don’t know what specifically that would entail.

Processing csv files: tmp_stanout_refactored_constant_logtheta_init0_run1.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run2.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run3.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run4.stanoutputs

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.0245856, is below the nominal threshold of 0.3 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

The following parameters had fewer than 0.001 effective draws per transition:
  logtheta, p, numerator[2], numerator[3], numerator[4], numerator[5], numerator[7], numerator[8], numerator[9], numerator[10], denominator
Such low values indicate that the effective sample size estimators may be biased high and actual performance may be substantially lower than quoted.

The following parameters had split R-hat greater than 1.05:
  logtheta, p, numerator[2], numerator[3], numerator[4], numerator[5], numerator[7], numerator[8], numerator[9], numerator[10], denominator
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

Here is a plot of all of the MCMC samples (pulled from stderr, so this includes accepted, rejected, and warmup). I plotted logtheta on the x-axis and the log density on the y-axis, where the log density is the value added to ‘target’ in the model block of the Stan file. Each color is a different chain (4 total; 2 settle at logtheta = -2.5 [the true value] and the other 2 settle at logtheta = 2.5 [1/true value]). It may be hard to see in this plot (I can upload chain-specific plots if that would be helpful), but each chain does have at least a couple of points on each peak. This is where a lot of our confusion comes from: why aren’t those samples accepted, or at least why isn’t that space explored more in all chains? I will also point out that the peak around logtheta = -2.5 is definitely higher (more likely) than the one around 2.5, so if those values were ever compared directly, our true value should be the one selected. Is this just not something Stan/the sampler is capable of, due to the discrete multimodality, or is there a way to reparameterize the model so that both peaks are explored? I imagine step size is also an issue here, but I am a little confused about that, since we do see the sampler visit both peaks at some point.

I would appreciate any feedback you can give, but I understand if there isn’t enough information here to make any conclusions. If there are any resources you can direct me to that might be able to give me more direction that would also be really helpful. Thank you!

What do you mean by this? Can you show what you’re calling and what the output looks like? I’m confused because Stan doesn’t do rejection sampling. Are you getting a ton of warnings from Stan about failed bounds checks or reject statements?

It looks like what’s happening here is classic multimodality with chains getting stuck in local modes. What makes you think that the chains “should” be able to jump between these modes? It looks like they are 350+ units of log-likelihood deep!

This is not an example of non-identifiability, just an example of multimodality.

Hi @jsocolar, thanks for your response! Maybe ‘reject’ was not a precise term for me to use. I added print statements to my Stan code so that I could see all of the proposals for logtheta. Depending on the version of the model, about 5,000–10,000 values of logtheta or theta are printed to stderr, but only 1,000 appear in the stanoutputs file (1,000 is the number of MCMC samples I specified when I ran the model). I was referring to all of the proposed values in stderr that are not in the stanoutputs file as ‘rejected’ samples; maybe I should have said they are proposals not accepted during the Metropolis accept step. I don’t believe we are getting warnings about bounds checks or reject statements.

Regarding jumping between the modes, my expectation is less that the sampler should easily jump between modes throughout sampling, and more that we would like to see more samples in both modes before it settles. Is it possible to parameterize the model so that the sampler considers more disparate proposals and their likelihoods before settling into one mode? The chains that end up around logtheta = 2.5 don’t settle there because that likelihood is better than at -2.5; it’s only because, by random chance, the sampler considered a value closer to the center of that peak than to the center of the -2.5 peak.

After constructing an approximate Hamiltonian trajectory via discrete leapfrog steps and triggering the no-U-turn criterion, Stan samples a single value from along this trajectory via multinomial sampling. This necessarily selects a single value from among quite a few candidates, and the ratio of “accepted” values to total leapfrog steps depends only on the realized treedepth. There’s nothing pathological or unexpected here.
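A toy sketch of that selection step (Python, not Stan’s actual implementation; the log densities below are made up): given the states visited along one trajectory, a single one is kept with probability proportional to its density.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unnormalized log densities of the states visited along a
# single simulated trajectory (one value per leapfrog state).
log_p = np.array([-3.2, -1.1, -0.7, -2.5, -0.9, -1.8, -4.0, -0.8])

# Multinomial sampling: exactly one state is kept per iteration and the
# rest are discarded -- which is why far more proposals show up in stderr
# (via print statements) than rows in the output file.
weights = np.exp(log_p - log_p.max())   # subtract max for numerical stability
probs = weights / weights.sum()
kept = rng.choice(len(log_p), p=probs)
print(kept)
```

So with treedepth around 3–4, seeing roughly 5,000–10,000 printed proposals for 1,000 retained draws is exactly what one would expect.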

In general, I would expect the chains to fall into one mode or the other, and not necessarily to jump around between modes before settling.

With a priori knowledge of where the modes are, it is possible to parameterize the model in such a way that it can mix between the modes. If you know that there will always be one positive and one negative mode, you can marginalize over a binary indicator for whether theta is positive or negative. This approach will be particularly effective if you know that in general the positive and negative modes will be symmetric about zero.
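A minimal sketch of that marginalization in Python (assuming, hypothetically, Gaussian-shaped modes symmetric about zero at ±mu; in the real model this would be a log_sum_exp over the two sign branches inside the Stan target):

```python
import numpy as np

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def log_normal(x, m, s):
    """Log density of a normal distribution (illustrative stand-in
    for the model's actual likelihood around each mode)."""
    return -0.5 * ((x - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

def log_marginal_density(logtheta, mu=2.5, sigma=0.5, w_pos=0.5):
    """Marginalize over the binary indicator for the sign of the mode:
    log p(logtheta) = log( w_pos     * p(logtheta | positive mode)
                         + (1-w_pos) * p(logtheta | negative mode) )
    """
    lp_pos = np.log(w_pos) + log_normal(logtheta, mu, sigma)
    lp_neg = np.log(1.0 - w_pos) + log_normal(logtheta, -mu, sigma)
    return log_sum_exp(lp_pos, lp_neg)

# The marginalized target is a single smooth density over logtheta, so a
# single chain sees both modes instead of conditioning on one sign.
print(log_marginal_density(-2.5), log_marginal_density(2.5))
```

With w_pos = 0.5 the two modes have equal mass; in your case the prior on recombination would make w_pos much smaller, which is exactly the height difference you observe between the peaks.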

It seems very probable that the secondary mode around 1/theta is simply the same mode as the original, but where the model swaps its opinion of which allele is the affected one and which allele is the unaffected one. If I’m right, the fact that these modes are of different heights is due to the prior.

A question for you: given two alleles whose expression differs by a factor of theta, how do you know which one is affected and which one is unaffected, and therefore whether the correct value of theta is greater or less than one?

Good to know that nothing unexpected is happening!

That’s a good idea to include a binary indicator, we will definitely try that.

Yes, you are correct that the only difference between those modes is which allele is the affected copy. We are working with data from familial trios (mother, father, child), and we order the alleles in a parent based on which one is passed to the child. If the mother has an affected allele with decreased expression but does not pass it down, and neither the father nor the child has an imbalance between alleles, then logtheta < 0 makes sense as a mode of inheritance and should be our true case. For the affected allele in the mother to swap to the other copy (now with increased expression and logtheta > 0), she would then be passing the affected allele down to the child; for the read counts in the child to still make sense (with the other two individuals unchanged), there would have to be a recombination event such that the child does not actually inherit the imbalance/affected copy. We consider recombinations to be very rare. The probability of that recombination is the only difference between the two modes, and that is why the peak around 2.5 is lower.