Thanks, @Bob_Carpenter!
I don’t think I’ll be able to post the full model, as it’s something we’re developing for research purposes. I’ve shared some diagnostic information below though, and a description of what we are trying to do with the model. My main question is: we definitely have multimodality based on the diagnostics and the plots. Does this mean we necessarily also have non-identifiability? (terms/definitions mostly based on my reading of this: Identity Crisis) If so, are there steps we can take to initialize the model in another way or reparameterize it, or will we always have this multimodality? I realize these questions might be hard/impossible to answer without seeing the code, so any feedback you can give would be appreciated, no matter how limited!
Our model is for estimating the amount of imbalance between RNA-seq (gene expression) read counts on two alleles, for related individuals. We use the parameter theta to represent the fold change between the alleles. In an original version of our model, we limited theta to be less than 1 (or less than 0 in log2 space), for which the biological interpretation is that an ‘affected’ allele has a decrease in expression. Our model is also interested in estimating how affected alleles are inherited between these individuals so it matters which allele is the affected one. The original model performs well on simulated data where theta is less than 1. We have recently been working with an updated version of the model that allows theta to be greater than 1, for which the biological interpretation is that an affected allele could have an increase in expression. When we apply this model to the same simulated data we see that it does not do as well as the previous model. The issue is that the model sometimes returns 1/theta instead of theta. We are confused about why this is happening; we don’t understand why the sampling isn’t exploring both the space around theta and the space around 1/theta. While these are both local maxima, the likelihood around theta is definitely the single global maximum.
I am working with cmdStan, so here is the output of the diagnose command on 4 chains. logtheta is the only ‘real’ parameter in our model; p = theta/(1+theta) and the numerators and denominator are all from the generated quantities block. For this model, theta is in log2 space, limited to be between -10 and 10, and is initialized to 0. I see the E-BFMI suggestion to parameterize the model but I don’t know what specifically that could entail.
Processing csv files: tmp_stanout_refactored_constant_logtheta_init0_run1.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run2.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run3.stanoutputs, tmp_stanout_refactored_constant_logtheta_init0_run4.stanoutputs
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.0245856, is below the nominal threshold of 0.3 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.
The following parameters had fewer than 0.001 effective draws per transition:
logtheta, p, numerator[2], numerator[3], numerator[4], numerator[5], numerator[7], numerator[8], numerator[9], numerator[10], denominator
Such low values indicate that the effective sample size estimators may be biased high and actual performance may be substantially lower than quoted.
The following parameters had split R-hat greater than 1.05:
logtheta, p, numerator[2], numerator[3], numerator[4], numerator[5], numerator[7], numerator[8], numerator[9], numerator[10], denominator
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizating your model with additional prior information or a more effective parameterization.
Processing complete.
Here is a plot of all of the MCMC samples (pulled from stderr so this includes accepted, rejected, and warmup). I plotted logtheta on the x-axis and the likelihood on the y-axis, where likelihood is the value added to ‘target’ in the model section of the stan file. Each color is from a different chain (4 total; 2 settle at logtheta=-2.5 [true value] and the other 2 settle at logtheta = 2.5 [1/true value]). Maybe kind of hard to see in this plot (I can upload chain-specific plots if that would be helpful) but each chain does have at least a couple of points on each peak. This is where a lot of our confusion is coming from: why aren’t those samples accepted, or at least why isn’t that space explored more in all chains? I will also point out that the peak around logtheta=-2.5 is definitely higher (more likely) than the one around 2.5, so if those values were ever compared directly, our true value should be the one selected. Is this just not something Stan/the sampler is capable of doing, due to the discrete multi-modality, or is there a way to re-parameterize the model to have both peaks be explored? I imagine there is an issue of the step size here as well, but I am a little confused about that, since we do see that the sampler visits both peaks at some point.
I would appreciate any feedback you can give, but I understand if there isn’t enough information here to make any conclusions. If there are any resources you can direct me to that might be able to give me more direction that would also be really helpful. Thank you!