I’ve got a model which uses a non centered paramterization. So for a parameter
alphaT there is also a parameter
alphaT_raw and in transformed parameters I have
vector[J_State] alphaT = muAlphaT + (sigmaAlphaT * alphaT_raw);
In some model runs I get n_eff/N warnings (so n_eff < 10% * N) for
muAlphaT and, for various indexes n,
alphaT_raw[n] but not
I get no “mcse/sd” or “Rhat” warnings for those (or any) parameters. Nor any divergences.
I’m not sure what to make of that! Only
alphaT is used for post-stratification, etc. So can I ignore these warnings since I am getting sufficient n_eff for the parameters that are meaningful in terms of the outputs of the model?
Is this a thing that happens often and has a standard explanation? Or is there a good way to think about this that might help me?
And no EBFMI warnings either presumably?
Have you tried a centered version too? (That’s a shot in the dark, as usually if the data implies a strongly-informed likelihood, you’d start getting outright divergences with a non-centered)
Right. EBFMI is fine.
I have tried centered as well, although with a simpler version of the model. I don’t think I had this issue, though I’d have to track back some to see why I switched to non-centered. I think I was getting divergences with the centered version.
I’ve been building up from simpler pieces and it’s just this last jump that has this issue (though I’ve seen it before, off and on). Complicating this is that this model takes 12 hours or so to run! But the one-step-simpler version (which takes 2 hours) doesn’t show this issue, so I can’t easily try a lot of things.
If you do venture more exploration of centered/non-centered while we wait for anyone else to chime in, maybe consider that while most guides/posts on centered-vs-non only show joint centering/non-centering of both the mean & sd in concert, they can actually be separated:
mu ~ ... ;
sigma ~ ... ;
//theta ~ normal(mu,sigma) ;
//theta_raw ~ std_normal() ;
//theta = mu + sigma*theta_raw ;
// mean-centered, sd-non:
//theta_raw ~ normal(mu,1) ;
//theta = sigma*theta_raw ;
// mean-non, sd-centered:
//theta_raw ~ normal(0,sigma) ;
//theta = mu+theta_raw ;
So you might consider trying all four versions.
That’s an interesting idea!
I roughly (very roughly!) understand the idea that if the contexts are weakly informed, then non-centered will work better. This model has been tricky to build up from simpler bits because some of the pieces, once added, make the contexts more weakly informed. So what was once a context difference becomes explained instead by a predictor, once it’s added to the model. So as I add the predictors in, I’ve had to shift from centered to non-centered.
Anyway, is there some equivalent, though obviously more subtle, way of understanding in what situations you might want the sigma or mu to be non-centered rather than both? Like something specific about how the contexts are weakly informed? Like, is it about whether the amount of pooling is weakly informed (sigma) or the actual value of the parameter (mu)?
Thanks for the help!
You know what, I actually misremembered (and possibly originally misunderstood) my source for the idea of doing the mean & sd separately. Looking back at it now (one of @betanalpha’s amazing deep dives, this one on hierarchical modelling) I realize that his discussion there is not about the possibility of centering/non-centering the link between theta & its mean/sd separately as my code above shows, but instead (and more sensibly now that I think about it) that tradition is to treat all of the elements in theta the same, whereas it’s perfectly possible that some have more strongly informed data than others and thereby some should be centered and some should be non-centered, as in:
int num_theta_elements_to_be_centered ;
int theta_elements_to_be_centered ;
int num_theta_elements_to_be_noncentered ;
int theta_elements_to_be_noncentered ;
vector[num_theta_elements_to_be_noncentered] theta_noncentered ;
theta_centered ~ normal(mu,sigma) ;
theta_noncentered ~ std_normal() ;
vector[num_theta] theta ;
theta[theta_elements_to_be_centered] = theta_centered ;
theta[theta_elements_to_be_noncentered] = mu + sigma*theta_noncentered ;
Of course, given that
num_theta is usually >4, we’ve now substantially expanded how much trial-and-error is necessary to figure out what will sample well. I still find @mgorinova et al’s automatic pre-warmup tuning a super cool solution and find it odd that it hasn’t really been taken up to any extent in Stan core.
Anyway, sorry for the tangent on all that. Hopefully my replies don’t leave the thread seemingly-answered for others glancing through new posts, as I don’t think we’ve really come up with any good understanding for the pathology your OP notes.
Ah, right. That would be difficult in this case.
That paper is super cool! Are either of the transformation algorithms they are suggesting relatively straightforward to implement in Stan core?
Also, do you have any insight into how pathological my issue is, given that it only shows up in “latent” (is that the right word here?) variables? That is, pathological in terms of relying on the posterior distribution of the non-latent variables, all of which have reasonable N_eff.
It would add a whole new phase to adaptation, occurring pre-warmup, so while I’m not C++ savvy at all, I can’t imagine it’d be easy. But possibly one could crib from the recent work on Pathfinder, which has a different goal (getting to the typical set quickly) but also reflects a pre-warmup phase.
That’s a really good question on whose answer I have little intuition. I’m wondering if maybe @betanalpha or @Bob_Carpenter might have any input?
This approach seems to achieve something different from the other three because here, mu is (implicitly) multiplied by sigma as well.
Ah, yup that’s an error, good catch.
Given the lack of details I can only offer speculation here.
An incremental effective sample size for an expectand, i.e. the expectand effective sample size divided by the number of iterations, less than 0.1 is not problematic in any way. Assuming everything is well-behaved the precision of the corresponding Markov chain Monte Carlo estimator may not be sufficient, but it won’t indicate any fundamental pathologies.
My guess here is that the
alphaT_raw[n] that exhibit smaller effective sample sizes are more strongly informed by the likelihood function so that the corresponding
alphaT[n]are strongly constrained. This constraint on
muAlphaT + sigmaAlphaT * alphaT_raw[n] couples
sigmaAlphaT which can induce stronger autocorrelations in the realized Markov chains and hence a smaller effective sample size that that for
Often this coupling is so bad that one gets an inverted funnel and divergences or an E-FMI warning, but sometimes it can be just mild enough that is slows the sampler but doesn’t compromise accuracy. This could be such a case.
That said if this is such a case then you could see substantial speed ups by centering just those parameters while leaving the rest non-centered. See the end of Section 4.2.3 in Hierarchical Modeling for some examples for who to program a mixed parameterization in Stan.
@betanalpha Thanks! This is immensely helpful.
It’s both useful to know that these low n_eff/N don’t pose a problem in themselves and to know that they indicate something that might benefit from a parameter transformation.
In my case, I don’t think I’ll be able to center and non-center flexibly enough to optimize all that. There are a lot of contexts and parameters and I’d need a systematic way to figure out which need what. And then I’d have to add some features to my code generator to manage that. Though I think the code-generation would be relatively straightforward.
It does make me yearn for something like automatic pre-warmup tuning , mentioned above by @mike-lawrence.
One thing I’ve found confusing as I’ve started using Stan is trying to figure out when the sampling is slow because the model is complicated in some way and when it is slow because something is wrong/could be re-parameterized. Your post above is helpful here by pointing out another thing that might indicate sub-optimal parameterization. Is there a good resource for how to use all the ShinyStan diagnostics to figure out when there might be a faster parameterization?
These are not mutually exclusive. Reparameterizations transform the model configuration space to be more compatible with the posterior geometry; the more complicated the model the more complicated the posterior geometry can be and the more complicated of reparameterizations might be useful.
In general one has to work with the Stan diagnostics to hypothesize and investigate possible posterior geometry complexities. Because every model can have different complexities this has to be done for each bespoke model and largely can’t be automated as much as people would like it to be. For more discussion of investigation strategies see Identity Crisis.
Certain modeling techniques can have specific complexities to which they are particularly susceptible, which motivates certain patterns, but the interactions between the different modeling techniques that form a full model can introduce additional complexities.
In the hierarchical modeling case study I present strategies for motivating an initial parametrization of the contexts as well as investigating which parameterizations might be insufficient. See for example my factor modeling case study, Impact Factor, to see these strategies in action when there are lots of interacting hierarchies and hence lots of contexts to consider.