I see how sum-to-zero is not a great solution in some sense - but what do people do to avoid the issues it seems to bring in more complex models? For me, bumping up adapt_delta mostly works, but it feels quite unsatisfactory (and performance suffers). Or is that actually fine for most people (and hence my problems might lie in other parts of the models)?
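For concreteness, this is the kind of band-aid I mean - a minimal sketch assuming cmdstanpy, with a made-up additive model and placeholder data:

```python
# Sketch only: the usual "bump adapt_delta" workaround, assuming cmdstanpy is installed.
# The model and data below are made up just to have something runnable.
from cmdstanpy import CmdStanModel

stan_code = """
data {
  int<lower=1> K;
  vector[K] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma_beta;
}
model {
  alpha ~ normal(0, 2);
  sigma_beta ~ normal(0, 1);
  beta ~ normal(0, sigma_beta);  // additive group effects, no sum-to-zero constraint
  y ~ normal(alpha + beta, 1);
}
"""
with open("additive_model.stan", "w") as f:
    f.write(stan_code)

model = CmdStanModel(stan_file="additive_model.stan")
fit = model.sample(
    data={"K": 8, "y": [2.1, 0.3, -1.2, 0.8, 1.5, -0.4, 0.0, 2.7]},  # placeholder data
    adapt_delta=0.99,  # default is 0.8; higher means smaller steps and slower sampling
    seed=1,
)
print(fit.diagnose())  # check whether the divergences actually went away
```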
Regarding the sufficient statistic line of thinking: I’ll try to rephrase it to see if I understand you correctly.
Let’s say we have K groups, then the mean for group i is
\mu_i = \alpha + \beta_i, where
\alpha \sim N(\mu_\alpha, \sigma_\alpha), \quad \beta_i \sim N(0, \sigma_\beta).
We can reparametrize in terms of a zero-sum vector \bar\beta such as
\bar\beta_i = \beta_i - \mu_R, \quad \bar\alpha = \alpha + \mu_R, \quad \text{where } \mu_R = \frac{1}{K}\sum_{j=1}^{K} \beta_j,
so that \mu_i = \bar\alpha + \bar\beta_i and \sum_i \bar\beta_i = 0.
Now one should be able to derive (too lazy to do this explicitly) a matrix \Sigma that is a function of \mu_\alpha, \sigma_\alpha, \sigma_\beta such that:
(\bar\alpha, \bar\beta_1, \ldots, \bar\beta_K, \mu_R)^T \sim \mathrm{MVN}\big((\mu_\alpha, 0, \ldots, 0)^T, \Sigma\big).
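I haven’t done the algebra, but here is a quick numerical sketch of what I mean under the definitions above: the prior on (\alpha, \beta) is Gaussian with diagonal covariance, and the new parametrization is a linear map A of it, so the induced prior is MVN with mean A m and covariance \Sigma = A D A^T.

```python
import numpy as np

K = 4
mu_alpha, sigma_alpha, sigma_beta = 1.0, 2.0, 0.5  # made-up hyperparameters

# Linear map A from (alpha, beta_1..beta_K) to (alpha_bar, beta_bar_1..beta_bar_K, mu_R):
#   alpha_bar = alpha + mean(beta),  beta_bar_i = beta_i - mean(beta),  mu_R = mean(beta)
A = np.zeros((K + 2, K + 1))
A[0, 0] = 1.0
A[0, 1:] = 1.0 / K                    # alpha_bar row
A[1:K + 1, 1:] = np.eye(K) - 1.0 / K  # beta_bar rows (centering matrix)
A[K + 1, 1:] = 1.0 / K                # mu_R row

# Prior in the original parametrization: mean m, diagonal covariance D
m = np.concatenate([[mu_alpha], np.zeros(K)])
D = np.diag([sigma_alpha**2] + [sigma_beta**2] * K)

# Induced prior on (alpha_bar, beta_bar, mu_R)
mean_new = A @ m  # = (mu_alpha, 0, ..., 0)
Sigma = A @ D @ A.T

# A couple of the "simple analytic formulas" the structure suggests:
assert np.isclose(Sigma[0, 0], sigma_alpha**2 + sigma_beta**2 / K)  # Var(alpha_bar)
assert np.isclose(Sigma[K + 1, K + 1], sigma_beta**2 / K)           # Var(mu_R)
assert np.allclose(Sigma[0, 1:K + 1], 0)                            # alpha_bar uncorrelated with beta_bar
print(np.round(Sigma, 3))
```

Note that this full \Sigma is singular (the \bar\beta rows sum to zero), so for actually conditioning on \bar\beta one would keep only K-1 of its components - that’s what I do below.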
Now \Sigma transforms any prior on the original model into a prior on the zero-centered model. Additionally, we note that \mu_R doesn’t enter the model beyond this prior, so we can actually leave it out of the model and use \Sigma to derive the distribution \pi(\mu_R|\bar\alpha, \bar\beta). This means we can sample \mu_R in gen.quants and use it to recover the original \alpha, \beta. Since \Sigma is going to have a lot of structure, this should all be amenable to simple analytic formulas. Is that what you had in mind?
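Continuing the numerical sketch (same made-up hyperparameters; the "fitted" values of \bar\alpha, \bar\beta are also made up), the gen.quants step would just be standard Gaussian conditioning on \Sigma, followed by undoing the reparametrization:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4
mu_alpha, sigma_alpha, sigma_beta = 1.0, 2.0, 0.5  # same made-up hyperparameters as above

# Joint prior over x = (alpha_bar, beta_bar_1..beta_bar_{K-1}, mu_R); the K-th component
# of beta_bar is dropped because the zero-sum constraint determines it.
A = np.zeros((K + 1, K + 1))
A[0, 0] = 1.0
A[0, 1:] = 1.0 / K
A[1:K, 1:] = (np.eye(K) - 1.0 / K)[:K - 1]
A[K, 1:] = 1.0 / K
m = np.concatenate([[mu_alpha], np.zeros(K)])
D = np.diag([sigma_alpha**2] + [sigma_beta**2] * K)
mean_new, Sigma = A @ m, A @ D @ A.T

# Pretend these are a single draw from a fit of the zero-centered model:
alpha_bar = 1.3
beta_bar = np.array([0.2, -0.1, 0.05, -0.15])  # sums to zero
x_rest = np.concatenate([[alpha_bar], beta_bar[:K - 1]])

# Gaussian conditioning: pi(mu_R | alpha_bar, beta_bar)
S11, S12, S22 = Sigma[:K, :K], Sigma[:K, K], Sigma[K, K]
w = np.linalg.solve(S11, S12)
cond_mean = mean_new[K] + w @ (x_rest - mean_new[:K])
cond_var = S22 - w @ S12

# The gen.quants step: draw mu_R, then undo the reparametrization
mu_R = rng.normal(cond_mean, np.sqrt(cond_var))
alpha = alpha_bar - mu_R
beta = beta_bar + mu_R

# With these definitions the conditional collapses to a univariate shrinkage on alpha_bar:
s2 = sigma_beta**2 / K
assert np.isclose(cond_mean, s2 / (sigma_alpha**2 + s2) * (alpha_bar - mu_alpha))
assert np.isclose(cond_var, s2 * sigma_alpha**2 / (sigma_alpha**2 + s2))
print(alpha, beta)
```

(So at least in this simple case the conditional only depends on \bar\alpha, which is the kind of simple analytic formula I was hoping for.)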
Once again the question remains: if this works, why isn’t it used widely (removing a parameter feels like an unquestionable win)? So I presume it is not so simple, and either I made an error or there are drawbacks I missed. Thanks for any hints.