I have a grouping factor where >50% of the clusters have n = 1, and the overall between-cluster variation seems rather small. This makes my models pretty divergence-prone, although tighter priors and/or higher adapt_delta settings can eliminate the divergences. The response is categorical, using a logit link.
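To be concrete, here is roughly what the current setup looks like (the variable names below are placeholders, not the actual ones):

```r
# Minimal sketch with placeholder names (response, predictor, cluster, d):
# this is roughly how I keep the divergences at bay at the moment.
library(brms)

fit <- brm(
  response ~ predictor + (1 | cluster),
  data    = d,
  family  = categorical(link = "logit"),
  control = list(adapt_delta = 0.99)  # up from the default of 0.8
)
```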
Question 1:
McElreath (2020: 420–6) shows that, when using Stan via rethinking::ulam(), a non-centered parameterization can greatly improve sampling for a random-effects factor with a very small SD. I use brms, and I’ve been racking my brain all day trying to figure out how to implement a non-centered parameterization for the random intercepts via the package’s non-linear syntax. But I’ve now run into old threads suggesting that brms uses the non-centered parameterization for random effects by default. Is this still the case? If so, I suppose I don’t have to spare another thought for that mathematical trick, which is, after all, rather challenging for a non-statistician to wrap his head around.
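In case it matters, this is how I’ve been trying to check it myself: inspecting the Stan code that brms generates and looking for standardized group-level effects (a z_1 parameter scaled by sd_1), which I take to be the signature of a non-centered parameterization. Placeholder names again:

```r
# Inspect the Stan code brms would generate for the model (placeholder names).
# If the group-level intercepts appear as standardized effects z_1 that are
# multiplied by sd_1, I read that as a non-centered parameterization.
library(brms)

code <- make_stancode(
  response ~ predictor + (1 | cluster),
  data   = d,
  family = categorical(link = "logit")
)
cat(code)
```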
Question 2:
What’s wrong with using a proper uniform prior on the random-intercept SD? I understand this is generally not recommended. But I ran a little test, fitting the same simple-ish hierarchical model to a binary subset of the data 120 times, deliberately keeping adapt_delta at just 0.85 and using either an exponential(2), a half-normal (normal(0, 1) with lb = 0), or a uniform(0, 10) (lb = 0, ub = 10) prior on the SD. All three options yielded posterior means in the same neighborhood, but the uniform prior produced divergences slightly less often than the others (45.8% of the runs, as opposed to 55.8% for the half-normal and 56.7% for the exponential).
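For reference, this is roughly how I specified the three priors and counted divergences (placeholder names again; as far as I understand, brms bounds sd priors at zero on its own, so the lb = 0 may be redundant):

```r
# The three SD priors compared in the test (binary subset, hence bernoulli).
library(brms)

p_exp   <- set_prior("exponential(2)", class = "sd")
p_hnorm <- set_prior("normal(0, 1)",   class = "sd", lb = 0)
p_unif  <- set_prior("uniform(0, 10)", class = "sd", lb = 0, ub = 10)

fit_unif <- brm(
  response ~ predictor + (1 | cluster),
  data    = d_binary,
  family  = bernoulli(link = "logit"),
  prior   = p_unif,
  control = list(adapt_delta = 0.85)
)

# Count divergent transitions for this fit.
np <- nuts_params(fit_unif)
sum(subset(np, Parameter == "divergent__")$Value)
```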
There’s also another thing I like about the uniform prior: it has no mode. When visualizing the posterior, this makes it easier to determine whether the posterior mode comes from the prior or from the data. If the data alone place the mode at zero, I take this to mean that the data contain little to no information about between-group variability, so that a frequentist model would estimate it as zero and complain about a singular fit. This is exactly what lme4 does when fitted to the same data, but I can’t and won’t use lme4 for the actual analysis because it doesn’t support categorical responses.
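For completeness, this is the kind of frequentist check I mean (binary subset again, placeholder names):

```r
# With essentially no between-cluster information, the random-intercept SD
# is estimated at (or near) zero and lme4 flags the fit as singular.
library(lme4)

fit_freq <- glmer(
  response ~ predictor + (1 | cluster),
  data   = d_binary,
  family = binomial(link = "logit")
)
VarCorr(fit_freq)      # random-intercept SD estimated at ~0
isSingular(fit_freq)   # TRUE when the variance estimate sits on the boundary
```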
Is there some pitfall that I’m missing about the use of a U(0,10) prior in this situation? Or are such priors considered an abomination, whose use might risk discrediting the whole study?