Brms, priors for random-effect SDs, and non-centered parameterizations

I have a grouping factor where >50% of the clusters have n = 1, and overall between-cluster variation seems rather small. This makes my models pretty divergence-prone, although tighter priors and/or a higher adapt_delta can eliminate the divergences. The response is categorical, modeled with a logit link.

Question 1:
McElreath (2020: 420–6) shows that when using Stan with rethinking::ulam(), non-centered parameterization can greatly improve sampling when dealing with a random-effects factor with a very low SD.

I use brms, and I’ve been racking my brain all day trying to figure out how to implement non-centered parameterization for the random intercepts using the package’s non-linear syntax option. But now I ran into old threads suggesting that brms uses non-centered parameterization for random effects by default. Is this still the case? If so, then I suppose I don’t have to spare another thought for that mathematical trick, which after all is rather challenging for a non-statistician to wrap his head around.

Question 2:
What’s wrong with using a proper uniform prior on the random-intercept SD? I understand this is generally not recommended. But I did a little test, fitting the same simple-ish hierarchical model to a binary subset of the data 120 times, deliberately keeping adapt_delta at just 0.85 and putting either an exponential(2), a normal(0, 1) with lb = 0, or a uniform(0, 10) with lb = 0, ub = 10 prior on the SD. All three options yielded posterior means in the same neighborhood, but the uniform prior produced divergences slightly less often than the others (45.8% of the runs, as opposed to 55.8% for the half-normal and 56.7% for the exponential).
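For concreteness, the priors were set roughly like this in brms (a simplified sketch; the formula and the data frame d are stand-ins, not the actual model):

library(brms)

# exponential(2) prior on the random-intercept SD
p_exp <- prior(exponential(2), class = "sd")
# half-normal: normal(0, 1) with the lower bound at 0
p_hnorm <- prior(normal(0, 1), class = "sd", lb = 0)
# proper uniform(0, 10) with explicit bounds
p_unif <- prior(uniform(0, 10), class = "sd", lb = 0, ub = 10)

fit <- brm(y ~ x1 + (1 | id), data = d, family = bernoulli(),
           prior = p_exp,   # or p_hnorm, or p_unif
           control = list(adapt_delta = 0.85))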

There’s also another thing I like about the uniform prior, namely the fact that it has no mode. When visualizing the posterior, this makes it easier to determine whether the resulting mode is due to the prior or the data. If the data alone place the mode at zero, I take this to mean that the data contain little to no information on between-group variability, so that a frequentist model would estimate it at zero and complain about a singularity. This is exactly what lme4 does when fit to the same data, but I can’t and won’t use lme4 for the analysis proper because it doesn’t support categorical responses.

Is there some pitfall that I’m missing about the use of a U(0,10) prior in this situation? Or are such priors considered an abomination, whose use might risk discrediting the whole study?

brms uses the non-centered parameterization by default. You can check this by taking your fitted model object and using the function stancode() to extract the Stan code that is working behind the scenes. I’m not on a computer with R at the moment, so I can’t remember Paul’s exact notation, but for a varying-intercepts model you will see something like this: in the parameters block, a vector z_1 of standardized varying intercepts (actually offsets from the intercept) and sd_1, the standard deviation of the varying intercepts; in the transformed parameters block, r_1 = z_1 * sd_1, which gives the actual varying intercepts; and in the model block, target += std_normal_lpdf(z_1).
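Something along these lines will show it (make_stancode() generates the code without fitting anything; the formula, data frame d, and variable names are just placeholders):

library(brms)

# Stan code brms would generate for a varying-intercepts logit model
make_stancode(y ~ x1 + (1 | id), data = d, family = bernoulli())

# or, for an already fitted brms model object:
stancode(fit)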

Same neighborhood for the estimate of the sd of the varying intercepts? Note that normal(0, 1) can be a pretty wide prior on the logit scale depending on your data.
The goal is regularization, or partial pooling of information. Try some prior predictive checks and see if an sd of 10 for your varying intercepts looks very realistic ;-) As your prior becomes wider, you allow for less regularization and move toward the no-pooling model (as if you put your group in the model as a ‘fixed effect’). As your prior becomes narrower, you allow for more regularization and move toward the complete-pooling model (as if you left your group out of the model altogether).
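As a quick standalone sketch of such a check (plain R rather than brms): draw varying intercepts with an sd of 10 and look at what that implies on the probability scale.

set.seed(1)
group_logits <- rnorm(1e4, mean = 0, sd = 10)   # varying intercepts with sd = 10
hist(plogis(group_logits), breaks = 50, probability = TRUE,
     main = "Implied group-level probabilities, sd = 10")
# most of the mass piles up near 0 and 1, i.e. clusters that behave almost
# deterministically -- rarely a realistic prior belief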

1 Like

It’s almost never an accurate encapsulation of plausible prior beliefs about the sd. You think it could just as easily be 0.0000001 or 9.99999999 but you’re absolutely certain it isn’t 10.000001?

You sure that’s not sampling variation? In any case, I don’t think this is a functionally important difference in the nice-ness of the posterior.

If the data plus a weakly regularizing prior (like (half)normal(0, 3)) place the mode at zero, this is also a clear indication that the data contain little to no suggestion of between-group variability. Note that this isn’t the same thing as containing no information: the data might tell you with confidence that the variability is quite small.

With all that said, there’s no gaping pitfall here, and you won’t discredit your whole study this way. Still, it’s generally good practice to pick priors that offer a bit of regularization and that don’t encode patently weird a priori beliefs (like the plausibility of the sd dropping off a cliff at 10).

1 Like

Thanks, that’s great news! 1 out of 2 hereby solved :)

Yes. The exponential prior yields an estimated SD (posterior mean) at 0.183, half-standard-normal at 0.229, and U(0,10) at 0.240, excluding all iterations with divergent transitions. The ML estimate (mode of the log likelihood as estimated by lme4) is 0.

In this particular case though, the Bayesian models do less regularization than their frequentist counterpart (which estimates the SD at 0 and equates the “varying” intercepts with the population-level one). Both methods agree that the SD is small, but the Bayesian models are less categorical about it, presumably because they use means instead of modes.

It is obviously absurd when formulated in those terms. But my main interest lies in sampling efficiency and parameter estimates which are informed by data rather than speculation. I have no solid prior information on what the group SD might be — this is uncharted territory. I’m well versed in the topic, and the best I have is speculation. Admittedly 10 sounds too extreme (implying that some clusters may prefer one option categorically), but I like round numbers. They seem less arbitrary/manipulative.

That said, I just ran 120 iterations with U(0,5), and it did even better than U(0,10). The average posterior mean was still 0.240, but divergences now occurred in only 22% of the iterations. Even if the slight edge of U(0,10) earlier was coincidence, this larger edge can hardly be so.

Now I’m even more tempted to go uniform. :/

I tried that (a half-normal(0, 3)) for 120 iterations too. As might have been expected, that prior underperformed all the previous candidates, with 76 divergent runs out of 120 (63%).

To my mind, if the data tell you (confidently) that the SD is small but not zero, then the posterior will also have a clear peak at some small non-zero value. But if the mode tightly hugs zero (the way an exponential density does), then I’d be inclined to conclude that there’s no solid evidence (relative to the prior) of any non-zero variability at all. I’ve done some mock-data simulations with a small but non-zero (0.2) logit-scale SD in a huge (8k) sample and an exponential(2) prior, and that’s what I witnessed. (I can share the code if someone wants it.)
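The gist of the simulation looks roughly like this (a simplified sketch rather than the exact script; the group structure is made up for illustration):

set.seed(123)
n_groups <- 400
n_per_group <- 20                          # 400 * 20 = 8000 observations
true_sd <- 0.2                             # true logit-scale SD
group_eff <- rnorm(n_groups, 0, true_sd)   # cluster-level intercept offsets
d_sim <- data.frame(id = rep(seq_len(n_groups), each = n_per_group))
d_sim$y <- rbinom(nrow(d_sim), 1, plogis(group_eff[d_sim$id]))

library(brms)
fit_sim <- brm(y ~ 1 + (1 | id), data = d_sim, family = bernoulli(),
               prior = prior(exponential(2), class = "sd"))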

With all this said, if uniform priors really are as unpopular as these gentlemen’s comments suggest, then I still have some serious thinking to do.

But you are not speculating with a prior like this (or any prior). What you are actually doing when setting a prior is adding this term to the log posterior. Think of it as adding pieces of information to the model. So what you have added is target += uniform_lpdf(sd_1 | 0, 10) - 1 * log_diff_exp(uniform_lcdf(10 | 0, 10), uniform_lcdf(0 | 0, 10)); to the log posterior. So instead of speculating, you have added the information that the sd has an upper bound of 10 and is equally plausible between 0 and 10. Why didn’t you use 5? or 20?

1 Like

Exactly :)

If the idea of some regularization via the prior is anathema to you, then you probably need to consider frequentist approaches.

(posterior) Iterations or sampling runs?

Two things. First, tightly constraining an effect to be not much larger than zero (without excluding zero) is still quite informative about the size of the effect. Second, the standard deviation will never be literally zero (or any other particular exact value chosen a priori), and so constraining the effect to be very small and constraining it to be very small but nonzero are the same thing.

The fact that you are observing post-warmup divergences means that the sampler is encountering difficult geometries post-warmup. If your observed variation in the number of divergences is meaningful (and I’m still not totally convinced that it is, but that’s beside the point here), the implication is that the different priors are altering the posterior geometry to make it “nicer”, often by “trimming off” the difficult parts. But if your goal is to let the data do all the talking and avoid influence of the prior, then the very fact that you are “trimming off” regions that the sampler otherwise actually visits post-warmup means that you are definitely injecting information into the inference via the prior.

5 seems to work even better, as noted in my previous post.

120 sampling runs, each with two chains with 2k post-warmup iterations each, for a total of 4k post-warmup draws per run. Here are the experimental results so far:

> results

    exp2.diverge hnorm1.diverge unif10.diverge unif05.diverge hnorm3.diverge exp2.sigma hnorm1.sigma unif10.sigma unif05.sigma hnorm3.sigma
1              0              2             16              0              0  0.1754645    0.2233841    0.2387672    0.2423030    0.2389320
2              0              1              0              0              0  0.1889790    0.2385830    0.2391384    0.2342946    0.2393784
...
120           16              0              0              0              0  0.1845497    0.2208282    0.2393018    0.2331550    0.2438354

> apply(sapply(results[,1:5], ">", 0), 2, sum)

  exp2.diverge hnorm1.diverge unif10.diverge unif05.diverge hnorm3.diverge 
            68             67             55             27             76 
# unif05.diverge: A winner is you. Big difference too.
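For completeness, the loop behind these counts looks roughly like this (a simplified sketch rather than the exact script; d stands in for the binary subset):

library(brms)

base_fit <- brm(y ~ 1 + x1 + x2 + x3 + x4 + (1 | id), data = d,
                family = bernoulli(),
                prior = prior(uniform(0, 5), class = "sd", lb = 0, ub = 5),
                chains = 2, iter = 3000, warmup = 1000,
                control = list(adapt_delta = 0.85), refresh = 0)

n_runs <- 120
n_divergent <- numeric(n_runs)
for (i in seq_len(n_runs)) {
  run <- update(base_fit, recompile = FALSE, seed = i,
                chains = 2, iter = 3000, warmup = 1000,
                control = list(adapt_delta = 0.85), refresh = 0)
  np <- nuts_params(run)
  n_divergent[i] <- sum(np$Value[np$Parameter == "divergent__"])
}
mean(n_divergent > 0)   # proportion of runs with at least one divergence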

That makes sense, yet they all yield similarly shaped posteriors piling up at zero, and even the means are within 0.05 of each other. But the accursed uniform priors have fewer divergences than the others, which would seem to be a huge blessing from a computational perspective.

Do any of the priors under present consideration constrain the effect to be very small? That’s not what I want to do at all; it would count as leading the witness. I want to let the data talk without computational problems, and computational problems rule frequentist approaches out. I chose these options because I thought they were quite liberal and likely to yield results similar to a maximum-likelihood-based analysis. But I didn’t feel the need to accommodate SDs in the double digits; this is the logit scale after all, and we’re not modeling the occurrence of a rare disease.

I’m not sure I have a lot more to add to the discussion, other than that your uniform prior is, to my mind, a bizarre piece of information (i.e. assumption) to add to your model: that your information about the sd is dramatically overdispersed and yet has a hard upper boundary.

As far as divergent transitions go, FWIW what you report here goes against my personal experience with every multilevel model I have fit: narrower priors with more probability density toward zero, like half-normal(0, 1) in this case, have always reduced divergent transitions compared with something wider. That’s not a vast amount of experience, but I am surprised and curious what model you are fitting. Since what seem like decent priors apparently make the geometry worse, I wonder whether you should look at your model specification?

1 Like

No, I was referring to scenarios where the likelihood constrains the effect to be small.

1 Like

I’ve had the same experiences up until now. I wanted half-normal to win so I could just use it and avoid having to justify going against the cookbook. But alas!

It’s a simple binary multilevel one: y ~ (1|id) + x1 + x2 + x3 + x4, family = bernoulli. If you or anyone else wants to see for themselves that the thing with the priors is indeed happening, here’s the anonymized dataset:

d_anon.txt (20.2 KB)

In my limited experience, they may be the same thing with regard to the posterior mean, but not the posterior mode. Below, the “weak likelihood” curves are from the experiments at hand. Whether the prior is exponential(2), half-normal(0,1) or uniform(0,5), the posterior mode lands at zero. This coincides with the frequentist ML estimate, which is also zero. By contrast, the green curve is from a similar model fit to mock data designed to have a small but unambiguous random-effect SD of around 0.2 (the MLE is 0.19), based on n = 8000. Even with the exponential(2) prior, which has a fairly strong mode at 0, the posterior mode is pulled well clear of 0, even though the posterior mean (0.172) is even smaller than in the weak data. If I saw a posterior like the green one, I would infer that the data provide strong evidence for weak but non-zero variability. Faced with a posterior like the three others, and assuming the priors were not stupid, I would infer that the data simply have very little information on what the magnitude of the variability (if any) might be.

I ran the models on your data. I only ran them once, but with the same specification, y ~ 1 + x1 + x2 + x3 + x4 + (1|id), normal(0, 2.5) priors on the fixed effects, and a half-normal(0, 1) prior on the sd of the varying intercepts, I got 8 divergent transitions. FYI, by increasing adapt_delta to 0.95 the model ran quickly and had zero divergent transitions. The uniform(0, 5) prior had a single divergent transition with adapt_delta at its default value.
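In brms that is simply (a sketch, assuming the anonymized data have been read into a data frame called d_anon):

library(brms)

fit <- brm(y ~ 1 + x1 + x2 + x3 + x4 + (1 | id), data = d_anon,
           family = bernoulli(),
           prior = c(prior(normal(0, 2.5), class = "b"),
                     prior(normal(0, 1), class = "sd")),
           control = list(adapt_delta = 0.95))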

It is no surprise that you encounter divergent transitions with this data. Not only do more than 50% of the groups have only a single observation, as you mentioned, but a single group also holds 58% of the observations in the entire dataset. It’s possible that you need a centered parameterization for the groups with more data and a non-centered parameterization for those that have little data, as described by Michael Betancourt here https://betanalpha.github.io/assets/case_studies/hierarchical_modeling.html#423_Unbalanced_Likelihood_Functions, with an implementation in Stan.

2 Likes

Thanks a lot for the input so far, guys.

I’ve been working with another similar dataset during the past few days, and I’m seeing a similar pattern again, just in a more extreme form: divergences galore with exponential(2), a few with normal(0,1), and almost none with U(0,5). This time I’m more leery of the uniform prior though, because the estimates differ more radically than in the previous scenario. Exp(2) lands at 0.70 (with the most divergences), half-normal converges at about 1.0, while the uniform prior arrives at ~1.25.

Question: is brms’s prior(normal(0,1), lb = 0) an actual half-normal density or just one-half of a standard normal? That is to say, does its prior density correspond to the right half of a standard normal, or is it the right half times two?

It just parses to a Stan statement that yields a truncated normal distribution. Given that the only probability space is the positive one, this amounts to the same thing: with a mean of 0 you end up with the right half of the normal density multiplied by two. That factor of two is just because a probability density has to integrate to 1, so if you cut the distribution in half you have to account for that; otherwise it is not a valid density.

Regarding the U(0,5) prior, it is really hard to justify something like that for an error/sd term. As mentioned before, stopping at 5 is actually an extremely strong assumption, because you define values above 5 not as unlikely but as impossible, as opposed to values below 5, which you say are all quite likely. At least in my head that makes it very weird and impossible to justify. Again, a uniform prior on a continuous scale is (I would argue) one of the strongest priors possible.
The nice thing about Stan’s divergences is that they usually hint at problems with the model. Most likely there are some parameters with too-weak priors (possibly hierarchical) that leave the error term too much wiggle room to compensate, and then things can easily explode. So now (I guess), by restricting the error term, you are effectively reining in some overly wide priors elsewhere. How does the pairs plot look?

1 Like

It leads to the same inference on the parameters, but the normalized version is required when working with Bayes factors. Presumably for this reason, brms normalizes bounded priors, which you can see via make_stancode(): in the case of a single bound, for example, a *_lcdf or *_lccdf term is subtracted from the prior to normalize it.
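A small numerical sanity check in R of what that normalization amounts to for a half-normal(0, 1) with a single lower bound at 0:

# the right half of a standard normal integrates to 0.5 on its own...
integrate(dnorm, lower = 0, upper = Inf)$value
# ...subtracting normal_lccdf(0 | 0, 1) = log(0.5) on the log scale divides
# the density by 0.5, i.e. multiplies it by 2, restoring a proper density:
integrate(function(x) dnorm(x) / pnorm(0, lower.tail = FALSE),
          lower = 0, upper = Inf)$value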

2 Likes

Ah good to know. I have to admit that I haven’t checked brms code in a while.

Could you clarify this a bit? It runs counter to my own (limited) understanding of the interplay between group-level \sigma's and population-level \beta's. I’ve been taught that when you remove population-level parameters from a multilevel model, those population-level effects (however small) will instead be attributed to any groups with which the discarded covariates were correlated. In other words, the group-level SD acts as a kind of dispersion parameter that captures variability originating from unmeasured covariates. And, crucially, when you tighten the priors of population-level parameters, I would expect the same thing to happen. To my mind, tightening the prior of some population-level parameter pulls its estimate toward 0, which is just a less extreme version of forcing that parameter to zero by removing it altogether. In both cases, I would expect the estimated group-level SD to increase, not decrease.

Clearly I’m missing something again, and I’d love to know what it is.

You are right in your understanding that the variance would then be redistributed. In your case, though (correct me if I misunderstood you), you have a lot of groups with n = 1, so the information on the group variance may be very limited. In this scenario it could help to increase the pooling a bit; @jd_c talked about this previously. One possible consequence of a wide group sd, and hence hardly any regularisation, is that the variance goes into your main error term instead, or that you get weird geometries and divergences.

If I understood you correctly you use a logit link to model your data. We can simulate some parameter with mean 0 and sd 1 and add a level on top reflecting our variance in the groups.

hist(boot::inv.logit(x = rnorm(n = 1e4, mean = 0, sd = 1)+rnorm(n = 1e4, mean = 0, sd = 1)), main = "1 sd", probability = T)

hist(boot::inv.logit(x = rnorm(n = 1e4, mean = 0, sd = 1)+rnorm(n = 1e4, mean = 0, sd = 10)), main = "10 sd", probability = T)

This is obviously an extreme example, but by increasing the potential variability in the second term (level) we get into less stable territory on the response scale. With a tighter group sd you can remove some of the potential extremes. Limiting the sd with a tight uniform prior will abolish these extremes as well, but it is a massive assumption that is tricky to justify (as others have mentioned before).
But it is also possible that I’m clearly missing something about your model.
Have you done some prior predictive checks?

Also just for good measure here the same with a uniform distribution:


Very drastic prior if you ask me.
Edit: Last image corrected

1 Like

This is not the first time on this forum that I encounter the criticism about normal priors for logits implying U-shaped priors for probabilities. So far, my response has been to quote Alan Agresti:

This seems intuitively to be rather informative, but in fact such priors have little influence,
with the posterior looking much like the likelihood function.
(Categorical Data Analysis 3rd ed, 2013, p. 258)

…which makes quite a bit of sense to me. When the responses are Bernoulli trials, I don’t see much of a practical difference between a U-shaped and a uniform prior distribution of p, as long as the U-shape is symmetrical. Here’s an ad-hoc example I made up:

> 1.25 -> Sigma
> hist(p.sd1 <- plogis(rnorm(1e4, 0, Sigma)), main = paste0("P with a N(0,",Sigma,") prior on the logit"))                                            
> hist(p.sdUni <- plogis(rnorm(1e4,  0, Sigma+runif(1e4, 0, 5))), main = paste0("P with a N(0, ",Sigma," + U(0,5)) prior on the logit"))
> hist(p.sdExp <- plogis(rnorm(1e4,  0, Sigma+rexp(1e4, 2))), main = paste0("P with a N(0, ",Sigma," + Exp(2)) prior on the logit"))

[Figures: PlusU05, PlusExp2]

Whether U-shaped or uniform, it seems to me that both distributions of p yield practically identical distributions for the Bernoulli response, like so:

> hist(rbinom(1e4, 1, prob = p.sdUni), main = paste0("Binary outcomes with a N(0, ",Sigma," + U(0,5)) prior on the logit"), probability = T)
> hist(rbinom(1e4, 1, prob = p.sdExp), main = paste0("Binary outcomes with a N(0, ",Sigma," + Exp(2)) prior on the logit"), probability = T)

[Figures: PlusU05_Ysimul, PlusExp2_Ysimul]
I have very little experience with prior predictive checks, but isn’t this pretty much exactly what they do, i.e. simulate responses from the prior distribution? If so, then I don’t see why the uniform prior on the random-effect SD is any worse than the exponential one, in practice.
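For what it’s worth, brms can apparently do this natively by sampling from the prior alone, something like the following (a sketch; the formula and data frame d are placeholders):

library(brms)

fit_prior <- brm(y ~ 1 + (1 | id), data = d, family = bernoulli(),
                 prior = prior(uniform(0, 5), class = "sd", lb = 0, ub = 5),
                 sample_prior = "only")
pp_check(fit_prior, type = "bars")   # outcomes implied by the prior alone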

It has become very clear though that many Bayesians feel a strong antipathy for the hard boundaries imposed by uniform priors, including proper ones.

I don’t think this has anything to do with being “Bayesian.” I think it has more to do with a sensible justification of the information that one is adding to their model. What if instead of the prior, we were talking about imposing hard upper and lower boundaries on the coefficient parameters of your regression model in the likelihood? Or, what if instead, in the data collection process you threw away every observation where the response=1 and had a certain combination of covariates? Would these practices be justifiable? What if they magically lowered divergent transitions? The prior isn’t a tuning knob for the purpose of greasing your machine (although it can help). It’s just another piece of information of a certain kind (the choice and specification of the likelihood, and the data collected, are information too, but different kinds) that you are adding to your model. When you add target += uniform_lpdf(sd_1 | 0, 5) - 1 * log_diff_exp(uniform_lcdf(5 | 0, 5), uniform_lcdf(0 | 0, 5)) to the log posterior, you are adding the information that the sd has equal probability all the way to a hard boundary of 5. How can you sensibly justify your choice of 5, 4.99999, 3.213, or anything else? Why are you so sure that 4.99999999999999999999 is plausible but not 5.0? Just as one would need some strong, rational justification for hard boundaries on the coefficients in their likelihood, or when putting hard boundaries on the data that they decide to discard, they need some sensible justification for a hard boundary on their prior. I can’t see that this has anything to do with being Bayesian (other than that we are discussing Bayesian models).

4 Likes

The problem is that it simply is not true in general that the influence of these priors is unimportant. Am I disagreeing with Agresti? I don’t know for sure without being able to see the quote in context, but I really doubt that Agresti means to claim otherwise. For some examples where this matters to inference, see https://www.frontiersin.org/articles/10.3389/fevo.2019.00501/full

What is true is that when you have a whole lot of data, it gets harder and harder to accidentally write down a prior that is strong enough to really matter. But to confirm that you are in this “likelihood dominates” regime requires actually conducting prior sensitivity checks for the quantities upon which you wish to base your inference.

You are right that for Bernoulli responses you won’t see any difference in prior predictive distributions as long as the prior is symmetric around 0.5 on the probability scale. That just means that the prior predictive distribution is a very weak check to understand the influence of the prior on the inference across this class of models. If the likelihood constrains the probability to be somewhere, say, between 0.7 and 0.99, then a flat prior on the logit scale will yield a posterior that is concentrated much more towards .99 than 0.7.
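To see this on the probability scale (a quick sketch; by change of variables, a flat prior on logit(p) implies a prior density on p proportional to 1/(p(1-p))):

# implied (unnormalized) prior density on p under a flat prior on logit(p)
curve(1 / (x * (1 - x)), from = 0.7, to = 0.99,
      xlab = "p", ylab = "implied prior density (unnormalized)")
# over (0.7, 0.99) this puts far more weight near 0.99 than near 0.7,
# which is what tilts the posterior in that direction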

There is a larger point here. I’m not sure why you are so interested in using uniform priors, but I cannot envision any possible reason that doesn’t boil down to the uniform prior has some allegedly desirable influence on the posterior. But if choosing to use a uniform prior has any substantial influence on the posterior compared to using a prior that is actually consistent with plausible domain expertise, then you cannot logically argue that the likelihood is dominating and the prior doesn’t matter.

3 Likes