Brms, priors for random-effect SDs, and non-centered parameterizations

jsocolar · August 30, 2023, 1:47pm

Here’s some Stan to play with

data {
  int<lower=0, upper=1> identity_scale; // if zero, prior is flat on logit scale
                                        // if one, flat on identity scale
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;     // binary outcomes
}

parameters {
  real mu;
}

transformed parameters {
  real p = inv_logit(mu);
}

model {
  y ~ bernoulli(p);
  if(identity_scale == 1){
    mu ~ logistic(0,1);  // this is the prior on mu that yields a flat prior on p; i.e. it is the Jacobian adjustment.
  }
}

library(cmdstanr)
mod <- cmdstan_model("logistic_scale.stan")

mod$sample(data = list(identity_scale = 0, N = 10, y = c(0, rep(1, 9))))
mod$sample(data = list(identity_scale = 1, N = 10, y = c(0, rep(1, 9))))
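
If you want to convince yourself of the comment in the model block without running Stan, here is a rough base R sketch. The variable names are mine and not part of the Stan program. It checks that the logistic(0, 1) density matches the Jacobian of inv_logit, and that logistic(0, 1) draws pushed through inv_logit look uniform on (0, 1).

```r
# Sketch in base R; names here are illustrative, not from the Stan program.
set.seed(1)

# 1. The Jacobian of p = inv_logit(mu) is dp/dmu = p * (1 - p).
#    It should match the logistic(0, 1) density at every mu.
mu_grid <- seq(-6, 6, by = 0.5)
jacobian <- plogis(mu_grid) * (1 - plogis(mu_grid))
max_abs_diff <- max(abs(jacobian - dlogis(mu_grid)))

# 2. Draw mu from logistic(0, 1) and transform to the probability scale.
#    If the implied prior on p is flat, the draws should look uniform.
p_draws <- plogis(rlogis(1e5))
mean_p <- mean(p_draws)                # a uniform(0, 1) has mean 0.5
frac_above_90 <- mean(p_draws > 0.9)   # a uniform(0, 1) puts mass 0.1 above 0.9

print(c(max_abs_diff = max_abs_diff, mean_p = mean_p, frac_above_90 = frac_above_90))
```

The second check works because plogis is the CDF of the logistic(0, 1) distribution, so this is just the probability integral transform.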

danielparthier · August 30, 2023, 2:27pm

I am also not sure that is true specifically for “Bayesians”. I think the aversion comes from a uniform prior being sold as something “unbiased”, “flat”, or “carrying less information”, which is definitely not the case across transforms or different models. Personally (and I say that as someone working in neuroscience, a biological field) I have a hard time buying into clear-cut borders of anything… So even from a non-Bayesian, “reality”/“logical”/“reasoning” perspective: if someone can explain with a very good reason why a specific prior is justifiable over another I might buy into it, but generally a hard border is also a hard sell.

I always forget where I heard the example of the gambling farmer (maybe McElreath): imagine you are a farmer betting your land on some outcome. How much of your land would you bet for or against something happening? In your case you would bet your whole land against the sd being 5.0000001 while still allowing for 4.9999999. I am not 100% convinced that anyone would be that bold :)

Maybe also just to clarify my previous statement about divergences with regard to the statement by @jd_c:
I think they shouldn’t drive your decision about what your prior should be, but they can tell you what is wrong with the model.

But I am really curious why one would go for a uniform prior to begin with. Is it the assumption that it is “less influential”, or something else? Because at some point it has to be sold to someone that the assumptions made are reasonable.

jsocolar · August 30, 2023, 2:53pm

Lol yeah. Unprincipled priors are the target of blistering critiques from avowed frequentists impugning applied Bayesian practice as they encounter it in the wild.

jd_c · August 30, 2023, 2:55pm

Here is another way to think about your uniform(0, 5) prior: why did you choose 5 as the upper bound? I reckon the reason is that you believe the sd is well below 5, and you cover all of the outcome space that seems plausible plus some added buffer to be sure you don’t restrict estimation of the sd. So it doesn’t matter to you if it is 4.9 or 5.1, because it’s all ‘buffer’. This tacitly admits that you do have an idea of a plausible upper boundary and then some buffer. Thus, in your mind, you actually have something like a half-normal(0, 1.5) prior. Why? Because you have an idea of the plausible upper boundary but you need a buffer to ensure no undue influence on estimation, which is like having probability density through the plausible values with regularization and a tail through the ‘buffer’. I would argue that this is the true mental construct needed to actually decide on a uniform prior in this scenario. But when you encode uniform(0, 5), you do not reflect the beliefs that were required for you to form the prior.
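
To put rough numbers on the ‘buffer’ idea, here is a small base R sketch comparing how much prior mass each prior puts near and beyond the border at 5. Treating the interval (4, 5) as the buffer is my assumption for illustration, and phalfnorm is a helper defined here, not a base R function.

```r
upper <- 5        # upper bound of the uniform prior discussed in the thread
hn_sd <- 1.5      # scale of the half-normal suggested above

# CDF of a half-normal(0, sd) for q >= 0 (helper defined for this sketch).
phalfnorm <- function(q, sd) 2 * pnorm(q, 0, sd) - 1

# Prior mass in the assumed "buffer" region (4, 5) under each prior.
buffer_unif <- punif(5, 0, upper) - punif(4, 0, upper)
buffer_hn   <- phalfnorm(5, hn_sd) - phalfnorm(4, hn_sd)

# Prior mass beyond the hard border at 5.
beyond_unif <- 1 - punif(5, 0, upper)
beyond_hn   <- 1 - phalfnorm(5, hn_sd)

print(c(buffer_unif = buffer_unif, buffer_hn = buffer_hn,
        beyond_unif = beyond_unif, beyond_hn = beyond_hn))
```

The uniform spends a fifth of its mass on the buffer and then drops to exactly zero, while the half-normal spends very little there but never rules anything out.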

blokeman · August 30, 2023, 8:14pm

You make a solid case.

I had a dataset (discussed earlier in this thread) where a U(0,5) prior on the group SD was much better than exponential or half normal, with fewer divergences but otherwise identical parameter estimates. And there’s indeed a certain appeal in the idea of proper uniform priors as a “bridge” between frequentist and Bayesian inference, with the posterior becoming a constant multiple of the likelihood. All you have to do is plot the posterior, and you’re essentially looking at the relevant regions of the likelihood.

jsocolar:

If the likelihood constrains the probability to be somewhere, say, between 0.7 and 0.99, then a flat prior on the logit scale will yield a posterior that is concentrated much more towards .99 than 0.7.
…
mu ~ logistic(0,1); // this is the prior on mu that yields a flat prior on p; i.e. it is the Jacobian adjustment.
…
mod$sample(data = list(identity_scale = 1, N = 10, y = c(0, rep(1, 9))))

Wow, this is subtle! Here’s me trying to illustrate your point using brms:

> library(brms)
> dat <- data.frame(y = rep(c(0,1), c(1,9)))
> logit <- brm(y ~ 0 + Intercept, prior = prior(logistic(0,1)), family = bernoulli("logit"), 
    backend = "cmdstanr", data = dat, cores = 4, refresh = 0, seed = 1337, iter = 1e4)
> logitnorm <- brm(y ~ 0 + Intercept, prior = prior(normal(0,2.5)), family = bernoulli("logit"), 
    backend = "cmdstanr", data = dat, cores = 4, refresh = 0, seed = 1337, iter = 1e4)
> post.logit <- as_draws_df(logit)$b_Intercept
> post.logitnorm <- as_draws_df(logitnorm)$b_Intercept
> plot(density(plogis(post.logit)), ylim = c(0,6))
> lines(density(plogis(post.logitnorm)), col = "red")

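[Figure: posterior densities of plogis(Intercept) on the probability scale; black = logistic(0,1) prior, red = normal(0,2.5) prior]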

It does indeed pull the posterior closer to 1 on the probability scale. Well, at least the posterior mode. This is what you were getting at, right? Fascinating nuances!

danielparthier · August 30, 2023, 10:31pm

I can see where you are coming from but you can also see it in a different way:
If we want to follow the frequentist route we would have to assume our priors to be U(-inf, inf). A very strange idea somehow. But let’s go with it. Most frequentists, as @jsocolar pointed out, would give us hell for using U(0,5), removing infinitely many possible outcomes and, even worse, claiming that there is something like an abrupt jump at one very specific point.
It is a bit like three people guessing how far they have to walk to find the ocean. The first says: “Maybe it’s behind me, maybe I’m in it, or maybe it is infinitely far away, and all three options are equally likely. So I will just walk at the same speed the whole time.” The second says: “I’m not in the ocean and it should be in front of me. It can’t be too far away because we are close to the coast. I will run for a bit and can slow down along the way before I’m exhausted.” The third says: “I think the ocean is between here and 5 km away. But it could be where I’m standing, at 2 km, or at 4.9 km. I will walk at the same speed the whole time.” The first one walked and walked until, after a long time, he reached the ocean. He also got lost on the way because he walked backwards for a bit… The second one ran fast and then reached the ocean. The last one walked and walked. Then he fell down an imaginary cliff and was gone. The moral of the story: don’t use hard borders; they make horrible gradients and you will fall down horrible cliffs. It is also difficult to explain why we think they are there.

Also, again, a good example of where simulations can come in handy, as @jsocolar and you have shown. :)

jsocolar · August 31, 2023, 1:31am

Despite what you’ll read in some references, this is at worst untrue, and at best not well defined. The maximum likelihood estimate (MLE) is parameterization invariant. The posterior-under-uniform-priors is not parameterization invariant because the prior picks up a Jacobian term from nonlinear reparameterization. This not only renders the idea that “Bayesian with flat priors recapitulates frequentist” false, it renders it waaaaaay false. How false? Well, for any statistical model that has a well-defined MLE (edit: and a likelihood that can be normalized), we can reparameterize to a representation of the model such that the MLE falls arbitrarily far outside any range of posterior quantiles (other than 0–1) obtained under flat priors! By reparameterizing in sufficiently creative ways, I can masquerade any prior I want as a “flat prior”.

This is definitely nonintuitive! If the posterior is just the normalized likelihood, then how can it badly miss the MLE? Well, under some parameterizations, the normalized likelihood will contort into an arbitrarily thin spike at the MLE that is very far away from all of the mass. You can always find a parameterization where a uniform prior is sheer lunacy.
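
A small numerical illustration of the invariance point, using the toy Bernoulli data from earlier in the thread (nine 1s and one 0). This is only a base R sketch with function names I made up: it maximizes the likelihood on both the probability and logit scales, then computes the posterior mean of p under a prior that is flat on each scale.

```r
# Toy data from the thread: 9 successes, 1 failure.
s <- 9
f <- 1

# Log-likelihood written in both parameterizations (vectorized).
log_lik_p  <- function(p)  s * log(p) + f * log(1 - p)
log_lik_mu <- function(mu) log_lik_p(plogis(mu))

# Maximize on either scale, then compare on the probability scale.
mle_p  <- optimize(log_lik_p,  interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
mle_mu <- optimize(log_lik_mu, interval = c(-10, 10),        maximum = TRUE)$maximum

# Posterior mean of p, integrating over mu, for a given log prior on mu.
post_mean_p <- function(log_prior_mu) {
  kernel <- function(mu) exp(log_lik_mu(mu) + log_prior_mu(mu))
  num <- integrate(function(mu) plogis(mu) * kernel(mu), -Inf, Inf)$value
  den <- integrate(kernel, -Inf, Inf)$value
  num / den
}

# Flat on the logit scale: log prior is constant in mu.
mean_flat_on_mu <- post_mean_p(function(mu) rep(0, length(mu)))
# Flat on the probability scale: logistic(0, 1) prior on mu.
mean_flat_on_p  <- post_mean_p(function(mu) dlogis(mu, log = TRUE))

print(c(mle_p = mle_p, mle_from_mu = plogis(mle_mu),
        mean_flat_on_mu = mean_flat_on_mu, mean_flat_on_p = mean_flat_on_p))
```

The two maximizers land on the same value of p, while the two “flat prior” posterior means do not agree with each other.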

If the sensitivity to reparameterization makes you squeamish, you’re not alone. Another person who felt squeamish about it was Sir Harold Jeffreys, who devised the so-called Jeffreys priors as the unique class of priors that are invariant under reparameterization. The problem is, Jeffreys priors are often terrible encapsulations of plausible domain knowledge under any parameterization, while at the same time they are not “uninformative” in any sense (including in the sense of recapitulating frequentist inference)!

If you are desperate to obtain frequentist inference via MCMC, then you might look into the technique known as data cloning (see https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1461-0248.2007.01047.x) which you can implement in Stan by raising the likelihood to a large power and then manipulating the posterior as described in the link. The alternative route, which I vigorously and wholeheartedly recommend, is the route of putting some thought into writing down well considered priors.
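
For intuition only, here is how I understand the data cloning idea on the toy Bernoulli data from this thread, in base R. To keep it self-contained I use the conjugate Beta posterior instead of MCMC, which is my simplification and not what the linked paper does. K is the power the likelihood is raised to, and the value 1000 is arbitrary. My reading of the method is that the posterior mean approaches the MLE and K times the posterior variance approaches the large-sample variance of the MLE.

```r
# Toy data: 9 successes, 1 failure.
s <- 9
f <- 1
n <- s + f
K <- 1000   # number of clones, i.e. the power applied to the likelihood (arbitrary choice)

# A flat prior on p times likelihood^K gives a Beta(K * s + 1, K * f + 1) posterior.
a <- K * s + 1
b <- K * f + 1
post_mean <- a / (a + b)
post_var  <- a * b / ((a + b)^2 * (a + b + 1))

# Frequentist quantities to compare against.
mle     <- s / n
mle_var <- mle * (1 - mle) / n   # usual large-sample variance of the MLE of p

# Rescale the cloned posterior variance by K.
scaled_var <- K * post_var

print(c(post_mean = post_mean, mle = mle, scaled_var = scaled_var, mle_var = mle_var))
```

In Stan the analogous move would be multiplying the log-likelihood contribution by K, but check the linked paper before relying on this sketch.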

The process of writing down a well considered prior is (at least in theory) parameterization invariant, because the prior you write down will be different depending on how the model is parameterized. In practice that’s not completely true because we almost always write down priors represented by a fairly limited set of probability density functions in whatever parameterization we are using, which effectively gives us access to different potential choices of prior distribution depending on the parameterization. Hopefully, we can consistently find parameterizations that enable us to easily write down priors that are pretty close to the information that we want to encode. This is also why prior sensitivity checks can be a very important part of robust Bayesian workflows.

blokeman · August 31, 2023, 4:51am

Any simple example of this anywhere?

jsocolar · August 31, 2023, 12:49pm

Take x_new = exp(exp(x)). Suppose the likelihood is Gaussian (when the model is parameterized by x), so the right tail decays as something like e^{-x^2}. But the prior on x induced by the flat prior on x_new is the Jacobian dx_new/dx = e^x e^{e^x}, which increases far faster than the likelihood decays.
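
Here is a rough grid sketch of the x_new = exp(exp(x)) example in base R. Two things are assumptions of mine: the likelihood is a unit-scale Gaussian centered at 0, and x is restricted to [-5, 5] so that the posterior can be normalized on a grid. It compares how much posterior mass sits below the MLE under a prior flat in x versus a prior flat in x_new.

```r
# Grid over x; the bounds [-5, 5] are an assumption made for this sketch.
x <- seq(-5, 5, length.out = 10001)

# Gaussian log-likelihood in x (unit scale, centered at 0), so the MLE is x = 0.
log_lik <- -x^2 / 2

# Log of the prior on x induced by a flat prior on x_new = exp(exp(x)):
# log |d x_new / d x| = x + exp(x).
log_prior_induced <- x + exp(x)

# Turn unnormalized log weights on the grid into probabilities.
normalize <- function(log_w) {
  w <- exp(log_w - max(log_w))
  w / sum(w)
}

post_flat_x     <- normalize(log_lik)
post_flat_x_new <- normalize(log_lik + log_prior_induced)

mle <- x[which.max(log_lik)]
mass_below_mle_flat_x     <- sum(post_flat_x[x <= mle])
mass_below_mle_flat_x_new <- sum(post_flat_x_new[x <= mle])

print(c(mle = mle, flat_x = mass_below_mle_flat_x, flat_x_new = mass_below_mle_flat_x_new))
```

Because quantiles are preserved by monotone transformations, the mass below the MLE is the same whether you look at it on the x scale or the x_new scale. Under the prior that is flat in x_new, essentially all of the mass piles up against the upper bound, far from the MLE.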

jsocolar · August 31, 2023, 1:07pm

More generally, if you give me any prior representable by a continuously differentiable PDF in some parameterization, I can reparameterize to a representation where it’s a flat prior by taking the reparameterized coordinates to be a function of the original coordinates whose Jacobian determinant is equal to the desired PDF.
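
In one dimension, a function whose derivative equals the desired PDF is just the CDF. Here is a base R sketch using the half-normal(0, 1.5) mentioned earlier in the thread as the target prior. The function names are made up for illustration. It checks that the Jacobian of the map matches the target density, and that a flat prior on the new coordinate maps back to the target prior.

```r
set.seed(2)
target_sd <- 1.5  # half-normal(0, 1.5), the example prior from earlier in the thread

target_pdf <- function(s) 2 * dnorm(s, 0, target_sd)        # density on s >= 0
to_new     <- function(s) 2 * pnorm(s, 0, target_sd) - 1    # u = CDF(s), maps [0, Inf) to [0, 1)
from_new   <- function(u) qnorm((u + 1) / 2, 0, target_sd)  # inverse map

# The Jacobian du/ds should equal the target density (central finite difference).
s_grid <- seq(0.1, 4, by = 0.1)
h <- 1e-5
jac <- (to_new(s_grid + h) - to_new(s_grid - h)) / (2 * h)
max_jac_err <- max(abs(jac - target_pdf(s_grid)))

# A flat prior on u, mapped back to s, should reproduce the half-normal.
s_draws <- from_new(runif(1e5))
emp_frac_above_3  <- mean(s_draws > 3)
true_frac_above_3 <- 1 - to_new(3)

print(c(max_jac_err = max_jac_err,
        emp_frac_above_3 = emp_frac_above_3,
        true_frac_above_3 = true_frac_above_3))
```

So “uniform(0, 1) on u” and “half-normal(0, 1.5) on s” are the same prior written in two coordinate systems.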
