Prior specification for beta distribution in brms

I am running some regression models in which I’m predicting both mu (‘mean’) and phi (‘precision’) of the outcome variable, which is modelled as being beta distributed.

I would like to explore the influence of using some more informed priors than the defaults, and have a good sense of what I might like the prior distributions to be on the direct scale of mu and phi. However, the default model uses a logit link function for mu, and a log link function for phi. Hence, I think my priors would also need to be specified on that scale (?), and I’m not sure how e.g., I might convert my prior on the identity scale to the logit scale, given that the prior would have multiple values to specify its distribution and they would all have to be somehow transformed. This would also be complicated by adding deviations from the intercepts or reference groups for different predictors. Does anyone have suggestions on how to do this?

Alternatively, I believe there is a way to tell brms to model the parameters using a different link function, and perhaps I could just use the identity link and specify the priors directly. If I do that, I am wondering if there is some inefficiency in it or some drawbacks, as I imagine the default link functions were chosen for a reason.

I would really appreciate any suggestions, especially regarding changing the link function as an option, which seems like a decent option!

The thing we recommend people do now is think about their priors in terms of the thing that they’re predicting.

Have a look at: https://www.youtube.com/watch?v=ZRpo41l02KQ&t=2694

I think this is specifically motivated by setting priors in a Bernoulli regression.

We can work a quick example for a model that looks like:

\alpha \sim \text{normal}(0, 1)\\ p = \text{logit}^{-1}(\alpha)\\ y \sim \text{bernoulli}(p)

So this implies a prior on p (the transformed parameter) of:

inverse_logit = function(x) {
  1 / (1 + exp(-x))
}

hist(inverse_logit(rnorm(100, 0, 1)))

And then if we made the prior standard deviation 10, that implies:

hist(inverse_logit(rnorm(100, 0, 10)))
3 Likes

Thanks for your response @bbbales2

Those are great ways of visualising the priors and their transformed version, but I’m still struggling a bit in how to pass this information to brms/to understand whether brms wants the priors on the identity link or the link that it displays in the summary table. So in the example I give of predicting mu/mean of a beta distributed outcome, mu is presented in the summary tables of the regression using a logit link, so I am imagining that behind the scenes the estimation of the parameters is also happening using this link function. Let’s say I want to set the prior for the intercept on the identity scale as a beta distribution that looks like this:

hist(rbeta(1000, 3, 3))

So it is slightly peaking around .5 but allows quite a lot of other values. I know that I can use the set_prior function in brms to specify the prior. But given that it is being modelled with a logit link, do I need to pass it a prior that is already transformed, or does it take priors on the identity link and then convert them itself for the purposes of the parameter estimation?

If I do need to convert it, can I just put a transformation function in front of it in the set_prior text, like you have with these histograms? Alternatively, is it possible simply to tell brms not to use the link functions when it is making estimations, and is there a downside to doing this?

Again, thanks for you quick response.

The priors go on the parameters before the transformation.

The issue is the linear parts of the linear models (a x + b and whatnot) are unconstrained. You need to pass them through a link function to constrain them.

Any chance you can post your model? Or something like your model with a little R code fake-data thing that makes it run? It’ll be easier to illustrate points this way.

Sure, a little long but here I post a mock data frame and then an example model, with the brms regression also coded. I basically made it here so that Group B generally scores higher than Group A, and also that Condition 2 is higher than Condition 1.

I am predicting score from Group, Condition, and their interaction. I also have a distributional component where I predict phi just based on Group.

mockdata <- data.frame(score = c(0.5, 0.42, 0.58, 0.52, 0.53, 0.54,
0.5, 0.40, 0.53, 0.525, 0.503, 0.5,
0.48, 0.32, 0.48, 0.62, 0.43, 0.58,
0.5, 0.42, 0.58, 0.52, 0.53, 0.46,
0.5, 0.40, 0.53, 0.525, 0.503, 0.5,

                             0.55, 0.50, 0.58, 0.53, 0.70, 0.5,
                             0.58, 0.48, 0.50, 0.52, 0.513, 0.6,
                             0.56, 0.41, 0.5, 0.65, 0.57, 0.7,
                             0.55, 0.50, 0.63, 0.58, 0.6, 0.61,
                             0.7, 0.50, 0.50, 0.55, 0.59, 0.61,
                             
                             0.7, 0.55, 0.65, 0.8, 0.73, 0.6,
                             0.6, 0.55, 0.8, 0.63, 0.5, 0.504,
                             0.6, 0.5, 0.8, 0.75, 0.58, 0.5,
                             0.59, 0.52, 0.72, 0.80, 0.5, 0.5,
                             0.59, 0.60, 0.53, 0.625, 0.75, 0.8,
                             
                             0.7, 0.7, 0.7, 0.79, 0.85, 0.8,
                             0.62, 0.5, 0.89, 0.75, 0.6, 0.59,
                             0.9, 0.7, 0.92, 0.7, 0.75, 0.635,
                             0.8, 0.67, 0.9, 0.9, 0.55, 0.45,
                             0.79, 0.75, 0.5, 0.7, 0.89, 0.89
                             ),
                         group = c("A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "A", "A", "A", "A", "A", "A",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B",
                                   "B", "B", "B", "B", "B", "B"),
                         condition = c("Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond1", "Cond1", "Cond1", "Cond1", "Cond1", "Cond1",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2",
                                       "Cond2", "Cond2", "Cond2", "Cond2", "Cond2", "Cond2"))

mockmodel <- (bf(score ~ 1 + group + condition + group:condition)) +
lf(phi ~ 0 + group)

library(brms)

mockbetareg <- brm(mockmodel,
data = mockdata,
family = Beta(),
control = list(adapt_delta = 0.8, max_treedepth = 10),
chains = 4,
cores = 4,
iter = 8000,
warmup = 2000)
mockbetareg

get_prior(mockmodel,
data = mockdata,
family = Beta(),
control = list(adapt_delta = 0.8, max_treedepth = 10),
chains = 4,
cores = 4,
# inits = 0,
iter = 8000,
warmup = 2000)

In essence then, my question is simply how to place some meaningful priors on the components of the model. I am not new to working with priors in general - I’ve done it succesfully and I think quite nicely with normally distributed data. But I don’t want to dive in on this sort of data and make a mistake/not understand what I’m doing - so your help is much appreciated!

Let me know if anything else is helpful!

@bbbales2 I just wondered if you had managed to take a look at the dataset/model I presented?

If it is possible to simply specify the priors directly for mu and phi and not have to prespecify them on the link function then I think I can get on okay. I would be interested in what you meant by this though:

The issue is the linear parts of the linear models ( a x + b and whatnot) are unconstrained. You need to pass them through a link function to constrain them.

One final thing I was wondering that might relate specifically to the type of model/data above. I am not saying these are the priors I intend to use, but just to give a hypothetical possibility. If we imagine I specify a prior on the intercept that gives some plausibility to all mu values between 0 and 1. If I then place a prior on the group effect centered on 0 but allowing for values as much as +/- 0.5 from that intercept, would it mean that the model will a priori be considering values totally outside the possible beta distribution as plausible (e.g., Intercept + Group = -0.4, or +1.4, even though the predicted value can actually only span between 0 and 1), or does it automatically constrain itself to possible beta values? Perhaps it might then give more plausibility to extreme values close to 0 or 1 if the intercept was already very high or low.

Of course, if you have any other thoughts on the prior specification process for this sort of model I would be very happy to hear it!

Sorry about that! Please feel free to ping me whenever. Especially if it’s been more than a day.

So something handy you can do to understand what’s going on is make_stancode. Something like this:

make_stancode(mockmodel,
              data = mockdata,
              family = Beta())

Then you can see exactly the Stan code that is being used:

// generated with brms 2.11.4
functions {
}
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> K_phi;  // number of population-level effects
  matrix[N, K_phi] X_phi;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  vector[K_phi] b_phi;  // population-level effects
}
transformed parameters {
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  // initialize linear predictor term
  vector[N] phi = X_phi * b_phi;
  for (n in 1:N) {
    // apply the inverse link function
    phi[n] = exp(phi[n]);
  }
  for (n in 1:N) {
    // apply the inverse link function
    mu[n] = inv_logit(mu[n]);
  }
  // priors including all constants
  target += student_t_lpdf(Intercept | 3, 0, 10);
  // likelihood including all constants
  if (!prior_only) {
    for (n in 1:N) {
      target += beta_lpdf(Y[n] | mu[n] * phi[n], (1 - mu[n]) * phi[n]);
    }
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

So like there’s no prior on b. We can add one:

make_stancode(mockmodel,
              data = mockdata,
              family = Beta(),
              prior = set_prior("normal(0, 1.1)", class = "b"))

And now (cutting a lot of the output) we get:

...
// priors including all constants
target += normal_lpdf(b | 0, 1.1); // Here's our prior!
target += student_t_lpdf(Intercept | 3, 0, 10);
...

So does looking at the model code help you with your questions? If your question is where stuff shows up, you can just change the priors and see how that changes the model code.

Thanks very much @bbbales2 - yes seeing the model this way is really helpful.

I’m thinking that to answer my question about whether the priors might make the model entertain implausible values I can also run a prior predictive check and the implied mu and phi values, so I will try to do that tomorrow.

Thanks for your help!

Yeah! And if you can’t figure out how to get stuff out with brms, you can also do stuff like:

data = make_standata(mockmodel,
              data = mockdata,
              family = Beta(),
              prior = set_prior("normal(0, 1.1)", class = "b"))

To get a list of data that can be used to run the Stan model outside of brms (where you can do your experiments or whatnot). It is useful.

@bbbales2 are you sure about this? I have just run a ‘prior predictive check’, and specified e.g., that the mean of the beta distribution should be around 0.5 by passing it a prior on that parameter of beta(3, 3) - which gives a mu of 0.5. However, if I look at the summary table from brms it seems clear that this prior is being taken as the prior for the values on the logit transformed data, as it displays as exactly .5 as the estimate in the summary values, which have been logit transformed. Additionally, if you run the marginal_effects on the model, it plots as a mu value of around 0.625, which is exactly the inverse logit of 0.5.

It therefore seems like the priors are in fact being used to determine the values of the parameters after they have been put through a link function, rather than being the values on the identity scale that I am trying to set the priors for.

Hmm, I am suspicious. Gimme code and I’ll have a look!

Sure, and thank you! Please note that I had to put some priors on the ‘phi’ parameter that are are certainly not what is desirable, but just as a means of making the thing run. You do get some divergent transitions, largely from the phi parameter I think, but the mu numbers lined up so closely with what I would expect if it was using the priors on the transformed parameters that I thought that must be what is going on, even if the model is not running perfectly.

I am going to try attaching a script.beta prior example.R (5.2 KB)

So looking at the Stan code, the priors are applied here:

  // priors including all constants
  target += normal_lpdf(b[1] | 0, 0.25);
  target += normal_lpdf(b[2] | 0, 0.25);
  target += normal_lpdf(b[3] | 0, 0.125);
  target += beta_lpdf(Intercept | 3, 3);
  target += cauchy_lpdf(b_phi[1] | 0, 5);
  target += cauchy_lpdf(b_phi[2] | 0, 5);

And all these things go towards computing mu and phi:

  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  // initialize linear predictor term
  vector[N] phi = X_phi * b_phi;

Which then we pass through the link functions and then send them on to the lpdf:

  for (n in 1:N) {
    // apply the inverse link function
    phi[n] = exp(phi[n]);
  }
  for (n in 1:N) {
    // apply the inverse link function
    mu[n] = inv_logit(mu[n]);
  }
  // likelihood including all constants
  if (!prior_only) {
    for (n in 1:N) {
      target += beta_lpdf(Y[n] | mu[n] * phi[n], (1 - mu[n]) * phi[n]);
    }
  }

So there is a beta(3, 3) prior on the Intercept, but that is still being passed through a link function before it goes to the likelihood term.

Sorry, I know I’m being very dense here, but what does this mean in terms of whether the prior as I specify it is being understood as the values for the actual beta distribution mean value (which with this prior should be 0.5 directly on beta scale from 0 to 1), or as values that will ultimately be understood as referring to the logit transformed beta scale that is displayed in the summary table and in the posterior distribution?

I am not very good at reading the stan code, so though I can certainly see that the priors are being used, I don’t understand exactly how. Given that the transformed intercept comes out with a mean of 0.5, and the other values come out with intercepts and variation around them exactly matching the priors, it seems that the model is not taking the priors to be on the true scale, but on the transformed scale with which the summaries and beta-distribution are represented. Is that right?

Nah nah it’s okay these things are basically always confusing.

So if we have:

parameters {
  real theta;
}
model {
  real p = inv_logit(theta);

  y ~ bernoulli(p);
}

I think the lingo is p is on the probability scale and theta is on the logit scale.

So the priors are specified on the logit scale, not the probability scale. That line up with what you’re saying? Looking again over this I might have been misunderstanding what you were saying.

I was thinking about the probability scale as the transformed scale, but that isn’t a standard thing or anything.

Yes that is what I meant! For me it is very simple to think of what the priors should be on the probability scale that will be seen in the actual data, and also that I will use to present/plot the data. However, if stan takes the priors as being for the logit scale data, I am struggling to work out how to tell it what I want. Do you have any idea on how I might go about tackling this?

Haha, alright my bad.

Well maybe talk out a little more about what you want your prior to do. I think the general advice is do prior predictives. If you want stuff centered around 50%, that corresponds to 0 on the logit scale.

For my proper analysis I would probably specify some slightly different priors as the effects are anticipated to be very large for some parameters, but for the purposes of the example data I gave I would be quite satisfied with the current priors if they were on the probability scale. So what I am wondering is whether there is a function or something I can do in R, or something I can pass into stan, that would give it logit scale priors that would correspond exactly to these priors when it is transformed back to the probability scale. Do you see what I’m saying?

No, there’s no tooling for doing this.

Well I’ll push back on this. The only thing in the model that is on the probability scale is mu.

mu is a function of the parameters, but there’s not a 1-to-1 transform or anything. It’s not that mu started on the probability scale, got transformed to the logit scale, got a prior, and went back. It started on the logit scale with a prior and then was transformed in a way that probably isn’t invertable to mu, and mu is on the probability scale.

Intercept is constrained to be 0 to 1, but it’s not on the probability scale that matters for evaluating the beta likelihood.

1 Like

Thanks for your response and apologies for the delay - I was stranded with the weather in London and no computer.

Yes I do think I see what you’re saying, but ultimately all the mu-related parameters can be transformed into ‘mu’ and span between 0-1 when transformed in that way, and those are the mu values that make sense to think about for me intuitively (I can think very easily in terms of what changes each parameter might bring about on the ‘probability’ scale but much less on the logit scale). However, if there is no way to pass that info in then the answer to my question is just that I will have to work out the priors I want using the logit scale.