Priors for a novice

I’m trying to move beyond default priors in brms, and since I’m a novice, I’ve done a lot of reading over the last few weeks.

In general, I’ve got the impression that there are many arguments against using very vague and especially flat priors (I found this one very informative: https://betanalpha.github.io/assets/case_studies/weakly_informative_shapes.html). And yet, when I look at published articles using Bayes in my area (education and psychology), most authors argue that they did not have much prior information about the parameters, so they used very vague priors such as N(0, 10000).

My question is whether the actual response scales that we use in the study do not already provide a lot of information that we can use. For instance, my response variables can only take values between 0 and 4, because they represent scores on a Likert-type scale. So my understanding is that a regression coefficient for a given predictor is probably close to 0 and cannot be larger than 4 or smaller than -4. Or am I misunderstanding this? So based on this, I would think that a prior of the form N(0, 4) should be a useful prior? Would this be considered a very informative prior?

PS: I’ve posted a lot of questions during the last weeks, but I’m afraid there is still no end to it, since I’m trying to get familiar with Bayesian analysis, and brms more specifically, with this forum (and of course the web) as my only help… Hence, many thanks for all your help, especially Paul’s!

Hi Pia!

Generally, I’d say you are on the right track. When I started out learning Bayesian stats and Stan, the prior was fairly mysterious to me. The more I worked on real applications, the more comfortable I became applying reasonable priors – and the more I realized just how ridiculous those seemingly uninformative priors often are.

And yet, when I look at published articles using Bayes in my area (education and psychology), most authors argue that they did not have much prior information about the parameters, so they used very vague priors such as N(0, 10000).

I think people a) are too lazy to think about their prior expectations about model parameters, b) don’t realize that they have quite a lot of expertise (domain knowledge) with which to formulate reasonable priors, c) fear that others might deem their analysis too “subjective” – so better be as vague as possible, d) lack the ability to translate their intuitions into a prior distribution, and so on… While writing this list I realized that there are a ton of potential reasons not to use reasonable priors. For me it was mainly d) in the beginning.

My question is whether the actual response scales that we use in the study do not already provide a lot of information that we can use.

This is a really good start. You might also want to check out this paper. Generally, you’d want to think about the response variable’s scale, but also about the scale of the predictor variable: changing the units of the predictor variable should also change your prior belief about its regression coefficient – at least if your prior is fairly informative.
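
As a toy illustration (with made-up data): rescaling a predictor rescales its coefficient, so a prior of fixed width is much more informative in one unit system than in another:

set.seed(1)
x_m <- runif(100, 1.5, 2.0)            # a predictor measured in meters
y <- 2 + 3 * x_m + rnorm(100, 0, 0.5)
coef(lm(y ~ x_m))["x_m"]               # roughly 3
x_cm <- x_m * 100                      # the same predictor in centimeters
coef(lm(y ~ x_cm))["x_cm"]             # roughly 0.03 – a sensible prior must shrink accordingly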

So based on this, I would think that a prior of the form N(0, 4) should be a useful prior? Would this be considered a very informative prior?

I think it is definitely much more useful than a N(0, 10000) prior. I still don’t think that it is very informative – at least not for a regression coefficient. With N(0, 4) you imply that you would not be that surprised to see a regression coefficient of 5 or greater:

> pnorm(5, 0, 4, lower.tail = F)
[1] 0.1056498

That means you assume there is a roughly 10% chance that a one-unit change in that particular predictor variable has a positive “effect” spanning basically the whole range of your response variable. And you give the same a priori chance to a negative effect of that same predictor. Well, that could be true if the units of the predictor are very small – or if it really is a “this could change everything” type of predictor. But it’s not really informative, I’d say.

Also, you could just run rnorm(1, 0, 4) a bunch of times and ask yourself “Would I be surprised if I saw this number in the regression output table? As a mean/median, or lower/upper CI?” Or just do hist(rnorm(1000, 0, 4)). Now, this is all a bit silly for the Normal, but I think it is quite useful if you are dealing with non-normal priors (think of priors on the scale of your response variable…).
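
For example, a quick sanity check of this kind in base R (looking at quantiles instead of single draws is my own variation on the idea):

set.seed(42)
draws <- rnorm(1000, 0, 4)
quantile(draws, c(0.025, 0.5, 0.975))  # would these surprise you as median and CI bounds?
hist(draws, breaks = 30)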

Now, the really proper way of doing all this is via “prior predictive checks”. For this, you might want to check out this paper. There are a lot of functions in bayesplot and brms that can help you develop a workflow for choosing reasonable priors.
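
A minimal sketch of what this can look like in brms – the formula and data frame d are placeholders, and sample_prior = "only" (discussed further down this thread) makes brms ignore the likelihood and sample from the priors alone:

library(brms)

# Draw from the priors only, ignoring the data
fit_prior <- brm(
  y ~ x,
  data = d,  # assumed data frame with columns y and x
  prior = set_prior("normal(0, 1)", class = "b"),
  sample_prior = "only"
)

# Prior predictive check: do simulated outcomes respect the response scale?
pp_check(fit_prior, ndraws = 100)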

PS: I’ve posted a lot of questions during the last weeks, but I’m afraid there is still no end to it, since I’m trying to get familiar with Bayesian analysis, and brms more specifically, with this forum (and of course the web) as my only help… Hence, many thanks for all your help, especially Paul’s!

I always recommend Statistical Rethinking by @richard_mcelreath. Also check out his lectures on YouTube.

Hope this helps.

Hi Max,

This helps a lot! Many thanks for your useful advice. I think reason c) for using relatively vague priors is very much on my mind. Even though I know that it is basically impossible to have a regression coefficient of 5, I would be hesitant to use anything more precise because others might critique me for it. But then I probably just need to be able to justify it well enough and I should be OK. Thanks also for pointing out the resources, I will check them out over the next few days.

I hope it’s OK if I follow up on this later on in case something else comes up.

Tbh, using a N(0, 4) or a N(0, 10) prior will probably not make a huge difference in the posterior distribution (I’m coming from social science myself…). I think the more you think about and work with priors, the more confident you will become, and the better you’ll be able to meet such criticism with good answers.

I can only agree with what @Max_Mantei said and add one thing: Plot your priors.

@richard_mcelreath doesn’t do it that much in his first edition, but in the second edition, I believe he puts a larger emphasis on this. Here I’ve taken an example from his book and used both map2stan and brms to say something about the priors used (the source can be found here).

I would strongly recommend you do something similar in your case: you want to rule out absurd values, but still allow some extreme values.

Anytime you’re reading a write-up of a Bayesian analysis that mentions subjectivity in a negative light, or retreats to “noninformative” as if that were some ideal, you should take the rest of the analysis with a grain of salt.
Priors in applied fields are almost always too diffuse, often by orders of magnitude. One can take this pessimistically and be sad about the current state of the field or one can take it a bit more optimistically and recognize the opportunity for improving things. I personally oscillate between the two.

What we tried to argue in https://arxiv.org/abs/1708.07487, which @Max_Mantei posted above, is that in order to determine the overall utility of your prior you need to visualize not just the prior itself but its consequences in the context of your entire model. Moreover, in a complex model these consequences are often correlated across multiple parameters, which means that you cannot just look at prior densities individually. You have to consider their collective behavior.

To demonstrate what I mean, consider a Poisson regression with a log link for, say, the number of units sold as a function of heterogeneous intercepts for varying consumer groups and marketing campaigns,

y \sim \mathrm{Poisson}(\lambda), \quad \log \lambda = \alpha_{0} + \alpha_{cons}[consumer] + \alpha_{market}[market].

What kind of priors do we want for \alpha_{cons} and \alpha_{market}? Well, what kind of heterogeneity might we expect in each effect?

Let’s say that a 10% effect is reasonable in both – in other words variations between 0.9 * (baseline intensity) and 1.1 * (baseline intensity) are reasonable. In that case the scale for the priors should be \log(1.1) \approx 0.1, which could be encoded in a prior like

\alpha_{cons} \sim \mathcal{N}(0, 0.1)!

Now, 10% effects are huge in most applied settings, yet the induced prior scale of 0.1 is way, way smaller than even what people who espouse informative priors recommend!

But that’s just one effect. What happens when we add multiple effects together? The variance of the log intensity grows! In fact, the standard deviation grows with the square root of the number of effects we add together. So if

\alpha_{cons} \sim \mathcal{N}(0, 0.1), \quad \alpha_{market} \sim \mathcal{N}(0, 0.1)

then

\log \lambda = \alpha_{cons} + \alpha_{market} \sim \mathcal{N}(0, 0.1 \cdot \sqrt{2}).
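
A quick numerical check of this scaling:

a1 <- rnorm(1e5, 0, 0.1)
a2 <- rnorm(1e5, 0, 0.1)
sd(a1 + a2)  # approximately 0.1 * sqrt(2) = 0.141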

This might not seem particularly important, but if we’re adding many effects together then this inflation of the variance of the log intensity can lead to priors that are far too diffuse. More intuitively, the more independent effects you have, the more your prior will allow at least some of them to take large values, which causes larger intensities. One thing that we are increasingly recognizing is how important the behavior of the joint prior is, and how carefully independent priors have to be used in high-dimensional problems.

Anyways, besides trying to work through all of this stuff analytically, what can you do to identify poorly chosen priors? You can work through the prior predictive distribution and its intermediaries. In this example you can sample parameter configurations from the joint prior and then start by examining each independently. Then you can work through your model, computing \lambda = \exp(\alpha_{cons} + \alpha_{market}) for each sample, and examine the distribution of values. If your prior admits too many extreme intensities then you know you have to reevaluate your choices! Finally, you can go all the way through your observational model and sample data. Simulated data from the prior predictive distribution is often much easier for qualitative experts to reason about, and it provides a useful way to elicit domain expertise even from people who are reticent to admit that they actually know anything!
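
Here is a minimal sketch of that simulation in R; the \mathcal{N}(0, 1) prior for the baseline \alpha_{0} is an assumption on my part, while the \mathcal{N}(0, 0.1) scales for the effects follow the text:

set.seed(1)
n_sims <- 10000

alpha0       <- rnorm(n_sims, 0, 1)    # assumed prior for the baseline
alpha_cons   <- rnorm(n_sims, 0, 0.1)  # consumer-group effect
alpha_market <- rnorm(n_sims, 0, 0.1)  # marketing-campaign effect

# Intermediary: implied intensity for one consumer/campaign combination
lambda <- exp(alpha0 + alpha_cons + alpha_market)
quantile(lambda, c(0.01, 0.5, 0.99))   # does the prior admit extreme intensities?

# All the way through the observational model: simulated counts
y_sim <- rpois(n_sims, lambda)
hist(y_sim, breaks = 50)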

These steps are common to many GLMs, but in general which distributions you examine will depend on the specific context of your bespoke model. The more experience you have, the more comfortable you will become, although one is always learning new techniques within this general strategy.

For more see https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html, or https://arxiv.org/abs/1904.12765 for a more applied take.

Agree with everything in this thread about prior predictive simulation. Really essential in my experience. 2nd edition of my book (and lectures) contains many examples. You can access it all here: https://github.com/rmcelreath/statrethinking_winter2019

Thank you all. This gives me a lot to work through and think about!

Hi all,

I now have a couple of additional questions and would be grateful if someone could help.

First, I think I have a relatively good grasp of specifying priors for the regression coefficients now, but I struggle with the priors for the intercept. How do you approach those?

Second, I have one ordinal variable, for which I use the cumulative family with a probit link. Do I have to specify the regression coefficient prior for an ordinal predictor in a different way, and does it matter whether only the predictor, only the response, or both are ordinal?

Third, some of the resources you provided stress that the joint prior, rather than the individual ones, is what matters. How do I know what the joint prior is, and how can I sample from it in brms (or, more generally, what do I do with it)?

Thank you

Hello again,

Sorry for being impatient, but I’d like to ask another question before the previous ones have been addressed. I have a question about priors on the standard deviations of group-level (‘random’) effects. According to set_prior: Prior Definitions for 'brms' Models in brms: Bayesian Regression Models using 'Stan':

“These parameters are restricted to be non-negative and, by default, have a half Student-t prior with 3 degrees of freedom and a scale parameter that depends on the standard deviation of the response after applying the link function. Minimally, the scale parameter is 10.”

Is there a reason why the minimum scale parameter is 10? Do I just reduce this in order to have a more specific prior, such as using the actual sd of the response?
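
For what it’s worth, here is a quick look at what that default implies (taking absolute values of scaled t draws to simulate a half-Student-t):

sd_draws <- abs(10 * rt(10000, df = 3))  # draws from a half-Student-t(3, 0, 10)
quantile(sd_draws, c(0.5, 0.9, 0.99))    # easily exceeds an outcome scale of 0 to 4
hist(sd_draws, breaks = 100)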

Thanks.

Note that the parameters often labeled “intercepts” in the output of an ordered regression model are in fact the cut points, which are used to map the continuous latent variable onto the observed response categories. Therefore, not everything that applies to the “typical intercept” necessarily also applies to the “intercepts” in ordered regression models.
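
A small sketch of that mapping, with made-up cut points for a five-category (0 to 4) response under a probit link:

thresholds <- c(-1.5, -0.5, 0.5, 1.5)  # 4 cut points -> 5 categories (0-4)
latent <- rnorm(10000)                  # standard-normal latent variable (probit)
y <- findInterval(latent, thresholds)   # observed category, 0 to 4
table(y) / length(y)                    # implied category probabilities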

Modeling ordinal outcomes and modeling ordinal predictors are somewhat independent issues.
However, if you are interested in the associations between latent variables, packages like lavaan (and its Bayesian extension blavaan) could be of interest.

About joint priors: I think this is a reference to prior predictive checks. The idea here is to check whether the model (which includes the priors) makes predictions that are reasonably similar to the observed data, before the model parameters are estimated (one way to do this is to use the priors_only option in brms).

Hi, I believe this has changed in brms recently and @paul.buerkner now uses sample_prior = "only" if one wants to draw only from the priors to do prior predictive simulations.

thanks for pointing this out!

Thanks, you two, that is very useful. I was not aware of the sample_prior = "only" option, so that’s great info.

Any ideas about the prior for the sd of group-level effects? Thanks.
Edit: I have just run a model using the default priors for the sd of group-level effects and for sigma among the family-specific parameters (student_t(3, 0, 10)), and the posterior estimates and est. errors are absurd! Both my outcomes are on a scale of 0 to 4, yet I get estimates for sd and sigma of 10 and above. Could that be because the priors are too broad?

Did you get these estimates when estimating the posterior distribution or with sample_prior = "only"?

That’s with sample_prior = "only".

EDIT: Oh right! OK, I think I get it now. So the estimates do not take the actual data into account but are based only on the prior? That makes sense given the name. I had understood before that this would only matter for the predictive checks, rather than also influencing the model estimates. So I would have to run the model twice, once with only the prior and once without that option, and then I can plot both and compare?

Slightly off the question topic, but regarding your walkthrough of the height regression:

Suppose you assumed heights were Gamma(\mu, k) distributed. Instead of using the sigma to compute uncertainty, would you combine your intercept estimate with the shape estimate to compute the variance around your \mu estimate?

In general, when using exponential families, do you compute the first two moments of your posterior distribution using the Population & Family level effects? Apologies if this is a vague question.

Sorry to say, I’ve never tried it out myself. But if you do, please put up an example online where you try out different approaches so we all can learn something from it :)

The standard approach is to reparameterize the Gamma family of probability density functions, \mathcal{G}(\alpha, \beta), in terms of a mean parameter, \mu = \alpha / \beta, and a dispersion parameter, \phi = 1 / \beta. You then fit a linear model for the log of the mean parameter,

\log \mu = \alpha + \beta \cdot x + \ldots

and the standard deviation is given by \sqrt{ \phi \cdot \mu}.
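
A quick numerical check of this reparameterization, with made-up values for \mu and \phi:

mu  <- 170   # assumed mean
phi <- 0.05  # assumed dispersion

x <- rgamma(100000, shape = mu / phi, rate = 1 / phi)
mean(x)          # approximately mu
sd(x)            # approximately sqrt(phi * mu)
sqrt(phi * mu)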

The same strategy applies to any of the members of the exponential family defined on constrained spaces – reparameterize in terms of a mean parameter and build your linear model on that new parameter.

Months later, I am still working on some of my priors and have difficulties understanding those in a cumulative probit model, which I hope some of you can help me with.

I have this model: as.ordered(y) ~ 1 + x + mo(z)
with family = cumulative(link = "probit")

As I understand, these are the priors I need:

  • thresholds (class = Intercept), which have a student_t(3, 0, 10) default prior in brms
  • regression coefficients (class = b) for x and mo(z) – that is, for a continuous and a monotonic predictor – which I think have flat default priors in brms
  • simplex parameters (class = simo) for mo(z), which have a Dirichlet default prior with α = 1.

This is my understanding of these priors:

  • threshold coefficients are z-scores which can be used to calculate the probability of being in a given category or a higher one.
  • a Dirichlet prior with α = 1 means that, a priori, moving from one category of predictor z to the next is expected to have the same effect on y on average, no matter whether you move from category 1 to 2, or 2 to 3, or 3 to 4, etc.
  • regression coefficients can be interpreted as the difference in z-score associated with each one-unit difference in the predictor variable. So if the coefficient for x is .15, this would mean that a one-unit increase in x is associated with an increase of .15 standard deviations in (latent) y? Both my continuous and my monotonic predictor are on a scale from 0 to 4. Would it thus make sense to use the same prior, or is there a difference between priors for continuous and monotonic effects? And does a prior of N(0, 1) sound reasonable, similar to when using a standardised outcome?

Hope someone can tell me whether my understanding is correct and give some advice regarding priors for the regression coefficients. Thanks!
