Choosing weakly informative priors for population-level effects in a Poisson GLMM

I am using brms in R to fit a generalized linear mixed model using a Poisson distribution with log link.

The model takes count data that ranges from 0 to 3 as response variable, and two z-transformed (standardized) continuous predictors and their interaction as population-level (fixed) effects.

I wanted to use priors for the population-level effects that do not give importance to completely implausible values for the coefficients (I guess this can be called weakly informative priors?). Kind of the same aim advocated by Gabry et al. (2019), but without the data-visualization part.

For example, if this was a linear model, I would set the prior for all fixed effects as a normal distribution with mean = 0 and sd = 1.5, meaning that I am pretty (95%) sure that a one standard-deviation increase/decrease in the predictors will not be associated to more than a 3-unit increase/decrease in the response variable — after all, this is the entire range of my response variable!

I am having a hard time applying the same reasoning to a Poisson log model, because a one-unit increase in the predictor is associated to a multiplicative (not additive) change in the response variable. Following the line of thought above, absolutely any value would be plausible, from minus infinity (ln(0/3) = -\infty) to infinity (ln(3/0) = \infty).

My question is: how can I choose a reasonable prior in this case? Or do I — and, by extension, anyone fitting Poisson log models with a response variable that includes zeroes — have no choice but to set a flat prior?

I would do prior predictive checks. So set priors and sample only from the priors. Then plot them on the outcome scale to see that they are sane. Since we employ a \log() link things will be acting strange, so plotting helps out understanding the implications of your priors.

1 Like

Thanks for the reply, @torkar!

I have tried prior predictive checks with different combinations of priors as advocated by @jonah et al. and indeed they helped me find a range of plausible priors. It was particularly useful in showing that the default brms prior for the random effect was much too broad.

I guess what I am most interested in is justifying the choice of a particular distribution beyond the fact that it simulates plausible data. In this example…

… I would be able to articulate what I am assuming by a normal(0, 1.5) prior, and justify it. I am saying that, for a given outcome of 0, if I increase any predictor by one sd unit, the outcome will not be greater than 3 because that’s the range of the observed data. I believe even a future reader who is completely unfamiliar with Bayesian statistics would be able to understand (and agree or disagree with) this.

What I’m finding so hard about the Poisson log model is that I cannot phrase the consequence of assuming a range of coefficients in the same way. Say I decide to use a normal(0, 3) prior for the fixed effects — which yields reasonable values in prior predictive checks. exp(3) = 20.1, so that’s like saying that I’m 95% sure that a one sd increase in the predictors will not multiply/divide the outcome by more than 20.1 × 2 = 40.2 times. But multiplying 0.0001 by 40 is completely different from multiplying 2 by 40, so how can I justify this specific value a priori?

(Sorry if this is kind of off-topic — I’m realizing that maybe this has more to do with interpretation of Poisson coefficients than with choice of priors per se!)

Hmm, I remember a year or two ago @avehtari introduced me to the horror of setting priors in the log (Poisson) world. I guess he has some smart things to say about the above (I hear what you’re saying but Aki will be able to explain it nicely I believe :)

1 Like

I would agree with @torkar that plotting is one solution to this.
The crux of the non identity links is that stuff ceises to be linear, which leads to the problem you describe here:

The conclusion of that observation is, that coefficients (and their priors) in a model with a log/logit link (or any non-identity link that looses linearity) can only be understood in the context of the entire model.
I remember having a similar question but apparently I never asked it here myself. But I remember reading an answer to that question that was a from a book and stated the same thing.

When I worked on an analysis with a similar problem I had prior values for the mean of the data but not the effect sizes, so what I did was transform the mean values to the logit scale and then, starting from the default priors, iterate over the other priors until the pp_check plot matches my prior knowledge.
Ultimately, my intuition is on the outcome scale. I know that it is unreasonable for the outcome to be bigger then eg. 100. So I tune the priors until the pp_check shows a curve that matches this expectation.

An even better way seems to be the new adjustr package, though I haven’t tried this one myself: https://corymccartan.github.io/adjustr/index.html

Thanks for these nice tips, @scholz!

I completely agree with what you said about the coefficient priors only making sense in the context of the entire model. I think this is precisely what I am finding the most difficult: I need to have a starting point in the response variable to evaluate whether or not a potential multiplicative effect would make sense.

Just to check that I understood your approach correctly: what you suggest is that I look at potential coefficient values assuming the effect they would have on the average outcome?

If so, I believe this approach makes a lot of sense and I can definitely think of how I can apply it to my case. \overline{y} = 0.28 in my data. All my predictors are centered around the mean, so I think it makes sense that the intercept is close to this value. Being very (very) conservative, I would not expect \hat{y} to ever be more than 50. This means a maximum multiplicative effect on \overline{y} of 178, or 5.2 on the log scale. The opposite effect would be 0.28/178 = 0.001, but I’m less sure about this part because I find it plausible that \hat{y} is as close to zero as possible.

Following this line of thought, seems like normal(0, 5.2) would be a nice choice of prior for the coefficients. This is pretty close to the student_t(df, 0, 5)*, with 3 < df < 7, suggested by @andrewgelman et al. for logistic and log models here and here (although more informative, I guess, because normals are stricter).

(*They actually suggest student_t(df, 0, 2.5) for predictors standardized to have sd = .5, but mine are standardized at sd = 1)

Of course I’ll want to visualize prior predictive checks to see that the scale of the simulated outcome is reasonable, but if the justification above makes any sense, I’m already thrilled, because that’s what I was finding the hardest!

@torkar just a note to say that I accidentally found the post you’re talking about (I think): Use of `sample_prior='only'` in brms

I’m having the same problem with priors that I thought were fairly tight — certainly much tighter than what I’m used to seeing in my field — generating NAs in almost 100% of the pp_checks, and that’s how I came to find your post. Lots of useful stuff there!

2 Likes

Yes that’s the one! :)

1 Like

Maybe you are remembering this blog post by @anon75146577 https://statmodeling.stat.columbia.edu/2018/09/12/against-arianism-2-arianism-grande/

1 Like

Yeah, that blog post by @anon75146577 was a joy to read. But this link, where you explain things very nicely, together with @richard_mcelreath’s examples in his book, taught me to always plot priors no matter what:

@jocateme, @Solomon’s work on the 2nd edition of Richard’s book also has more examples on prior predictive checks:

5 Likes

Thank you both for such excellent suggestions!

The post by @anon75146577 is incredibly useful. This quote, in particular, is music to my ears:

Firstly, it is always good practice to model relative-risk rather than risk. That is to say that if we observe y_i disease cases in a population where we would ordinarily expect to see E_i disease cases, then we should model the counts as something like y_i\sim \text{Poisson}(E_i\lambda) rather than y_i\sim \text{Poisson}(\tilde{\lambda}). […] All of these numbers are assuming that E_i\approx 1 and need to be adjusted when the expected counts are very large or very small.

1 Like