How to specify a flat improper prior for intercepts

Windows 7, brms version 2.16.1.

I’m experimenting with different priors in binary logistic modeling of simple mock data in order to learn how different priors affect results. One relevant set of priors to experiment with is a set of flat priors for every single parameter, which I expect to yield results exactly equal to a frequentist model.

But while “specifying” flat improper priors for betas is easy (brm does it by default if you don’t specify anything else), I don’t know how to specify a flat improper prior for the Intercept:

N <- 1000
x1 <- rbinom(N, 1, 0.5)
x2 <- sample(c(rep(0, round(.97*N)), rep(1, N-round(.97*N))))
x3 <- sample(c(rep(1, round(.985*N)), rep(0, N-round(.985*N))))
linpred <- -0.5 +x1 -x2 +x3
y <- rbinom(N, 1, plogis(linpred))
d <- data.frame(y, x1, x2, x3)
(flatpriors <- get_prior(y ~ x1 + x2 + x3, family = bernoulli, data = d))
# ^ This shows that the betas have flat priors while the Intercept has a student_t prior.
flatpriors$prior[which(flatpriors$class == "Intercept")] <- "(flat)"
flatprior.mod <- brm(y ~ x1 + x2 + x3, family = bernoulli, data = d, prior = flatpriors)

Error: The prior '(flat)' is invalid.

How do I explicitly specify flat improper priors?

First, note that “flat” priors are not a well-defined mathematical concept, and they can be informative in a bad way: they generally encode some fairly extreme beliefs about potential effect sizes.

To answer your question though, here is an example of what an attempt at “flat” priors might look like in the context of brms.

library(brms)

# Specify a simple gaussian model with improper 'flat' priors
bad_priors_example <- bf(
  y ~ x + z,
  center = FALSE,
  family = gaussian(link = "identity")
)

# Specify (bad) Priors for the model
bad_priors <-
  prior(normal(0, 10000), class = "b", coef = "Intercept") +
  prior(normal(0, 10000), class = "b") +
  prior(normal(0, 10000), class = "sigma")

# Fit the model
bad_priors_example_fit <- brm(
  formula = bad_priors_example,
  prior = bad_priors,
  data = df,
  cores = 4,
  chains = 4,
  iter = 2000,
  warmup = 1000,
  save_pars = save_pars(all = TRUE),
  sample_prior = "yes" # could also be set to "only"
)

The key here is to set center = FALSE in the formula object, which allows you to specify a prior on the actual population-level intercept.
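To see what that changes, you can inspect the parameter overview for the formula object (a quick check, using the same placeholder data df as above):

# With center = FALSE the intercept appears as class "b", coef "Intercept",
# which is why the prior above can target it directly
get_prior(bad_priors_example, data = df)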

1 Like

Yet when I call prior_summary() on any existing model for which priors were not specified on the betas, those priors are described literally and explicitly as “(flat)”. How are those “flat” priors handled mathematically when the model is being fit? Does it just mean that the estimates are based on the likelihood function alone and are therefore maximum-likelihood estimates rather than Bayesian posterior means?

UPDATE: I seem to have “figured it out”. You just specify the prior in question as “” rather than “(flat)”.
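For example, continuing from the mock data and the flatpriors object in my first post:

# Setting the prior string to "" (instead of "(flat)") is accepted
flatpriors$prior[which(flatpriors$class == "Intercept")] <- ""
flatprior.mod <- brm(y ~ x1 + x2 + x3, family = bernoulli, data = d,
                     prior = flatpriors)
# I'd guess set_prior("", class = "Intercept") would do the same,
# though I haven't tested that route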

Will of course welcome any further input nonetheless.

Try running

hist(rnorm(10e5, 0, 10000))

and you’ll see that while this may seem like a “flat” prior, it actually places the bulk of its probability mass on values far beyond any plausible effect size.

“Flat” priors effectively say that you think an effect size of more than 100 on the log scale is essentially as likely as an effect size of zero, and I can think of few, if any, realistic cases where such an assumption is warranted.
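A quick numeric check of that claim, sticking with the normal(0, 10000) example:

# An effect of 100 is essentially as probable a priori as an effect of 0 ...
dnorm(100, mean = 0, sd = 10000) / dnorm(0, mean = 0, sd = 10000)  # ~0.99995
# ... and almost all of the prior mass lies beyond |5|, which already covers
# practically any realistic effect size
2 * pnorm(-5, mean = 0, sd = 10000)                                # ~0.9996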

@betanalpha has a great explanation of prior models and the basics of probability theory here (Writing - betanalpha.github.io) that is worth reading through.

Edit: edited for consistency following @torkar’s suggestion below

3 Likes

But the example above used an identity link? Wouldn’t it be more fair to simply use:

hist(rnorm(10e5, 0, 10000))

Of course, that plot is also ridiculous… :)

1 Like

All my experimentation concerns qualitative outcomes (binary to quaternary) with a logit link function.

I don’t intend to actually use flat priors in the final analysis, but at this point I want to gain a better understanding of how different types of weak prior influence the logit coefficients, especially for parameters about which the data themselves have fairly little to say (i.e., where the likelihood is weak).
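A sketch of the kind of comparison I have in mind (reusing the mock data d from my first post; the specific priors are just illustrative):

# Same model under a wide and a tight prior on the betas
wide.mod  <- brm(y ~ x1 + x2 + x3, family = bernoulli, data = d,
                 prior = set_prior("normal(0, 10)", class = "b"))
tight.mod <- brm(y ~ x1 + x2 + x3, family = bernoulli, data = d,
                 prior = set_prior("normal(0, 1)", class = "b"))
# x2 and x3 are highly imbalanced (about 3% ones and 1.5% zeros), so their
# coefficients should be the ones most visibly pulled around by the prior
fixef(wide.mod)
fixef(tight.mod)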

As for priors yielding U-shaped distributions in the probability space, I’ve had the same argument on this forum before. Then, as now, I appeal to an authority:

Agresti, Alan (2013). Categorical Data Analysis. 3rd ed. Hoboken, New Jersey: John Wiley & Sons.

In which case b \sim N(0, 2) is probably a pretty diffuse prior, since logit coefficients beyond \pm 5 are basically unicorns.
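For a sense of scale, a quick check of what b \sim N(0, 2) implies for odds ratios:

# Central 95% interval of odds ratios implied by b ~ N(0, 2)
exp(qnorm(c(0.025, 0.975), mean = 0, sd = 2))  # roughly 0.02 to 50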

If I recall, the approach suggested in BDA3 (chapter 7, I think) is to use a Student-t distribution and vary the \nu parameter, from around 3, which gives you some pretty thick tails, up to higher values that approximate a normal distribution (see the image I threw together to illustrate this below). That might be something worth considering, as it would let you vary both the shape and the width of the prior and examine how sensitive the estimates are to the shape of the prior.

2 Likes

Those plots are beautiful! What did you draw them with?

Chapter 16 seems to be where the BDA3 folks discuss logistic regression and unstable logit coefficients. They indeed present t distributions (as well as Cauchy and Gaussian ones) as reasonable priors for such coefficients (pages 416 to 419). I presently lean toward Gaussian ones with \sigma very much in the neighborhood you suggest, simply because my audience is statistically unsavvy and likely to find normal priors the easiest to understand.

The ggdist package is very useful for conveying prior assumptions and what those assumptions mean. The vignette here has some examples (Slab + interval stats and geoms • ggdist), and here’s some example code from a project I just presented:

# required libraries: tidyverse, tidybayes, ggdist, brms, palettetown, latex2exp

main_model_priors <-
  prior(student_t(4, 0, 3), class = "b", coef = "Intercept") +
  prior(normal(0, 2), class = "b") +
  prior(normal(0, 2), class = "gamma") +
  prior(cauchy(0, 2.5), class = "sd") +
  prior(exponential(0.8), class = "phi") +
  prior(lkj(4), class = "cor") +
  prior(normal(0, 1), class = "z")

priors_plot <- main_model_priors %>% 
  parse_dist() %>% 
  marginalize_lkjcorr(K = 2) %>% 
  mutate(
    distrib = case_when(
      prior == "student_t(4, 0, 3)" ~ TeX(r'($\alpha\, \sim \, Student\, T(4, 0, 3)$)', output = "character", italic = T),
      prior == "normal(0, 2)" ~ TeX(r'($\gamma_{k} \sim N(0, 2),\, \beta_{k} \, \sim \, N(0, 2)$)',  output = "character"),
      prior == "exponential(0.8)" ~ TeX(r'($\phi \sim Exponential(0.8)$)',  output = "character"),
      prior == "cauchy(0, 2.5)" ~ TeX(r'($\sigma_{j} \sim Cauchy_{+}(0, 2.5)$)',  output = "character"),
      prior == "lkj(4)" ~  TeX(r'($\Omega_{j}\, \sim \, LKJ(4)$)', output = "character"),
      prior == "normal(0, 1)" ~  TeX(r'($z_{j} \, \sim \, Normal(0, 1)$)', output = "character")
    )
  ) %>% 
  ggplot(., aes(dist = .dist, args = .args)) +
  facet_wrap(vars(distrib), scales = "free", labeller = label_parsed) +
  stat_dist_halfeye(
    aes(fill = prior),
    n = 10e2,
    fill_type = "gradient",
    show.legend = F
  ) +
  coord_flip() + 
  # Set the fill parameter for each group
  scale_fill_manual(values = palettetown::pokepal("Suicune", spread = 10)) +
  theme_bw() +
  # theme_bw() does not accept theme elements directly; pass them to theme()
  theme(
    plot.margin = margin(5, 5, 5, 5, "mm"),
    strip.background = element_rect(fill = NA),
    strip.text = element_text(family = "serif", size = 14),
    axis.text.y = element_blank(), 
    axis.ticks = element_blank()
  ) +
  labs(
    x = "",
    title = "",
    y = "Log Odds"
  ) 

# Render the priors plot to a file
ggsave(
  filename = "Main_Prior_Plot.png",
  plot = priors_plot,
  device = "png",
  path = getwd(),
  width = 15,
  height = 15,
  units = "in",
  dpi = "retina",
  type = "cairo",
  limitsize = FALSE
)

I’ve had some success using visualization to communicate priors to non-statistical/frequentist audiences in terms of assumptions about the universe of possible effect sizes (e.g., in the social sciences we can usually assume a priori that |\beta_{k}| < 5 and that small effects are more likely than large ones).

2 Likes

I discuss tail shapes for one-dimensional prior density functions in Section 3 of my prior modeling case study, Prior Modeling. My discussion of “flat” prior models can also be found in Section 4.2.3 of the same document.

Let me also comment on the question of whether or not a Bayesian analysis gives “equivalent” results to a frequentist analysis when the prior is flat. This is a common claim but actually isn’t all that well-defined.

What is true is that for a flat prior density function in a fixed parameterization of the model configuration space, the posterior density function will have the same shape as the realized likelihood function (up to normalization). Any change of parameterization, however, breaks this equivalence because the posterior density function picks up a Jacobian correction while the likelihood function does not. Another way of thinking about this is that concepts like “flat” and “uniform” can’t be defined universally but rather have to be defined within a fixed parameterization.
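Concretely, a small sketch of that statement: for a smooth one-to-one reparameterization \phi = g(\theta), the two objects transform as

\pi(\phi \mid y) = \pi\big(\theta(\phi) \mid y\big) \, \left| \frac{d\theta}{d\phi} \right|,
\qquad
L(\phi; y) = L\big(\theta(\phi); y\big),

so a prior density that is flat in \theta is in general no longer flat in \phi. For example a prior density that is flat in \sigma becomes, after the change of variables \phi = \log \sigma, a density proportional to e^{\phi}, which is anything but flat.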

That said, the outputs of frequentist and Bayesian analyses won’t necessarily be the same just because the posterior density function and likelihood function are the same: frequentist methods use the likelihood function differently from how Bayesian methods use the posterior density function!

Frequentist methods return parameter estimators (point estimators; interval or set estimators, also known as confidence intervals or confidence sets; etc.) that may not even depend on the likelihood function. The most common estimator that does utilize the likelihood function is the maximum likelihood estimator, which returns the location of the peak, but the behavior of this estimator is useful only in the asymptotic limit. The likelihood function can also be used to construct interval/set estimators (using the curvature of the likelihood function at the peak or the shape of the entire likelihood function), but the coverage of these estimators can only be guaranteed in that same asymptotic limit.

Formal Bayesian methods integrate over the posterior density function to give posterior expectation values, which is an entirely different operation from the estimator construction in frequentist methods. Sometimes, however, certain expectation values and frequentist estimators will give the same answer when the prior model is flat. For example, if the realized likelihood function is symmetric then the posterior mean will equal the maximum likelihood point. If the likelihood function is exactly bell-shaped then posterior credible intervals based on quantiles will equal certain likelihood-based confidence intervals.
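In symbols, with a flat prior the posterior mean reduces to (assuming the integrals exist)

\mathbb{E}[\theta \mid y] = \frac{\int \theta \, L(\theta; y) \, d\theta}{\int L(\theta; y) \, d\theta},

which coincides with the maximum likelihood point \hat{\theta} = \arg\max_{\theta} L(\theta; y) when the realized likelihood function is symmetric around its peak.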

4 Likes

This is basically the Bernstein-von Mises theorem, correct? Conditional on some pretty sweeping assumptions, and with the exception of cases where we know a priori it breaks down (e.g., infinite-dimensional parameter models), Bayesian and maximum likelihood estimators will eventually converge to the same mean (the issue that asymptotically we’re all dead notwithstanding).

This explanation is interesting given how pervasive the claim that “flat priors give you point estimates nearly identical to the equivalent frequentist (MLE) model” seems to be.

Sort of. The Bernstein-von Mises theorem considers the asymptotic limit where the likelihood function and posterior density function concentrate around the “true” value so strongly that basically all characterizations are equivalent.

Preasymptotically, however, there can also be circumstances where certain posterior expectation values and frequentist estimators “accidentally” yield the same results.

Sadly many errors are pervasive in statistical discourse!

Perhaps the most pervasive is that everything is asymptotic, and hence well approximated by normal models and the like. The challenge is that all of these consequences fall apart when looking at more realistic settings where asymptotics aren’t a useful approximation.

2 Likes

There’s deep insight in this passage, and it’s baked right into the core of Stan itself when Stan forces you to declare your parameters in a parameters block. This isn’t some syntactic quirk to placate Stan’s compiler and tell it what variable names to expect later on; rather, it represents the choice of parameterization for the model. For any future readers who don’t quite see what’s going on here, I’ve tried to preserve the important intuition while stripping away as much math as possible: Jacobian adjustments explained simply

4 Likes