Better priors (non-flat) for GAMs (brms)

TL;DR: What are some possible heuristics for setting non-flat priors for generalized additive models (GAMs)?

I am trying to run a somewhat complex shifted log-normal GAM, but I have trouble understanding, and thus setting, appropriate priors. (This is motivated by the fact that the model doesn't converge well, which seems to be related to the default flat priors.)

library(brms)

formula <- brms::brmsformula(mpg ~ s(wt), family = shifted_lognormal())

brms::get_prior(formula, data = mtcars)
#>                 prior     class  coef group resp dpar nlpar bound       source
#>                (flat)         b                                        default
#>                (flat)         b swt_1                             (vectorized)
#>  student_t(3, 3, 2.5) Intercept                                        default
#>     uniform(0, min_Y)       ndt                                        default
#>  student_t(3, 0, 2.5)       sds                                        default
#>  student_t(3, 0, 2.5)       sds s(wt)                             (vectorized)
#>  student_t(3, 0, 2.5)     sigma                                        default

Created on 2021-06-15 by the reprex package (v2.0.0)

In the toy example above, there is one fixed parameter (swt_1), which I assume is related to the smooth term (s(wt)). However, I am not sure what it means (does it control the wiggliness of the smooth term, or some other feature of the splines?).

In particular, how can I set somewhat better default priors without running the model with flat priors first, i.e. based for instance on the range / SD of the response or predictors?

For instance, running the model with flat priors gives me a negative coefficient for the swt_1 parameter, which I’m not sure how to interpret…

summary(brms::brm(formula, data = mtcars, refresh = 0, iter = 500))
Warning: 1 of 1000 (0.0%) transitions ended with a divergence.
This may indicate insufficient exploration of the posterior distribution.
Possible remedies include: 
  * Increasing adapt_delta closer to 1 (default is 0.8) 
  * Reparameterizing the model (e.g. using a non-centered parameterization)
  * Using informative or weakly informative prior distributions 

 Family: shifted_lognormal 
  Links: mu = identity; sigma = identity; ndt = identity 
Formula: mpg ~ s(wt) 
   Data: mtcars (Number of observations: 32) 
Samples: 4 chains, each with iter = 500; warmup = 250; thin = 1;
         total post-warmup samples = 1000

Smooth Terms: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(swt_1)     0.31      0.33     0.01     1.23 1.01      333      428

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     2.66      0.17     2.35     2.94 1.01      527      496
swt_1        -1.87      0.78    -3.33    -0.21 1.00      426      299

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.20      0.05     0.13     0.30 1.01      544      676
ndt       4.44      2.14     0.39     7.99 1.01      529      547

Samples were drawn using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

I can set a prior based on this model:

priors <- c(brms::set_prior('normal(-1.87, 0.78)', class = 'b', coef = "swt_1")) %>%
  brms::validate_prior(formula, data = mtcars)

but I would like a way to find reasonable priors “from the data”.

Any thoughts and suggestions are welcome!


Hi, that’s a good question!

I'll start with your wish to find reasonable priors "from the data":

That's usually a bad idea. Priors should be derived from domain expertise or other considerations you can make without taking the actual data into account, e.g. the experimental design, or data from a pilot/previous experiment that you don't use in this new model. Deriving priors from the data runs the risk of making you overconfident in your inferences. Although we often make some decisions about priors only after we have collected the data (e.g. because the model does not fit), one should always check whether the prior would be defensible without seeing the data.

However, you almost always have at least some prior information. E.g. in the shifted lognormal model, all the coefficients correspond to logarithms of multiplicative changes in the mean. In many cases it is reasonable to assume that groups (or the min/max values of a continuous predictor) differ less than, say, 20-fold, so having max_predictor_difference * coefficient < log(20) is a good rule of thumb. This can be well represented by something like normal(0, 1.5 / max_predictor_difference), since log(20) / 2 ~= 1.5. That's already quite a narrow prior, and we only needed to know the predictors (i.e. the design), not the data!
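
To make that concrete for the toy example above, here is a minimal sketch of the arithmetic (max_predictor_difference and prior_sd are my own names, and brms rescales the smooth basis anyway - see below - so treat this only as a rough starting point):

# Rule of thumb: a 20-fold multiplicative change across the observed
# range of the predictor should be about 2 prior SDs away from zero
max_predictor_difference <- diff(range(mtcars$wt))  # range of the predictor
prior_sd <- log(20) / 2 / max_predictor_difference  # ~= 1.5 / range
prior_sd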

Some guidance (although some of it is a bit outdated) can be found at Prior Choice Recommendations · stan-dev/stan Wiki · GitHub, and the most general way to set priors is to use prior predictive checks (discussed e.g. in the Bayesian workflow preprint or the visualisation preprint). You can often run prior predictive checks in brms by setting sample_prior = "only", but it is often instructive to write your own simulation code in R to verify you understand what brms is actually doing :-).
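
For the toy model above, a prior predictive check could look roughly like this (the specific priors here are placeholders to illustrate the mechanics, not recommendations):

priors <- c(
  brms::set_prior("normal(0, 1)", class = "b"),
  brms::set_prior("student_t(3, 0, 1)", class = "sds")
)
# Sample from the prior only (the data are not conditioned on) and
# inspect the implied distribution of the response
prior_fit <- brms::brm(formula, data = mtcars, prior = priors,
                       sample_prior = "only", refresh = 0)
brms::pp_check(prior_fit)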

To get to your specific case - brms borrows the parametrization of smooth terms from mgcv. There is an IMHO good explanation of how that works at Random effects and penalized splines are the same thing - Higher Order Functions. Notably, the b_swt_1 parameter is basically the linear trend of wt, although the wt values are somehow centered and rescaled - I don't understand exactly how, but you can look at the result of

# Inspect the data brms passes to Stan for the smooth term
dd <- brms::make_standata(brms::brmsformula(mpg ~ s(wt), family = shifted_lognormal()),
                          data = mtcars)
dd$Xs                   # design matrix for the unpenalized part of the smooth
plot(dd$Xs, mtcars$wt)  # a straight line: Xs is a linear transformation of wt

to see that it is just a linear transformation of wt (the resulting plot is a straight line).

Best of luck in setting your priors!


Hi Martin,

Thanks a lot for your detailed answer - it’s very helpful.

These practical rules of thumb for avoiding flat priors are IMO very useful (and advice on this is somewhat hard to find).

That’s indeed a very important piece of information!

It should be now easier for me to understand and interpret the parameters of my models. Thanks again :)


I appreciate that this is now an older posting and that @martinmodrak has provided an answer that solved @DominiqueMakowski’s original question. But I thought I’d briefly comment on this aspect of the question

In the toy example above, there is one fixed parameter (swt_1), which I assume is related to the smooth term (s(wt)). However, I am not sure what it means (does it control the wiggliness of the smooth term, or some other feature of the splines?).

in case others arrive here while looking for info on priors for smooths in brms.

With these penalised splines we have terms in the spline basis expansion that are affected by the wiggliness penalty and terms that aren't. For the default second derivative penalty, the linear and constant functions are in the null space of the penalty, meaning that they aren't affected by it (they have zero second derivative). Due to the identifiability (sum-to-zero) constraints imposed on the basis expansions of all splines in these models (so that we can have an intercept term), the constant function is removed from the span of functions representable by the basis. That leaves only the linear function unpenalized.

When we cast these splines in random effect form, the basis functions in the penalty null space get treated as fixed effect terms in the model, while the basis functions in the range space (those affected by the penalty) are treated as random effect terms. In this fully Bayesian approach we now need a prior on the otherwise unpenalized linear basis function, plus a prior for each smoothing parameter (penalty) associated with each smooth (usually there is one, but there can be more if you are using tensor products or some of the fancier bases). In the random effect form we don't put priors on the smoothing parameter(s) but on the variances (standard deviations) of the random effects. The variance of the random effect is inversely proportional to the smoothing parameter (smaller variance = larger smoothing parameter = less wiggly smooth function).
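
If you want to see this split concretely, mgcv itself can re-express a smooth in random effect form; here's a sketch along the lines of the blog post linked above, assuming the default thin plate basis:

library(mgcv)

# Build the s(wt) basis with the sum-to-zero constraint absorbed
sm <- smoothCon(s(wt), data = mtcars, absorb.cons = TRUE)[[1]]

# Convert to random effect form
re <- smooth2random(sm, vnames = "", type = 2)
dim(re$Xf)        # fixed part: the single unpenalized (linear) basis function
dim(re$rand$Xr)   # random part: the penalized, wiggly basis functions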

Hence, in your model you have sds(swt_1), which is the standard deviation parameter for the wiggly basis functions - the further this is from 0, the more wiggly the estimated smooth - and you have the swt_1 population-level (fixed) effect, which is for the linear basis function.

brms knows about all this, and pulls the linear term back into the basis when you plot the estimated smooths or prior draws etc.

But you can pop a Gaussian or t prior on the swt_1 term, for example, and another separate prior on the sds term, then simulate from the prior distribution of the model (with sample_prior = "only") and plot the prior draws using conditional_smooths() to see samples of smooths drawn from your prior. You should look at the amount of wiggliness of the prior smooths to see whether it roughly matches the plausible range of wiggliness for the non-linear relationships you envisage based on your prior knowledge. You can also look at the values on the y axis to see whether the implied effect sizes are reasonable given your prior expectations. Then you can adjust the priors until the effect sizes are roughly in line with your expectations, and likewise for the wiggliness of the smooths implied by the prior.
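
A sketch of that workflow for the model in this thread (the actual prior values are placeholders you would tune):

priors <- c(
  brms::set_prior("normal(0, 1)", class = "b", coef = "swt_1"),         # linear basis function
  brms::set_prior("student_t(3, 0, 1)", class = "sds", coef = "s(wt)")  # wiggliness sd
)
prior_fit <- brms::brm(formula, data = mtcars, prior = priors,
                       sample_prior = "only", refresh = 0)
# Plot smooths drawn from the prior; adjust the priors until the wiggliness
# and the y-axis effect sizes look plausible
plot(brms::conditional_smooths(prior_fit), ask = FALSE)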


@paul.buerkner this post from @ucfagls might be worth incorporating into brms doc/vignettes around splines somewhere.


Good idea!
