Tuning priors for linear mixed model

I’m trying to tune the priors for a multivariate longitudinal model as part of prior predictive checking, but I’m finding it a very difficult process. Here’s a toy univariate example using the sleepstudy dataset from the lme4 package:

library(lme4)
library(brms)

fit_prior <- brm(
    Reaction ~ Days + (Days | Subject),
    data = sleepstudy, chains=4,
    prior = c(set_prior("normal(0, 10)", class = "b")),
    sample_prior = "only")
> prior_summary(fit_prior)
                     prior     class      coef   group resp dpar nlpar bound       source
             normal(0, 10)         b                                                 user
             normal(0, 10)         b      Days                               (vectorized)
 student_t(3, 288.7, 59.3) Intercept                                              default
      lkj_corr_cholesky(1)         L                                              default
      lkj_corr_cholesky(1)         L           Subject                       (vectorized)
     student_t(3, 0, 59.3)        sd                                              default
     student_t(3, 0, 59.3)        sd           Subject                       (vectorized)
     student_t(3, 0, 59.3)        sd      Days Subject                       (vectorized)
     student_t(3, 0, 59.3)        sd Intercept Subject                       (vectorized)
     student_t(3, 0, 59.3)     sigma                                              default
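(Side note: the same defaults can also be inspected before fitting anything, assuming a reasonably recent brms version.)

# Inspect the default priors implied by this formula and data, without fitting
get_prior(Reaction ~ Days + (Days | Subject), data = sleepstudy)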

pp_check(fit_prior, ndraws = 200)

This will typically give me a PPC like so:

So maybe this is OK, but it’s exploring a very wide space for the response variable, and I don’t expect Y to ever be negative. So I’ll try being a lot more informative about sigma and sd:

fit_prior2 <- brm(
    Reaction ~ Days + (Days | Subject),
    data = sleepstudy, chains=4,
    prior = c(set_prior("normal(0, 10)", class = "b"),
              set_prior("student_t(3, 0, 1)", class = "sigma"),
              set_prior("student_t(3, 0, 5)", class = "sd")),
    sample_prior = "only")

pp_check(fit_prior2, ndraws = 200)

Is this actually better? I’m not really sure. Yes, it produces fewer negative values, but now each line is a very narrow spike and none look close to the shape of the real y. Can anyone advise here? I’ve been playing with this for some time, varying every prior I can find, but I never seem to get it quite right.


Is it really true that there are no lines close to the shape of the real y in that second example?

I can’t really see well enough to tell. But is that important? I consider the high, narrow peaks to be unrealistic given the nature of the data (a study of sleep deprivation in 18 individuals over 10 days): the response values being tightly grouped across all individuals at all times is highly unrealistic.

If the aim of the priors in this case is to prevent estimating negative values, then why not use a likelihood that explicitly prevents this from happening (e.g., log-normal or gamma)? The priors are affecting the whole joint posterior, so I imagine that it will be hard to tweak specific priors to induce a specific posterior shape.
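For example, a minimal sketch of what I mean, just swapping the family (the prior value here is only illustrative, since the slope is now on the log scale):

# Same structure, but a log-normal likelihood so negative draws are impossible by construction
fit_prior_ln <- brm(
    Reaction ~ Days + (Days | Subject),
    data = sleepstudy, chains=4,
    family = lognormal(),
    prior = set_prior("normal(0, 0.2)", class = "b"),  # log-scale slope, illustrative value
    sample_prior = "only")

pp_check(fit_prior_ln, ndraws = 200)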

It seems like perhaps the prior predictive check is showing you that the default priors work well (at least they seem to if we ignore the negative estimates), but that it is the likelihood permitting values that aren’t reasonable, not the priors per se.


No, the aim is to constrain the parameter space to realistic values; negative values were just an easy example of that. I’d also like to rule out, for example, +5,000, which is also a totally unrealistic value. I’ve considered a lognormal family, but I still have prior tuning problems, and my co-authors/the field want linear slopes as an output, which lognormal won’t give me (I dislike this reasoning, but for now at least it’s not a point I want to push).

But isn’t this the point of informative priors? I want the priors to limit the predicted range to rule out totally impossible values (based on subject-matter expertise). Values of -5,000 or +5,000 are never going to happen in real data. To put that another way, neither of the distributions produced by the prior predictive checks above looks like any real dataset I’d expect to see in this setting.

Well, it’s a pretty standard linear mixed model. I spent some time thinking about each parameter and trying different things, and maybe I figured something out. It seems that the key prior controlling this behaviour is actually the sd for the group-level slope (and maybe, to a lesser extent, the group-level intercept). Thinking about the example above, this makes sense (to me at any rate). The default priors for these parameters from fit_prior above are:

     student_t(3, 0, 59.3)        sd      Days Subject                       (vectorized)
     student_t(3, 0, 59.3)        sd Intercept Subject                       (vectorized)

This allows really wide values for subject-level slopes and intercepts. But in the real world, for the data in this example (reaction times in sleep-deprived people over ten days), I would never expect to see a slope of this magnitude in an individual. If I did, I’d know something was wrong with the data and go back to check what happened. So bearing this in mind, I did a new PPC as follows:

fit_prior3 <- brm(
    Reaction ~ Days + (Days | Subject),
    data = sleepstudy, chains=4,
    prior = c(set_prior("normal(0, 10)", class = "b"),   # population-level effect size
              set_prior("student_t(3, 0, 5)", class = "sd", group = "Subject", coef = "Days")),
    sample_prior = "only")

pp_check(fit_prior3, ndraws = 200)

Resulting in:

To me, at least, this looks much better than the two earlier attempts. It mostly cuts out negative values of Y, mostly keeps positive values under 1,000, and the shapes of the curves look more realistic to my eye.
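As a rough numeric check of that impression (just looking at how often the prior predictive draws fall outside what I’d consider plausible):

# Fraction of prior predictive draws that are negative or above 1,000
yrep <- posterior_predict(fit_prior3, ndraws = 200)
mean(yrep < 0)
mean(yrep > 1000)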

It sounds then like you will need to be using some fairly specific informative priors. I’d encourage you to be pretty deliberate in separating prior predictive checks from model construction. If you’re looking at the prior predictive checks and then modifying the priors to look like your observed data over and over again, then you’re using your data to create “informative” priors that match your data rather than creating informative priors that match your expectations. This can be dangerous, particularly if you have small data where the priors can exert more influence on the posteriors.

Ideally, priors place the majority of results in a plausible range while still entertaining the idea that we could be wrong about the true parameter values. In this respect, longer tails on some of your prior predictions may be desirable, since they allow the data to go against your expectations, which is what we want in good science: if we’re wrong and we observe data showing us that we’re wrong, then we should adjust our beliefs.

The fact that the random effects influence the dispersion of your predictive estimates makes a lot of sense. It seems like your concern from the plots is that there is a high amount of variability in what the priors believe can happen, so if you tighten those ranges, you observe less of that variability and more concentration around central tendencies.

I think your goal here is really not prior predictive checks against your data so much as some simulation-based calibration. As I noted, I think you want to be careful about tuning your priors to match your data, so the way around that is to simulate data under your assumed data-generating process and then check that your priors are appropriate for this simulated data. This involves prior predictive checks, but then also inspection of the posterior to see that your final model recovers the parameters as intended. While the priors are important, the ultimate point of Bayesian modeling is to obtain a reasonable and appropriate posterior. Simulation-based calibration lets you tweak the priors without relying on the data and lets you verify the more important aspect of all of this: do the priors lead to reasonable posteriors with data like what you observed?
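To make that concrete, here is a minimal single-iteration sketch of the simulate-and-refit idea (object names are illustrative; a full simulation-based calibration repeats this many times and checks that the posteriors recover the parameter draws that generated each dataset):

# 1. Draw one fake dataset from the prior predictive distribution
set.seed(1)
yrep <- posterior_predict(fit_prior3, ndraws = 1)
sim_data <- sleepstudy
sim_data$Reaction <- as.numeric(yrep[1, ])

# 2. Refit the same model to the simulated data, now sampling the posterior
fit_sim <- update(fit_prior3, newdata = sim_data, sample_prior = "no")

# 3. Check that the posterior and its predictions look reasonable
summary(fit_sim)
pp_check(fit_sim, ndraws = 200)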

I’d encourage thinking about models in a more holistic fashion, where the entirety of the model aims to recreate the data-generating process. You can have the perfect priors, but if your data are generated from a skew-normal or gamma distribution, then your model is not necessarily going to capture this, and the posterior predictive checks may be fairly far off. Obviously, you have other limitations here with respect to your co-authors, though I’d encourage some serious discussion around the intended aims and goals of the analysis. When I’ve advocated for less traditional statistical approaches, I often run both the desired analyses and my suggested alternative to show co-authors the relative advantages of the latter (though I’ve also been wrong before, and it’s a helpful exercise to think about data analysis from multiple perspectives).

That said, brms offers quite a few convenience functions and arguments for transforming linear predictions. While that can take some thinking about what’s appropriate and when, I’ll just say that the conditional_effects() plots in brms are very intuitive in this respect.
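For example (illustrative, with whichever fitted model you end up with; here fit_sim from the sketch above):

# Population-level effect of Days on the response scale, with the raw data overlaid
plot(conditional_effects(fit_sim, effects = "Days"), points = TRUE)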