Prior for logistic regression when you have prior knowledge about the outcome

First, I want to thank all the contributors to Stan, rstan(arm) and related packages for this great work! I’m quite new to Bayesian analysis, but I’d like to apply Bayesian methods in my current research. I have a binary outcome (whether people with dementia fall during their hospital stay, yes/no) and various independent variables, and I’m struggling with which priors to use.

I’m using rstanarm to fit my model. If I know that, in general, the prevalence of falls among dementia patients in hospitals is about 30%, how do I use this knowledge to define my priors? So the best prior knowledge I have is about my outcome variable. My guess is that I would use the default (normal) priors for the independent variables, but specify a different prior_intercept in the stan_glm() call… Is this correct? And if so, how does a “30% prevalence rate” translate into a prior distribution for my logistic regression model?


Presuming you know that from past data rather than from the y you have now, you can use the fact that rstanarm runs with the covariates centered. So, you can interpret the intercept as the log-odds of the outcome conditional on all covariates being at their means, which is not the same thing as the unconditional log-odds of the outcome but is often not so different from it. So, you can set prior_intercept accordingly, perhaps with prior_intercept = normal(location = qlogis(0.3), scale = 0.5, autoscale = FALSE).
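To see what that suggested prior implies on the probability scale, here is a minimal base R sketch. It only uses the location and scale from the answer above; the 95% interval is just one convenient way to summarise the prior, not something the answer prescribes.

```r
# Prior mean on the log-odds scale for an assumed 30% prevalence.
prior_location <- qlogis(0.3)   # same as log(0.3 / 0.7)
prior_scale <- 0.5              # scale suggested in the answer above

# Central 95% prior interval for the intercept on the log-odds scale ...
logodds_interval <- qnorm(c(0.025, 0.975), mean = prior_location, sd = prior_scale)

# ... mapped back to the probability scale with the inverse logit.
prob_interval <- plogis(logodds_interval)

round(prior_location, 3)
round(prob_interval, 3)
```

With these numbers the prior is centered at a log-odds of about -0.85 and puts most of its mass on prevalences between roughly 14% and 53% (at the covariate means). A smaller scale would express more confidence in the 30% figure, a larger one less.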

Thanks! The prior knowledge comes from previous research and literature reviews; it’s not based on the data I collected. I’m now much clearer about the choice of prior distributions for logistic regression, and that I have to “think” on the log-odds scale rather than in probabilities or odds ratios.

Two follow-up questions:

  1. You said “runs with the covariates centered” - is centering automatically done by rstanarm or would I need to center my covariates before fitting the model?
  2. If I also have prior knowledge for certain covariates, would I use rstan instead of rstanarm, to define my model in a way that lets me specify a prior distribution for each term separately?

Centering is done internally by rstanarm, and in the output the intercept is shifted back so that it corresponds to the expected log-odds of the outcome when the original covariates are zero. You can specify a vector of locations and scales (and degrees of freedom or other hyperparameters) for the priors in rstanarm to convey different priors for different coefficients within the same parametric family. In the brm function in the brms R package, you can specify different prior families, and of course you can do that (or even use multivariate priors) if you use RStan directly.
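A rough sketch of both points, under stated assumptions: the data are simulated, the covariates age and delirium are hypothetical names, and the prior locations and scales are placeholders rather than recommendations. The first part uses base glm() only to illustrate how an intercept estimated with centered covariates is shifted back to the original scale; the second part passes one location and scale per coefficient to stan_glm() and is skipped if rstanarm is not installed.

```r
set.seed(1)
n <- 200
# Hypothetical data: 'age' and 'delirium' are made-up covariates.
dat <- data.frame(age = rnorm(n, 80, 7), delirium = rbinom(n, 1, 0.3))
dat$fall <- rbinom(n, 1, plogis(-0.85 + 0.03 * (dat$age - 80) + 0.7 * dat$delirium))

# Part 1: the centering identity, illustrated with base glm().
fit_raw <- glm(fall ~ age + delirium, data = dat, family = binomial())
dat_c <- transform(dat, age = age - mean(age), delirium = delirium - mean(delirium))
fit_ctr <- glm(fall ~ age + delirium, data = dat_c, family = binomial())

# Intercept at the covariate means, shifted back to "all covariates zero".
xbar <- colMeans(dat[, c("age", "delirium")])
slopes <- coef(fit_ctr)[c("age", "delirium")]
shifted_back <- coef(fit_ctr)[["(Intercept)"]] - sum(slopes * xbar)
all.equal(shifted_back, coef(fit_raw)[["(Intercept)"]], tolerance = 1e-4)

# Part 2: one prior location/scale per coefficient, in model-matrix order
# (age, then delirium). The values are placeholders, not recommendations.
if (requireNamespace("rstanarm", quietly = TRUE)) {
  fit <- rstanarm::stan_glm(
    fall ~ age + delirium, data = dat, family = binomial(link = "logit"),
    prior = rstanarm::normal(location = c(0, 0.7), scale = c(0.05, 0.5),
                             autoscale = FALSE),
    prior_intercept = rstanarm::normal(location = qlogis(0.3), scale = 0.5,
                                       autoscale = FALSE),
    chains = 2, iter = 1000, seed = 1, refresh = 0
  )
  print(rstanarm::prior_summary(fit))  # shows the priors that were actually used
}
```

Because prior_intercept refers to the intercept with centered covariates, the reported intercept (covariates at zero) can be far from qlogis(0.3) when a covariate such as age has a mean far from zero; that is expected and does not mean the prior was ignored.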

Thanks a lot, that really helped! For my studies, the results from stan_glm and glm do not differ much; however, I feel more comfortable with the Bayesian way of inference than with the frequentist one.