I am currently working on my first Bayesian GLMM and am a bit overwhelmed. I have a pair of basic but important questions: one about the interpretation of the posterior distribution under default priors, and one about the different default priors for the intercept vs. the beta weights of a regression model. Any information (or helpful resources) would be greatly appreciated.
My first question is whether the posterior distribution under default priors is a proper probability distribution. For instance, if I use a flat (default) prior, is the posterior such that I may make a probability statement like “given the model and data, there is a .X probability that Y”? I ask because the flat prior is improper, and I am unsure when an improper prior still yields a proper posterior. To be clear, the sort of probability statements I have in mind are those obtained via the “hypothesis” function in brms.
My second question has to do with the differences between the prior parameterization of the intercept term as compared to the beta weights/population-level effects. This is probably naive to ask, but: why are the two types of term given different default prior parameterizations in brms (function: brm)? Specifically, the default prior for the intercept is a Student’s t, while the prior for the fixed-effects terms is flat.
I can’t tell how much of a newcomer you are, and your questions are somewhat general/philosophical, so I’ll do my best to be helpful. The meaning and interpretation of Bayesian inference can be overwhelming at first, but after a while you’ll see they actually come more naturally than in other approaches.
In general there are no “default priors” in Bayesian inference; the defaults are an implementation decision and are (probably carefully) chosen based on what works for most/many problems. Of course some priors are better than others (in some simple models, for instance, priors conjugate to the likelihood allow analytical calculation of the posterior), but with a few exceptions there should usually be a deliberate decision about which priors are appropriate. I don’t know about brms, but when implementing a model directly in the Stan language, omitting a prior specification implies a uniform prior over all possible values, so you can see that even within the Stan ecosystem the “default” can be inconsistent between interfaces.
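If it helps, brms lets you inspect its defaults before fitting anything. A minimal sketch (the formula, data frame, and variable names here are hypothetical):

```r
library(brms)

# Hypothetical GLMM: list the default priors brms would assign to each
# parameter class, without fitting the model
get_prior(y ~ x1 + x2 + (1 | group), data = mydata)
# The output shows e.g. a student_t prior for class "Intercept" and an
# empty (flat) prior for class "b" (the population-level effects)
```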
That said, the posterior is a proper probability distribution: it allows you to make a statement like “under this model, these are the probabilities of some value of the parameter \theta given the data y” (usually it will be a continuous distribution, so the interpretation is in terms of probability densities). p( \theta | y ) alone is already that statement, and using Bayes’ rule to write p( \theta | y ) = p( y | \theta) p( \theta) / p(y) is just how we are actually able to compute it. Because the denominator (p(y), the marginal probability of the data) usually cannot be computed easily, and need not be known, it is left out, and instead the posterior is a distribution proportional to the numerator that is sampled (or approximated) using methods like MCMC.
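Since you mentioned the “hypothesis” function: once you have posterior draws, probability statements of that kind are just proportions of the posterior sample. A sketch, assuming a hypothetical fitted brms model `fit` with a predictor `x1`:

```r
# Posterior probability that the coefficient on x1 is positive
hypothesis(fit, "x1 > 0")

# Equivalently, computed directly from the posterior draws
draws <- as_draws_df(fit)  # from the posterior package, re-exported by brms
mean(draws$b_x1 > 0)
```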
Although that is implicitly true in most standard setups, in my opinion you are right to be concerned about whether the terms on the right ensure the posterior is indeed a proper probability distribution. It is possible to do the right side “wrong” (e.g. put an arbitrary function in the place of the likelihood or prior) and get something that is not a proper posterior distribution, but that’s not the case with a uniform distribution. To your point, if you are using a uniform prior, the prior probability of the parameter will be a constant 1/l inside its support (of length l) and zero otherwise, so your posterior will be:

p( \theta | y ) \propto p( y | \theta ) \cdot (1/l) for \theta inside the support, and zero otherwise.
If you have a uniform prior over the whole real axis, you’ll have trouble writing that constant for p(\theta); however, since we are working with proportionality, you can drop the term by noting that it doesn’t change the relative probability of the parameter whatever its value, so you are just left with p( \theta | y ) \propto p( y | \theta ). In a roundabout way you recover maximum likelihood if you use an optimization method to get a point estimate, but if you compute a full distribution from it you are doing proper Bayesian inference.
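Spelling that equivalence out in symbols (just restating the argument above): when p(\theta) is constant, maximizing the posterior is the same as maximizing the likelihood,

\hat{\theta}_{MAP} = \arg\max_\theta p( y | \theta ) p( \theta ) = \arg\max_\theta p( y | \theta ) = \hat{\theta}_{MLE},

while keeping the whole function of \theta (rather than just its maximizer) gives you the posterior distribution.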
The first answer became quite long, so the second one goes separately.
As I mentioned above, this is probably more of an implementation decision. In fact, if asked, I might have done the opposite: a uniform prior for the intercept (since a priori it could be anything) and a normal-like prior for the fixed effects (reflecting the belief that the slopes are most likely near zero). But I’d guess the brms choice works too if you think the intercept can be readily estimated from the data alone, so it’s easy to place a more informative prior on it, while a uniform prior on the slopes is just the opposite of the belief I describe above, and that’s fair as well: it’s a belief that the slope can be any value. You could just as well make all terms uniform or all normal; those would imply different prior beliefs about what the parameters should be, and depending on what you are actually modeling, some choice may be more appropriate.
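In any case, you are not stuck with the defaults. A sketch of overriding them in brms (the formula, data, and prior values are hypothetical, not recommendations):

```r
library(brms)

# Explicit priors for both the intercept and the population-level effects
priors <- c(
  set_prior("student_t(3, 0, 10)", class = "Intercept"),
  set_prior("normal(0, 1)", class = "b")  # weakly informative slopes
)

fit <- brm(y ~ x1 + x2 + (1 | group), data = mydata, prior = priors)
```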
As @caesoma already mentioned, please note that “default” prior is not a well-defined term. However, I will say that if said “default” prior is proper (i.e. is integrable) then the posterior is proper (except possibly on a set of measure zero, for those who, like me, must care about such details).
If your prior is improper (i.e. not integrable) then all bets are off. Your posterior may or may not be a proper distribution. In a regression setting you usually need conditions on the rank of the design matrix to ensure propriety under “flat” priors. In my opinion, it’s better to use proper priors unless you really know what you’re doing. Besides, by using proper priors you get a nice, generative model to boot.
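To illustrate that last point with a sketch (hypothetical formula and data again): once every parameter class has a proper prior, brms can sample from the priors alone, i.e. simulate data from the generative model before seeing any observations:

```r
# Replace the flat default on the slopes so that all priors are proper,
# then draw from the prior predictive distribution
fit_prior <- brm(
  y ~ x1 + x2, data = mydata,
  prior = set_prior("normal(0, 1)", class = "b"),
  sample_prior = "only"
)
pp_check(fit_prior)  # datasets simulated purely from the priors
```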
I can’t comment on the specific priors implemented in brms, but maybe @paul.buerkner can. Also, it seems to me that the different choices for the intercept and the (fixed) effects don’t matter much after you centre your covariates, but I might be way off here. @andrewgelman can advise, if not too busy.
I agree. Just one thing: Stan has default models in the sense that the target function in Stan is constructed by adding terms. The default model, if you have no model block, is uniform in all parameters; adding “target +=” and “~” statements adds terms to the target function. So it’s accurate to say that the uniform prior is the default in Stan. I agree with Max that this is not a good default for many statistical analyses. We’ve talked about having a switch in Stan that would by default add uniform(0,10) priors for all parameters just to keep things bounded, but I guess the consensus is that this would be too confusing for users. But when I build models now I will usually include some default priors representing reasonable ranges of parameter values.
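For anyone following along, here is a minimal sketch of what that looks like (hypothetical model, run via rstan):

```r
library(rstan)

# Nothing in the model block mentions mu or sigma, so both implicitly
# get uniform priors over their declared support
stan_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;  // the constraint bounds the implicit uniform to (0, Inf)
}
model {
  y ~ normal(mu, sigma);
  // equivalent form: target += normal_lpdf(y | mu, sigma);
  // adding e.g. mu ~ normal(0, 10); would add a prior term to the target
}
"

fit <- stan(model_code = stan_code, data = list(N = 10, y = rnorm(10)))
```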