Multiple priors for the same parameter


I have multiple sources available for constructing informative priors for a single parameter. I was wondering whether it is logically correct to specify multiple priors for the same parameter directly in a Stan model.

I tried specifying multiple priors in Stan and there was no error during fitting. As a check, I ran the model with prior 1 alone, with prior 2 alone, and with both prior 1 and prior 2, and specifying two priors on the same parameter does yield results between those from the individual priors. Does Stan implicitly form a weighted mixture of the multiple priors for the same parameter, 1:1 in this case?

To be more specific, I have multiple priors for a parameter of interest (e.g., a mean) from different information sources, and I implemented this in Stan as:

model {
  mu ~ normal(mu0, sigma0);      // prior from information source 1
  mu ~ uniform(a, b);            // prior from information source 2
  y ~ normal(mu, sigma);         // likelihood
}

Could anyone tell me whether it is correct to do it like this?

Many thanks.

Hi! :)

I don’t know if what you are doing is “correct” - I would need more context and domain expertise for that. But you might want to check out this blog post.


Hi Max,

Thank you very much for your prompt reply. I have edited my post to give more specifics.

It appears that the post by Gelman does apply multiple priors to the same parameters to some extent, i.e., an extra prior on a combination of the parameters of interest. This seems relevant to my problem, but I am not sure whether what I am doing here (putting multiple priors on the exact same parameter, rather than on individual parameters and a combination of them) is appropriate.

Thanks again!

Hi, I would wait for others to chime in, but logarithmic pooling is a good way to combine multiple priors into one.

It seems the main papers on logarithmic pooling are paywalled, so drop me a line via private message if you’re interested and I can share the material.
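For a concrete sense of what logarithmic pooling does, here is a small Python sketch. The two normal priors and the 1:1 weight are my own illustrative choices, not numbers from this thread. Log pooling combines densities as \pi(\theta) \propto \pi_1(\theta)^{w} \pi_2(\theta)^{1-w}; for two normal priors, the pooled prior is again normal, with precision-weighted parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two priors for the same parameter (illustrative numbers):
mu1, s1 = 0.0, 10.0   # source 1: normal(0, 10)
mu2, s2 = 2.0, 4.0    # source 2: normal(2, 4)
w = 0.5               # pooling weight (1:1)

# Logarithmic pooling: pi(theta) ∝ pi1(theta)^w * pi2(theta)^(1-w).
# For normals the pooled prior is again normal, with precision-weighted
# parameters:
prec = w / s1**2 + (1 - w) / s2**2
mu_pool = (w * mu1 / s1**2 + (1 - w) * mu2 / s2**2) / prec
s_pool = math.sqrt(1.0 / prec)

# Numerical check: the renormalized product of powered densities matches
# the closed-form pooled normal on a grid.
grid = [x * 0.05 for x in range(-400, 401)]
raw = [normal_pdf(x, mu1, s1)**w * normal_pdf(x, mu2, s2)**(1 - w) for x in grid]
Z = sum(raw) * 0.05                       # crude numerical normalizer
max_err = max(abs(r / Z - normal_pdf(x, mu_pool, s_pool))
              for x, r in zip(grid, raw))
print(round(mu_pool, 3), round(s_pool, 3), max_err < 1e-3)
```

Note that with w = 1/2 this is exactly the renormalized product of the two square-rooted densities, which is generally more diffuse than the plain product that two Stan sampling statements would produce.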

I’m resurrecting this old thread, because it’s relevant to what I want to do. If one puts both a uniform and a normal prior on the same parameter, is this essentially the convolution of the two priors? And a mollification, in this case?

And related to the blog post that @Max_Mantei refers to. If one does

  target += normal_lpdf(a + 5*b | 4.5, 0.2);  // strong prior information on a + 5*b

isn’t a Jacobian adjustment necessary?

OK, it’s not the convolution, because the new distribution has support only over the uniform’s interval. I’m not sure what it is.
In Bayes’ rule one multiplies those two priors, but that is not the same as the product of two random variables with different distributions. So I’ll be happy if someone can tell me what this is equivalent to.

I wrote up a long-winded response to this, only to realise I don’t really know what I’m talking about. So I will defer to @betanalpha and @andrewgelman to provide a proper answer. I will however point out that Poole & Raftery (2000) provide one possible solution to your problem.

You can add multiple pieces of information about a parameter to a Bayesian model. Rather than thinking of these as multiple priors, it makes sense to think of these as multiple data sources.

To put it another way, “prior distribution” has two meanings:

  1. A probability distribution representing a piece of information that is external to the data you are using in your data model.
  2. The marginal probability distribution of the parameters in your model.

If you have multiple pieces of prior information, then meanings 1 and 2 above are different. In the Bayesian formalism, “prior distribution” has meaning 2.

I’m not sure about the example at the top of this thread, because it’s very rare that these uniform distributions make sense, so let me construct another example.

first piece of prior info: theta ~ normal(0, 10);
second piece of prior info: theta ~ normal(2, 4);

Here it could make sense to think of the first piece of prior info as the prior distribution and the second piece as data, thus:
y ~ normal(theta, 4);
where y=2 is specified in data.

This is mathematically and computationally the same target function and the same posterior as specifying the two priors, but it now points toward a generative model and is consistent with Bayesian theory.
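A quick numerical check of this equivalence in Python (the theta values are arbitrary test points): because the normal density is symmetric in its outcome and location arguments, normal_lpdf(theta | 2, 4) equals normal_lpdf(2 | theta, 4), so both formulations add exactly the same terms to the log target.

```python
import math

def normal_lpdf(x, mu, sigma):
    # log density of a normal(mu, sigma) distribution evaluated at x
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

for theta in [-3.0, 0.0, 1.5, 7.2]:
    # "two priors" formulation: theta ~ normal(0, 10); theta ~ normal(2, 4);
    two_priors = normal_lpdf(theta, 0.0, 10.0) + normal_lpdf(theta, 2.0, 4.0)
    # "prior + data" formulation: theta ~ normal(0, 10); y ~ normal(theta, 4), y = 2
    prior_plus_data = normal_lpdf(theta, 0.0, 10.0) + normal_lpdf(2.0, theta, 4.0)
    assert abs(two_priors - prior_plus_data) < 1e-12

print("identical log targets")
```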


Pretty sure this simply yields a truncated normal. Seems like it’s exactly how we specify a truncated normal in Stan, where the “uniform prior” is declared implicitly via the bounds, and then a “normal prior” is added via a sampling statement.
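This can be checked numerically. A Python sketch (the particular normal and interval are my own illustrative choices): multiplying a normal(1, 2) density by a uniform(0, 3) density and renormalizing reproduces the normal(1, 2) density truncated to [0, 3], since the uniform contributes only an indicator function and a constant that the normalizer absorbs.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 1.0, 2.0   # the normal prior (illustrative numbers)
a, b = 0.0, 3.0        # the uniform prior's support

# Closed-form truncated normal: the normal density restricted to [a, b],
# renormalized by the probability mass the normal assigns to [a, b].
Z = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

def truncated_pdf(x):
    return normal_pdf(x, mu, sigma) / Z if a <= x <= b else 0.0

# Product of the two prior densities, renormalized numerically on a grid.
n = 3001
grid = [a + (b - a) * i / (n - 1) for i in range(n)]
raw = [normal_pdf(x, mu, sigma) * (1.0 / (b - a)) for x in grid]
Zn = sum(raw) * (b - a) / (n - 1)
max_err = max(abs(r / Zn - truncated_pdf(x)) for x, r in zip(grid, raw))
print(max_err < 1e-3)
```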


Thanks!!! This is really interesting and I never thought about it this way!

I conveniently just put out a case study on prior modeling that discusses these issues in detail, Prior Modeling. See in particular Section 4.1.4 and Section 5.

A probability density function is more formally defined as a Radon-Nikodym derivative that updates one probability distribution into another. Given two probability distributions \pi and \rho defined over the same space (that play nicely with each other), we can recover expectations with respect to \pi from expectations with respect to \rho by

\mathbb{E}_{\pi}[f] = \mathbb{E}_{\rho} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot f \right].

Here the function \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } is the Radon-Nikodym derivative, or density, between \pi and \rho. Intuitively, the density function up-weights f in regions where \pi allocates more probability than \rho and down-weights f in regions where \pi allocates less probability than \rho.

Radon-Nikodym derivatives can also be chained. If we have three distributions \pi, \rho, and \phi then

\mathbb{E}_{\pi}[f] = \mathbb{E}_{\rho} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot f \right] = \mathbb{E}_{\phi} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot \frac{ \mathrm{d} \rho} { \mathrm{d} \phi } \cdot f \right].

Equivalently, we have a chain-rule-like result,

\frac{ \mathrm{d} \pi} { \mathrm{d} \phi }(x) = \frac{ \mathrm{d} \pi} { \mathrm{d} \rho }(x) \cdot \frac{ \mathrm{d} \rho} { \mathrm{d} \phi }(x).

Whenever we multiply density functions we’re actually updating one distribution into an intermediate distribution and then into a third, final distribution. Critically, however, this requires that we multiply a density between \pi and \rho by a density between \rho and \phi; the densities have to be chained together consistently. Trying to multiply two densities between \pi and \phi, or even a density between \pi and \phi by a density between \rho and \phi, doesn’t make any mathematical sense.
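For distributions on the real line this chaining can be checked pointwise, since the Radon-Nikodym derivative between two mutually absolutely continuous distributions is just the ratio of their ordinary (Lebesgue) densities. A small Python sketch, with three normal distributions chosen purely for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Three mutually absolutely continuous distributions on the real line:
def pi_pdf(x):  return normal_pdf(x, 0.0, 1.0)
def rho_pdf(x): return normal_pdf(x, 1.0, 2.0)
def phi_pdf(x): return normal_pdf(x, -1.0, 3.0)

# Radon-Nikodym derivatives between them are ratios of Lebesgue densities.
def d_pi_rho(x):  return pi_pdf(x) / rho_pdf(x)
def d_rho_phi(x): return rho_pdf(x) / phi_pdf(x)
def d_pi_phi(x):  return pi_pdf(x) / phi_pdf(x)

# Chain rule: d(pi)/d(phi) = d(pi)/d(rho) * d(rho)/d(phi), pointwise.
for x in [-2.0, -0.3, 0.0, 1.7]:
    assert abs(d_pi_phi(x) - d_pi_rho(x) * d_rho_phi(x)) < 1e-12

print("chain rule holds pointwise")
```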

The usual “probability density functions” that we encounter are actually the Radon-Nikodym derivatives between probability distributions over the real numbers and the uniform “Lebesgue” measure. Multiplying two of these conventional density functions together doesn’t really make sense because, as discussed just above, they are both densities with respect to the same base distribution rather than a consistent chain. This is why naive products often have weird, unexpected consequences.

Another place where Radon-Nikodym derivatives appear is in Bayes’ Theorem: a properly normalized version of the likelihood function can be interpreted as the density function between the posterior distribution and the prior distribution. As I discuss in the case study multiplying two density functions together can then be interpreted as two independent likelihood functions, although again this isn’t as useful as it might sound. Because one would be modeling a heuristic likelihood function directly, and not a full observational model that can be evaluated on some heuristic observation to give a likelihood function, the underlying assumptions and their consequences are difficult if not impossible to validate. The information encoded in a likelihood function and the domain expertise encoded in a prior model are not the same!