Multiple priors for the same parameter

Hello:

I have multiple sources available for constructing informative priors for a single parameter. I was wondering whether it is logically correct to directly specify multiple priors on the same parameter in a Stan model.

I tried specifying multiple priors in Stan and there was no error during modeling. To check, I ran the model with prior 1 alone, with prior 2 alone, and with both priors together; the results show that specifying two priors on the same parameter yields estimates between those from the individual priors. Does Stan implicitly form something like a 1:1 weighted mixture of multiple priors for the same parameter here?

To be more specific, I have multiple priors for a parameter of interest (e.g., a mean) from different information sources, and I implemented this in Stan as:

model {
  mu ~ normal(mu0, sigma0);      // prior from information source 1
  mu ~ uniform(a, b);            // prior from information source 2
  y ~ normal(mu, sigma);         // likelihood
}

Could anyone tell me whether it is correct to do it like this?

Many thanks.

Hi! :)

I don’t know if what you are doing is “correct” - I would need more context and domain expertise for that. But you might want to check out this blog post.


Hi Max,

Thank you very much for your prompt reply. I have edited my post to give more specifics.

It appears that the post by Gelman does, to some extent, apply multiple priors to the same parameters, i.e., an extra prior on a combination of the parameters of interest. This seems relevant to my problem, but I am not sure whether what I am doing here, putting multiple priors on the exact same parameter rather than on individual parameters and a combination of them, is appropriate.

Thanks again!

Hi, I would wait for others to chime in, but logarithmic pooling is a good way to combine multiple priors into one (sketched below).

It seems the main papers that discuss logarithmic pooling are paywalled, so if you’re interested, drop me a line via private message and I can share the material.
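
For concreteness, here is a minimal sketch of what logarithmic pooling could look like in Stan, assuming two normal sources with fixed pooling weights w1 and w2 (w1 + w2 = 1) supplied as data; all of these names are illustrative, not from this thread:

model {
  // logarithmic pooling: the log of the pooled prior is the weighted sum
  // of the component log densities, up to a normalizing constant
  target += w1 * normal_lpdf(mu | mu0, sigma0);  // information source 1
  target += w2 * normal_lpdf(mu | mu1, sigma1);  // information source 2
  y ~ normal(mu, sigma);                         // likelihood
}

With w1 = w2 = 0.5 this is a geometric (not arithmetic) 1:1 compromise between the two sources.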

Hi,
I’m resurrecting this old thread, because it’s relevant to what I want to do. If one adds a uniform distribution and a normal distribution to the same parameter, is this essentially adding the convolution of the two priors? And doing a mollification in this case?

And related to the blog post that @Max_Mantei refers to: if one does

  target += normal_lpdf(a + 5*b | 4.5, 0.2);  // strong prior information on a + 5*b

isn’t a Jacobian adjustment necessary?

OK, it’s not the convolution, because the new distribution has support only over the uniform’s support. Not sure what it is…
In Bayes’ rule one is multiplying those two priors, but that is not the same as the product of two random variables with different distributions. So I’ll be happy if someone can tell me what this is equivalent to.

I wrote up a long-winded response to this, only to realise I don’t really know what I’m talking about. So I will defer to @betanalpha and @andrewgelman to provide a proper answer. I will however point out that Poole & Raftery (2000) provide one possible solution to your problem.

You can add multiple pieces of information about a parameter to a Bayesian model. Rather than thinking of these as multiple priors, it makes sense to think of these as multiple data sources.

To put it another way, “prior distribution” has two meanings:

  1. A probability distribution representing a piece of information that is external to the data you are using in your data model.
  2. The marginal probability distribution of the parameters in your model.

If you have multiple pieces of prior information, then meanings 1 and 2 above are different. In the Bayesian formalism, “prior distribution” has meaning 2.

I’m not sure about the example at the top of this thread, because it’s very rare that these uniform distributions make sense, so let me construct another example.

first piece of prior info: theta ~ normal(0, 10);
second piece of prior info: theta ~ normal(2, 4);

Here it could make sense to think of the first piece of prior info as the prior distribution and the second piece as data, thus:
y ~ normal(theta, 4);
where y=2 is specified in data.

This is mathematically and computationally the same target function and the same posterior as specifying the two priors, but it now points toward a generative model and is consistent with Bayesian theory.
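
As a minimal sketch of this equivalence in Stan (names hypothetical; theta_prior_obs = 2 would be passed in as data):

data {
  real theta_prior_obs;  // pseudo-observation encoding the second source
}
parameters {
  real theta;
}
model {
  theta ~ normal(0, 10);               // first piece of prior info: the prior
  theta_prior_obs ~ normal(theta, 4);  // second piece, treated as data
}

By symmetry of the normal density, the second sampling statement contributes the same term to the target as theta ~ normal(2, 4) when theta_prior_obs = 2.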


Pretty sure this simply yields a truncated normal. Seems like it’s exactly how we specify a truncated normal in Stan, where the “uniform prior” is declared implicitly via the bounds, and then a “normal prior” is added via a sampling statement.
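
For illustration, a minimal sketch of that idiom (bounds a and b assumed to come in as data; the normal’s location and scale are placeholders):

data {
  real a;
  real<lower=a> b;
}
parameters {
  real<lower=a, upper=b> mu;  // implicit "uniform prior" on [a, b] via bounds
}
model {
  mu ~ normal(0, 1);  // together with the bounds: a normal truncated to [a, b]
}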


Thanks!!! This is really interesting and I never thought about it this way!

I conveniently just put out a case study on prior modeling that discusses these issues in detail, Prior Modeling. See in particular Section 4.1.4 and Section 5.

A probability density function is more formally defined as a Radon-Nikodym derivative that updates one probability distribution into another. Given two probability distributions \pi and \rho defined over the same space (that play nicely with each other), we can recover expectations with respect to \pi from expectations with respect to \rho by

\mathbb{E}_{\pi}[f] = \mathbb{E}_{\rho} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot f \right],

where the function \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } is the Radon-Nikodym derivative, or density, between \pi and \rho. Intuitively, the density function up-weights f in regions where \pi allocates more probability than \rho and down-weights f in regions where \pi allocates less probability than \rho.

Radon-Nikodym derivatives can also be chained. If we have three distributions \pi, \rho, and \phi then

\mathbb{E}_{\pi}[f] = \mathbb{E}_{\rho} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot f \right] = \mathbb{E}_{\phi} \left[ \frac{ \mathrm{d} \pi} { \mathrm{d} \rho } \cdot \frac{ \mathrm{d} \rho} { \mathrm{d} \phi } \cdot f \right].

Equivalently, we have a chain-rule-like result,

\frac{ \mathrm{d} \pi} { \mathrm{d} \phi }(x) = \frac{ \mathrm{d} \pi} { \mathrm{d} \rho }(x) \cdot \frac{ \mathrm{d} \rho} { \mathrm{d} \phi }(x).

Whenever we multiply density functions we’re actually updating one distribution into an intermediate distribution and then into a third, final distribution. Critically, however, this requires that we multiply a density between \pi and \rho with a density between \rho and \phi; the densities have to be chained together consistently. Trying to multiply two densities between \pi and \phi, or even a density between \pi and \phi with a density between \rho and \phi, doesn’t make any mathematical sense.

The usual “probability density functions” that we encounter are actually the Radon-Nikodym derivatives between probability distributions over the real numbers and the uniform “Lebesgue” measure. Multiplying two of these conventional density functions together doesn’t really make sense because, as discussed just above, they are both density functions with respect to the same base distribution. This is why naive products often have weird, unexpected consequences.
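
To spell out the consistent case in the notation above: if \pi and \rho both have conventional Lebesgue densities p_{\pi} and p_{\rho} (with p_{\rho} > 0 where needed) and \phi is the Lebesgue measure, then the chain rule reads

\frac{ \mathrm{d} \pi }{ \mathrm{d} \phi }(x) = \frac{ \mathrm{d} \pi }{ \mathrm{d} \rho }(x) \cdot \frac{ \mathrm{d} \rho }{ \mathrm{d} \phi }(x) = \frac{ p_{\pi}(x) }{ p_{\rho}(x) } \cdot p_{\rho}(x) = p_{\pi}(x),

so the only products that chain consistently are those where one factor’s base distribution matches the other factor’s target distribution.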

Another place where Radon-Nikodym derivatives appear is in Bayes’ Theorem: a properly normalized version of the likelihood function can be interpreted as the density function between the posterior distribution and the prior distribution. As I discuss in the case study, multiplying two density functions together can then be interpreted as applying two independent likelihood functions, although again this isn’t as useful as it might sound. Because one would be modeling a heuristic likelihood function directly, and not a full observational model that can be evaluated on some heuristic observation to give a likelihood function, the underlying assumptions and their consequences are difficult if not impossible to validate. The information encoded in a likelihood function and the domain expertise encoded in a prior model are not the same!
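
Spelling out the identity in the first sentence above (standard notation, assumed here rather than quoted from the case study), with \pi(y \mid \theta) the likelihood and \pi_{\text{prior}} the prior,

\frac{ \mathrm{d} \pi_{\text{post}} }{ \mathrm{d} \pi_{\text{prior}} }(\theta) = \frac{ \pi(y \mid \theta) }{ \int \pi(y \mid \theta') \, \pi_{\text{prior}}(\theta') \, \mathrm{d} \theta' },

where the denominator is the usual evidence term that normalizes the likelihood.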