# Prior and posterior parameter plot

I’m working on following plot showing prior and posterior marginal distributions.

Posteriors are shown in grey and the orange distributions are priors that have specified prior distributions. But I’m not sure what I should call the yellow distributions, which have no directly specified prior distributions. They are obtained by sampling from the a-priori known model structure but without conditioning on data/likelihood. Are these to be regarded as priors or as posteriors? Or is there a better term?

I think they should be regarded as priors, exactly because there’s no conditioning on data/likelihood. I imagine they are some sort of hierarchical parameters derived from the prior (hyper-)parameters, right?

Thanks for you replay @mcol. Yes there is a hierarchical structure, why I’m not entirely convinced these should be regarded as priors. These distribution due reveal new information that was not know a-priori.

Lets consider this dummy example:

data {
int<lower=0> N;
vector[N] y;

real a_mu; real a_sigma;
real b_mu; real b_sigma;
real c_mu; real c_sigma;
real y_sigma;
}

parameters {
real a;
real b;
}

transformed parameters {
real c = a + b;
}

model {
a ~ normal(a_mu, a_sigma);
b ~ normal(b_mu, b_sigma);
c ~ normal(c_mu, c_sigma);

if (N>2) y ~ normal(c, y_sigma);


Lets give ‘a’ and ‘b’ a normal(5, 1) prior and ‘c’ a normal(7, 1) prior and simulate ‘y’ as y=rnorm(10, 12, 1). This produces following distributions.

So for parameter ‘c’ there is orange colored prior specified as normal(7,1), but ‘c’ is also dependent on ‘a’ and ‘b’ and thus we get this yellow colored distribution if sampling without data. It has been conditioned on the model structure, but not on data. ‘Posterior w/o data’ seems like the wrong term though.

It is still a prior. But your ‘prior’ for c is kind of ‘invalid’.

What would your plot look for

parameters {
real c;
}
model {
c ~ normal(0,1);
c ~ normal(4,1);
}


Yeah, your are right, the model structure is included in prior p(\theta). For parameter ‘c’, in my last figure, the ´Posterior w/o data´ show the actual prior and what I call ‘Prior’ should be called something else.

But, for an end-user facing application I think there are three distinguishable phases: 1. prior specification on individual parameters, capturing domain knowledge 2. conditioning on a model structure, capturing modelling expertise 3. conditioning on data

1 Like

If one removes the “prior” on c the model becomes

parameters {
real a;
real b;
}

transformed parameters {
real c = a + b;
}

model {
a ~ normal(a_mu, a_sigma);
b ~ normal(b_mu, b_sigma);
}


In this case c is just a function that maps a point (a, b) to the point a + b.

Now a and b both live in spaces equipped with probability distributions, specified by the probability density functions normal(a_mu, a_sigma) and normal(b_mu, b_sigma) respectively. Consequently we can push forward the distributions over the points a and b through the function c to define a probability distribution on the output values. This distribution is completely determined by the distributions on a and b and the function c and so nothing else needs to be specified. If this is a bit too abstract – Stan will sample values of the parameters,

\tilde{a}, \tilde{b} \sim \pi(a, b) = \pi(a) \cdot \pi(b)

and then generate samples from the output space by evaluating c at those samples,

\tilde{c} = c(\tilde{a}, \tilde{b}).

Importantly, Stan does not generate the samples \tilde{a}, \tilde{b} by using exact random number generators. Instead the lines

a ~ normal(a_mu, a_sigma);
b ~ normal(b_mu, b_sigma);


defines the joint probability density function

\pi(a, b) = \mathcal{N}(a \mid a_{\mu}, a_{\sigma}) \cdot \mathcal{N}(b \mid b_{\mu}, b_{\sigma})

from which correlated samples are generated with a Markov chain. This is important because it clarifies what happens when you try to add a prior density over c.

c ~ normal(c_mu, c_sigma)


you’re not sampling values for a, b, c independently and then somehow throwing out values that are inconsistent with c = a + b. Instead the line modifies the joint distribution \pi(a, b) in a weird way, trying to compromise between the marginal normal behaviors for a and b and the summed behavior through c = a + b. You will only see the consequences of this change when generating samples through Stan, not when trying to generate them in R with rnorm.

To summarize, a Stan program just defines a joint density over the parameters. Adding sampling statements for transformed parameters just changes this joint density, and often in a weird and counter-intuitive way.

That said, examining the consequences of a prior model is a critical part to any robust Bayesian workflow. In this case you could fit the model without the line c ~ normal(c_mu, c_sigma) and plot the distribution for c. If this distribution is inconsistent with domain expertise then the prior model for a and b is insufficient and you can play around with changes to that prior model that ensure better behavior for the distribution of c. In this case, playing around with a_mu, a_sigma, b_mu, b_sigma) until you get something compatible with your domain expertise for c.

4 Likes

thanks @betanalpha, now it’s a bit clearer for me. Still uncertain about the terms though, does something like “input prior”, “joint prior”, and “posterior” make sense?

Made this Shinyapp on the ‘c = a + b’ toy example (It has to compile first so will take a while to get started):
https://rokka.shinyapps.io/shiny_priors

And yes, it’s a bit weird example. I’ll post something when I got further on my actual case study.

Typically, we have parameters \theta and data y, with p(\theta) being the prior, p(y \mid \theta) and the joint p(y, \theta) is proportional to the posterior by Bayes’s rule, p(\theta \mid y) \propto p(y, \theta) when y is fixed.

If the data y is empty, then p(y \mid \theta) is just 1, so the posterior reduces to the prior, p(\theta \mid y) = p(\theta)..

I still feel there is a need to plot two kinds of prior distributions. Let \theta=\{a,b,c\}. Then “Input prior” in the figure below are the individual distributions p(a), p(b) and p(c) (a \sim \mathcal{N}(a_{\mu}, a_{\sigma}), …); “Joint prior” is then p(\theta) which I think could be written as p(a\,|\,\theta), p(b\,|\,\theta), p(c\,|\,\theta) for the plotted marginal distributions; and “Posterior” is p(\theta)p(\theta\, |\, y)

This toy example is a bit contrived with its contradicting priors, but such situation can occur also in a more sensible model. For example for a multivariate distribution with \Omega \neq I.

I think this example is just plain out weird. If you make a \sim \text{normal}(\mu_a, \sigma_a), b \sim \text{normal}(\mu_b, \sigma_b) and c \sim \text{normal}(\mu_c, \sigma_c) but also c = a + b, all this does is to say that c has two measures - an induced or pushforward one and a “natural” one -, namely:
\pi_1(c) = \mathcal{N}(c \mid \mu_a + \mu_b, \sqrt{\sigma_a^2 + \sigma_b^2}) and \pi_2(c) =\mathcal{N}(c \mid \mu_c, \sigma_c). Since under the hood you have

target += normal_lpdf(c | a_mu + b_mu, sqrt(square(a_sigma) + square(b_sigma));
target +=  normal_lpdf(c | c_mu, c_sigma);


it turns out that the resulting induced measure, \pi^\prime(c) = \pi_1(c)\pi_2(c), is also a normal. But this could easily not be the case. In summary, I don’t understand what you mean by “joint” prior.

See this discussion for more details.

Yes, the total prior density on c can be given with those two density statments you provided. However, that would leave a and b non-identifiable.

I think @betanalpha advice, a few posts above, to update the priors on a and b until also the a plotted “pushforward” distribution of c agree with prior knowledge is the best way to go, at least in an interactive workflow.

1 Like

Ultimately there is no such thing as “input priors” or “individual” priors, just the “joint prior” and its consequences. Although we often think about prior specification parameter by parameter (due to how Bayesian inference is often taught, that approach has mathematical and practical limitations.

The mathematical problem is that the parameterization of the model configuration space is not unique, and concepts like “independence” depend on the choice of parameterization. For example in this case one might start by modeling prior density functions for a and b independently or by constraining their sum with a prior density function over c = a + b. Because there is no unique parameterization of the model configuration space there is no unique “independent prior” or “input prior”.

The practical problem is that seemingly well-behaved marginal prior densities over individual parameters can have awkward if not outright pathological consequences. It’s not enough to think about prior densities \pi(a) and \pi(b) – we have to consider the induced prior model over any function f(a, b) that might be relevant to the entire model. This requires us to think about prior specification in the context of the entire model,

\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta).

For example, let’s say that the observational model factorizes as

\pi(y \mid a, b) = \pi(y \mid f(a, b), g(a, b)).

In this case the consequence of the prior model doesn’t manifest most obviously in the prior density \pi(a, b). Instead it manifests best in the induced prior density for \pi(f, g). Defining a prior model through the density \pi(a, b) is fine, one just needs to make sure that the induced prior density \pi(f, g) is also well behaved.

In practice this can be accomplished by generatively sampling from the full model, i.e. from sampling parameter values and propagating them all the way to simulated data, and evaluating the distribution of everything in between. For example,

• Sample from prior: \tilde{a}, \tilde{b} \sim \pi(a, b)
• Evaluate intermediate functions: \tilde{f} = f(\tilde{a}, \tilde{b}), \tilde{g} = g(\tilde{a}, \tilde{b}).
• Examine sampling distribution: \tilde{f}, \tilde{g} \sim \pi(f, g).
• Simulate data and examine: \tilde{y} \sim \pi(y \mid \tilde{f}, \tilde{g}).

For example in a logistic regression one would want to examine not only ensembles of simulated binary outputs but also the ensemble of latent probabilities.

The last step is a prior predictive check, but there is as of yet a good name for the intermediate checks of the distribution over f and g. “Joint prior check”, “prior consequences check”, “prior pushforward check”, “intermediate generative checks”, etc all give the same flavor but none are quite at natural as “prior predictive”. Suggestions welcome.

1 Like

Thanks for taking the time to formulate such an informative answer. Two alternative terms that comes to mind are “induced prior check” or “prior model check”.

In my use case, prior knowledge comes from different kinds of sources (databases, user and expert domain knowledge, and possible also as pooling hyperparameters from earlier fit models), unstructured information that tends to exist in the domain of the individual parameter. There is a pre-processing conducted in “translating” this information into a probability density distributions – for example there can be numerical value and a qualitative assessment of the uncertainty, information that needs to be translated into a probability density distribution. Therefore, there is a need to also show these individual parameter distributions so that it can be assessed if assumptions made in the translation step are reasonable. It was this translation step I tried to catch with the term “input priors”.

In some cases, there will be no observational data/likelihood. Ideally, the only difference between the “prior model” and the “posterior model” is that the “prior model” is more uncertain (with distributions that are wide enough to hold the true posterior, but tight enough to be useful for decision making in case of no observational data).

Formulating a model structure where all prior information are specified on the individual parameter level (\pi(a,b) in your post) is probably a good practice to avoid counter-intuitive consequences. However, my intention to introducing hyperparameters that pools information from earlier fitted models will probably only make sense on higher level parameters/functions (\pi(f,g) in your post). But that’s a topic for another time.

The problem of dealing with conflicting induced prior information is an old one. For example, see this paper on dealing with the induced priors and “natural” priors in deterministic models. If this piques your interest, I have a lot of code (and text) on the methodology behind this “coherisation” process, logarithmic pooling. I obviously think it’s quite useful, but @betanalpha might beg to differ.

1 Like

I think there are a few slightly topics being discussed here.

On one hand you have the question “does my prior model sufficiently regularize my observational model?”. This question is important because it identifies a context and, in particular, for what parts of the prior model one needs to elicit nontrivial domain expertise; in other words your prior model doesn’t need to contain all of your domain expertise, but rather just enough to ensure reasonable inferences and decisions. That context then defines which prior pushforwards you might want to examine.

Then there is the question of pooling different, possibly inconsistent, sources of domain expertise. One option is to define multiple prior densities, each based on different sources of domain expertise, and simply multiply them together. This defines a probabilistic equivalent of a logical AND, resulting in an overall prior density that concentrates only on those model configurations consistent with all of the individual sources of domain expertise. Another option is to add the prior densities together, defining a probabilistic equivalent of a logical OR and yields an overall prior density concentrating over the model configurations supported by any of the individual prior densities. Pooling methods (or perhaps Poole-ing methods in the context of the last linked paper…) are somewhere in between these two extremes.

Personally I prefer the OR approach, although I work with less formal aggregation methods. I focus my elicitations on extremes – have each source of domain expertise identify where model configurations become extreme and then work out thresholds reasonably consistent with all of those extremes. It’s a necessarily lossy process but it avoids lots of the weird behaviors that can arise when you take each source of domain expertise too seriously.