Perhaps this is the core confusion – we cannot always assume that we know the analytic distribution of all the marginals in the prior. Suppose I have observable quantity Y and two parameters \theta_1, \theta_2 and a model something like

Y \sim t_{5}(0, \theta_{1}), \quad \theta_{1} \sim \text{Cauchy}_{+}(0, \theta_{2}), \quad \theta_{2} \sim N_{+}(0, 1)

where t_{5}(0, \theta_{1}) is a t-distribution with \nu = 5 degrees of freedom, location 0, and scale parameter \theta_{1}; and \text{Cauchy}_{+}(0, \theta_{2}) is a Cauchy distribution with location parameter 0 and scale parameter \theta_{2}, truncated to [0, \infty). We truncate the normal distribution in the same way. (Note that this truncation does not place any extra probability mass/density at 0 – we renormalise the > 0 part of the distribution so that the area under the density function integrates to 1.)

We have specified a joint model p(Y, \theta_{1}, \theta_{2}) by specifying the conditional distributions for Y and \theta_{1} and the marginal for \theta_{2}: p(Y, \theta_{1}, \theta_{2}) = p(Y \mid \theta_{1}) p(\theta_{1} \mid \theta_{2}) p(\theta_{2}).

Sampling p(Y, \theta_{1}, \theta_{2}) either via MCMC (using `sample_prior = TRUE` in `brms`) or by standard Monte Carlo sampling (i.e. using `rnorm` to sample \theta_{2} \sim p(\theta_{2}), then using `rt(df = 1, ...)` to sample from p(\theta_{1} \mid \theta_{2}), etc.) yields, for example, samples of \theta_{1} from its marginal distribution p(\theta_{1}).

So we have samples from the marginal p(\theta_{1}), but when writing out our model we only specified the conditional p(\theta_{1} \mid \theta_{2}). To find the analytic marginal p(\theta_{1}) we need to do a touch of integration, as p(\theta_{1}) = \int p(\theta_{1} \mid \theta_{2}) p(\theta_{2}) \,\text{d}\theta_{2}. For the model written down above, this integral is painful enough that it might as well be non-analytic (meaning we do not “know” the mathematical form of the marginal distribution of \theta_{1}). So, for visualisation purposes, we use our samples from the prior and a kernel density estimate (KDE) to estimate and plot \hat{p}(\theta_{1}). This problem is not present for the “topmost” parameter \theta_{2}, because we specify its prior unconditionally.
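To make this concrete, here is a sketch of the standard Monte Carlo approach (in Python/NumPy rather than R, purely for illustration; the sample size and KDE trimming threshold are arbitrary choices): ancestral sampling from the top of the model downwards, then a KDE on the resulting draws of \theta_{1}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

# Ancestral sampling: top of the model downwards.
theta2 = np.abs(rng.normal(0.0, 1.0, n))          # theta_2 ~ N_+(0, 1)
theta1 = theta2 * np.abs(rng.standard_cauchy(n))  # theta_1 | theta_2 ~ Cauchy_+(0, theta_2)
y = theta1 * rng.standard_t(5, n)                 # Y | theta_1 ~ t_5(0, theta_1)

# KDE estimate of the analytically awkward marginal p(theta_1).
# The half-Cauchy tail is very heavy, so trim extreme draws before the KDE.
kde = stats.gaussian_kde(theta1[theta1 < 10])
print(kde(np.array([0.5, 1.0, 2.0])))  # density estimates at a few points
```

Every draw of `theta1` here is a draw from the marginal p(\theta_{1}), even though we only ever evaluated the conditional pieces of the model.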

“Computing the likelihood of all parameter values” as written would involve an infinite amount of computation – for a model with a single continuous parameter, there are infinitely many likelihood values. The posterior is a mathematical object: we start from our joint model p(Y, \theta_{1}, \theta_{2}), and then we collect/observe our data Y. We then apply the laws of conditional probability:

p(\theta_{1}, \theta_{2} \mid Y) = \frac{p(Y, \theta_{1}, \theta_{2})}{p(Y)} = \frac{p(Y \mid \theta_{1}) \, p(\theta_{1} \mid \theta_{2}) \, p(\theta_{2})}{p(Y)}

and usually we can’t compute p(Y) (for exactly the same reason we couldn’t compute the marginal p(\theta_{1}) in the prior), so we drop the equality for proportionality:

p(\theta_{1}, \theta_{2} \mid Y) \propto p(Y \mid \theta_{1}) p(\theta_{1} \mid \theta_{2}) p(\theta_{2}).

We like the RHS of this expression because we can actually evaluate it (it is just our model, even though it is only proportional to the posterior), and we have fancy algorithms (the MCMC algorithm inside Stan, for example) that generate samples from p(\theta_{1}, \theta_{2} \mid Y) using only the RHS of the above expression.
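Evaluating that RHS is just adding up log-densities, including the log 2 renormalisation constants from the truncations. A sketch in Python (the data vector `y_obs` is made up purely for illustration):

```python
import numpy as np
from scipy import stats

def log_unnorm_posterior(theta1, theta2, y):
    """log[ p(y | theta1) p(theta1 | theta2) p(theta2) ] -- the RHS, up to log p(y)."""
    if theta1 <= 0 or theta2 <= 0:  # outside the truncated support
        return -np.inf
    lp = stats.norm.logpdf(theta2) + np.log(2.0)                          # N_+(0, 1)
    lp += stats.cauchy.logpdf(theta1, loc=0, scale=theta2) + np.log(2.0)  # Cauchy_+(0, theta2)
    lp += stats.t.logpdf(y, df=5, loc=0, scale=theta1).sum()              # t_5 likelihood
    return lp

y_obs = np.array([0.3, -1.2, 0.8, 2.1, -0.4])  # made-up data, for illustration
print(log_unnorm_posterior(1.0, 1.0, y_obs))
```

This is essentially the function Stan's sampler evaluates (on the log scale) at each proposed (\theta_{1}, \theta_{2}); the unknown constant \log p(Y) cancels out of the algorithm.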

Here you are choosing to assign half of the samples to be exactly 0. This is a modelling choice. For continuous quantities like \mu it doesn’t make a lot of sense, as the probability \Pr(\mu = 0) is 0 by definition. When we truncate a continuous quantity, we renormalise the density function over the non-truncated region so that it integrates to 1. For example, when truncating a standard normal to [0, \infty), we have (as you correctly note) removed half of the probability mass, namely that on (-\infty, 0]. In the resulting half-normal distribution we renormalise by multiplying the density function by 2 over [0, \infty).
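A quick numerical check of that factor of 2 (a Python sketch; `scipy.stats.halfnorm` is exactly this renormalised distribution):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Truncate a standard normal to [0, inf): multiply the density by 2 there.
area, _ = quad(lambda x: 2.0 * stats.norm.pdf(x), 0.0, np.inf)
print(area)  # ~ 1.0, so the renormalised density is a valid density

# scipy's halfnorm is precisely this renormalised density.
x = np.linspace(0.0, 5.0, 11)
print(np.allclose(2.0 * stats.norm.pdf(x), stats.halfnorm.pdf(x)))  # True
```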

As an experiment, try sampling possible `sd`s for `mu[i]` from various distributions (gamma, inverse gamma, chi-squared, half-normal) and look at the corresponding histograms of `mu`. They will look far from normal.
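One way to run that experiment (a Python sketch; the gamma shape/scale values are arbitrary choices): sample an `sd` for each draw, sample `mu` given that `sd`, and compare the tails with a normal via the excess kurtosis (which is 0 for a normal).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000

sd = rng.gamma(shape=2.0, scale=1.0, size=n)  # one arbitrary prior on sd
mu = rng.normal(0.0, sd)                      # mu | sd ~ N(0, sd)

# A normal has excess kurtosis 0; this scale mixture is much heavier-tailed.
print(stats.kurtosis(mu))
```

Swapping in inverse-gamma or chi-squared draws for `sd` gives differently shaped, but still decidedly non-normal, marginals for `mu`.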

\mu has location/mean parameter \theta, but its actual mean/expected value is something else because of the truncation.
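For example (a Python sketch using `scipy.stats.truncnorm`; the particular \theta and \sigma values are arbitrary): a normal with location \theta truncated to [0, \infty) has expected value strictly greater than \theta, because the truncation discards the left tail.

```python
import numpy as np
from scipy import stats

theta, sigma = 1.0, 2.0
a = (0.0 - theta) / sigma  # lower truncation bound in standard units
mu_trunc = stats.truncnorm(a, np.inf, loc=theta, scale=sigma)
print(mu_trunc.mean())     # larger than the location parameter theta = 1.0
```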