For transparency this is duplicate of a question I asked on Stackoverflow here. I wasn’t convinced by the only answer that I recieved there so was hoping someone here might have some insight or might be able to point me towards some relevant literature.

I am confused as to how to interpret the population parameters of a Bayesian Hierarchical model.

This is incredibly artificial but to demonstrate the point let’s say we wanted to fit the following model + priors:

Where i is the subject index and j is a group index.

Now lets say our “question of interest” that we want to answer is “what is the population-mean of the \mu_j's ?”.

Naively I would assume that because they follow a Log-Normal distribution that the mean of the distribution is then exp(\mu + \tau^2/2) (wiki) where we just sub in the posterior samples of \mu and \tau to get an uncertainty distribution.

But this doesn’t feel right to me…

For example the posterior of \tau is not necesarily a Log-Normal distribution anymore (that was just our prior), so why would the posterior population distribution of \mu_j still be a Log-Normal distribution ? But if its not a Log-Normal distribution what do the values of \mu and \tau even mean anymore then; do they even have an interpretation if the underlying distributions are different in the posterior ?

Any guidance as to where my mental model has gone wrong would be greatly appreciated.

If I’m understanding you correctly here; you’re correct that the hyperposteriors of σ and τ are not log-normal and don’t have any specific form. That doesn’t mean that you couldn’t obtain an estimate of the mean of μ_{j} by operating on the samples as you propose, either in post-processing or in the generated quantities block.

The way I think about it, which might be a bit informal, is the the log-normal prior distribution of μ_{j} that you’ve implemented in this sort of case is essentially part of the generative model. There is no information that you could provide to this model that would contradict it. The model implements the assumption that the individual μ are drawn from a log-normal distribution, and the estimated μ are conditional on that assumption. The manifest distribution of the μ is not necessarily log-normal, but the hyperparameters σ and τ are necessarily of a log-normal distribution. You could, for example, obtain the mean of the actual values of μ across the grouping levels in this study, and this would not be equivalent to the mean of μ_{j} (but would be drawn towards it by shrinkage).

Thank you for your thoughts! In particular the idea of viewing the distribution of \mu_j as part of the generative model makes some intuitive sense to me.

I just wanted to add though that I’m not sure that it is appropriate to calculate the mean of \mu_j in post processing as that won’t give you an uncertainty estimate only a point estimate. For example if you had infinite observations but on only 2 groups then the posterior samples of \mu_1 and \mu_2 would essentially be constant and thus mean(\mu_1, \mu_2) would also be a constant. But ultimately we only have 2 groups or 2 observations from our population distribution so it doesn’t make sense to me to have no variability of our population mean.

I don’t see any reason why only a point estimate of the mean of \mu_{j} would be available, in your example.

\mu_{j} is a log-normal distribution so its mean is, as you stated, a function of its location and scale. The location and scale are sampled, so you can obtain a posterior distribution for the mean of \mu_{j} by generating it from the location and scale from each sample. In your comment you’ve moved from the mean of \mu_{j} (the distribution of group effects), to the mean of some group level estimates of \mu, which are different quantities.

Just to zoom out for a moment: it is true in general for MCMC methods that you can propagate posterior uncertainty through arbitrary functions of parameters by computing those functions iteration-wise over the posterior samples. In your case, you have draws from the joint posterior distribution for \mu and \tau, and some function M =g(\mu, \tau) = e^{\mu + \frac{\tau^2}{2}}, and you want draws from the posterior distribution for M, which is given by computing g iteration-wise over \mu and \tau.

Apologies I had interpreted this to mean that you were proposing that you calculate the mean for \mu_j in post processing via calculating \text{mean}_k(\mu_{1k}, \mu_{2k}, \mu_{3k}). But yes I see and agree with you in terms of calculating the population mean using the samples for \tau and \mu.

As an aside I just wanted to mention that someone replied to the original stack exchange thread with a fairly compelling mathematical answer for why the correct approach is to use the samples for \tau and \mu to calculate the population mean of the \mu_j's.