Hierarchical models: internal inconsistency in estimates of standard deviation

Hello to all,

I am working with simulated data to arrive at the most efficient formulation of a hierarchical model. Along the way I made a general observation, and I’d be glad to find an explanation:

Assume I have an observation-level parameter theta[i] that varies for each observation i (1 to 100). I assume theta[i] is drawn from a population-level distribution with parameters mu_theta and sigma_theta. Since 100 is a large enough number and I am working with an extremely basic model, the estimate of sigma_theta should be roughly equal to the standard deviation of the vector (theta[1], theta[2], … , theta[100]). However, the latter is consistently much smaller than the former: the estimated observation-level parameters are more closely spaced than the estimates of the population-level parameters indicate (hence I call it an “internal inconsistency”).

Is there an explanation for this phenomenon? I also observe that the estimate of sigma_theta is consistently correct, which I can confirm since my data is simulated. However, the observation-level estimates theta[i] are more closely spaced than both sigma_theta and the true values would suggest. I can understand why the estimates of theta[i] are closer together than the true parameters (due to the disciplining by priors, correlations in hierarchical models, etc.). But ought they not be consistent with sigma_theta, which is being estimated at the same time?

Lastly, I observe this for both centred and non-centred parametrizations and for different choices of priors.

Since you are working with simulated data, can you show your code for the data simulation and model fitting in a reproducible example?


Thanks for your reply. I have just discovered a theoretical reason behind this:

Compared to the population distribution of a parameter, the posterior distribution of that parameter at the observation level differs depending on the data collected for each observation. Consider the two extreme cases: with a large amount of observation-level data, the observation-level posterior converges to a distribution centred on the “true” value; with no observation-level data, it simply coincides with the population-level distribution.

In the former case, the standard deviation (SD) of the observation-level estimates coincides with that of the population distribution. In the latter case, the SD of the estimates should be ~zero, since every observation’s posterior collapses to the same population distribution and hence to the same mean. Realistic estimates of the SD should therefore lie between 0 and the SD of the population (in some sense this reflects how informative the observation-level data are). I’m now at peace with the phenomenon :)
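To illustrate (and since a reproducible example was requested): here is a minimal sketch in Python, not my actual model. I treat the hyperparameters as known and use the conjugate normal posterior in place of a full MCMC fit; all names and numbers are my own.

```python
# Minimal sketch of the shrinkage effect (not my actual model):
# theta_i ~ Normal(mu_theta, sigma_theta); each individual i yields n_obs
# measurements with noise SD sigma_y, summarized by the sample mean y_bar[i].
# Treating the hyperparameters as known, the conjugate posterior mean of
# theta_i is a precision-weighted average of mu_theta and y_bar[i].
import numpy as np

rng = np.random.default_rng(42)
mu_theta, sigma_theta, sigma_y = 0.0, 1.0, 2.0
n_units, n_obs = 100, 5

theta = rng.normal(mu_theta, sigma_theta, size=n_units)             # true theta_i
y_bar = theta + rng.normal(0.0, sigma_y / np.sqrt(n_obs), n_units)  # data summary

prec_prior = 1.0 / sigma_theta**2
prec_data = n_obs / sigma_y**2
theta_hat = (prec_prior * mu_theta + prec_data * y_bar) / (prec_prior + prec_data)

print("true sigma_theta:      ", sigma_theta)
print("SD of true theta_i:    ", theta.std(ddof=1))
print("SD of posterior means: ", theta_hat.std(ddof=1))  # noticeably smaller
```

With these numbers the SD of the posterior means comes out around 0.75 rather than 1, even though sigma_theta itself would still be recovered correctly by the full fit.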

I’m a little confused by the language you’re using to talk about these parameters, but I’m assuming that what you are calling “population level” is a prior. In general, posteriors are more concentrated than priors in well-formulated models. If posteriors matched priors, we wouldn’t need to perform posterior inference!
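As a toy illustration of that concentration (my own numbers; the conjugate case of a normal mean with known noise SD):

```python
# Prior vs. posterior SD for a normal mean with known noise SD:
# the posterior is far more concentrated than the prior.
import numpy as np

prior_sd, noise_sd, n = 10.0, 1.0, 20
post_sd = 1.0 / np.sqrt(1.0 / prior_sd**2 + n / noise_sd**2)
print(prior_sd, post_sd)  # prior SD 10.0 vs. posterior SD ~0.22
```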


Thank you for the comment. I shall try to explain myself better.

In fitting a hierarchical model, let theta be a univariate parameter of interest. To model heterogeneity across individuals (say 100 individuals), assume that theta_i (parameter theta for individual i) is drawn from a normal distribution with parameters mu_theta and sigma_theta, i.e.:

theta_i ~ Normal(mu_theta, sigma_theta)

Assume we have uninformative priors for mu_theta and sigma_theta.

Ultimately, in a hierarchical model, we derive posteriors for mu_theta, sigma_theta, and theta_i (i varying from 1 to 100). I am now focusing on the means of these posteriors; call them mu_theta_hat, sigma_theta_hat, theta_i_hat. My observation was the following:

SD(theta_i_hat) < sigma_theta_hat

The standard deviation across the individual estimates is substantially lower than the estimated standard deviation of the distribution from which these parameters are drawn.

Hope that makes it clearer.
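To make the inequality concrete, here is the closed form in the simplest conjugate case (n data points per individual with known noise SD sigma_y; a simplification of my actual model). The posterior mean for individual i is

theta_i_hat = mu_theta + kappa * (ybar_i - mu_theta), where kappa = sigma_theta^2 / (sigma_theta^2 + sigma_y^2 / n)

so that, across individuals,

SD(theta_i_hat) ≈ kappa * SD(ybar_i) = kappa * sqrt(sigma_theta^2 + sigma_y^2 / n) = sqrt(kappa) * sigma_theta

Since 0 < kappa < 1, the SD of the individual estimates runs from 0 (n = 0) up to sigma_theta (n -> infinity), matching the reasoning above.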

Can you be more specific about what you mean by “uninformative”? If you take very broad but proper priors like BUGS’s inv_gamma(epsilon, epsilon), you will see just the effect you mention, because most of the prior mass is well above small data values. @andrewgelman talks about this in the following paper and also in the Bayesian Data Analysis book:
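To see numerically how top-heavy such a prior is, here is a quick scipy check (the value of epsilon and the thresholds are my own choices):

```python
# How much mass does inv_gamma(eps, eps) put above typical data scales?
from scipy.stats import invgamma

eps = 0.001
prior = invgamma(a=eps, scale=eps)
for x in (1.0, 10.0, 100.0):
    print(f"P(variance > {x:g}) = {prior.sf(x):.3f}")  # ~0.99 in each case
```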