I estimated a multilevel model as the following y_{it} = \beta_{0i} + x'\beta + \varepsilon_{it} . I modelled individual specific parameters as follows \beta_{0i} using the non-centered parameterization as follows:

parameters {
// Define parameters to estimate
// Population intercept (a real number)
real beta_0;
// Level-2 random effect
real u_0jkz[Nj];
real<lower=0> sigma_u0jk;
}
transformed parameters {
// Varying intercepts
real beta_0j[Nj];
real u_0jk[Nj];
for (s in 1:Nj) {
u_0jk[s] = sigma_u0jk * u_0jkz[s];
}
for (j in 1:Nj) {
beta_0j[j] = beta_0 + u_0jk[j];
}
}
model {
// Prior part of Bayesian inference
// Random effects distribution
u_0jkz ~ normal(0,1);
sigma_u0jk ~ student_t(3, 0, 10);
}
}

By putting the half t-distributed prior sigma_u0jk ~ student_t(3, 0, 10); I thought I restricted the \sigma values from becoming very large (as stated in Prior distributions for variance parameters in hierarchical models from Andrew Gelman).
However, the posterior mean of sigma_u0jk is around 100. Nevertheless, the predictions resulting from this model (which is the main goal of this model are pretty good). Can someone explain why this happens?

A half t distribution with only three degrees of freedom is much, much heavier tailed than you might think. To build some intuition try fitting your model without any data and see what the prior mean is. And at the same time look at the marginal quantiles, such as the 10% and 90% quantiles, to get a feel for the breadth of the distribution.

@betanalpha Thanks for your reply. Do you maybe know why several papers refer to this distribution in hierarchical models, for example even the brms package also uses this distribution? Furthermore, I do not really know how to interpret the model with such large standard deviation in random parameter. Since predictive accuracy is fine and also plots of the posterior predictive densities are fine, so the model on itself should be fine right?

EDIT: I also tried the model with different kind of prior specifications (such as inverted gamma(0.1, 0.1) and normal(0,1)) distributions and found that both the WAIC as well as the predictive accuracy in terms of RSME of the model using the t-distribution with three degrees of freedom was best

Unfortunately many people followed a recommendation from Gelman (in a paper discussion) without really considering its overall implications.

Others will disagree with me on this point, but I think that predictive accuracy is a poor way to judge a model. In this case the heavier tails will by definition increase predictive accuracy (because your predictive distribution will diffuse to cover more possibilities) but you will also lose precision in subsequent predictive tasks.

But let me flip it around and offer a hypothetical question. Why were you so worried about the large standard deviation you saw in your fit? If it’s because it conflicts with your domain expertise then your prior isn’t consistent with your domain expertise and you might want to move to a lighter tailed prior that is more consistent with your domain expertise. The inverted Gamma that you used is no good – a better approach is to increase the degrees of freedom of the Student-t density.

@betanalpha Thanks for your great answer! Really appreciated!

The main goal of my multilevel model is to be able to make predictions for individuals at a certain moment in time. Also, I want the model to be as widely applicable as possible (so for as many individuals as possible). This is also why I opted for a bayesian multilevel model. I was not really worried about the large standard deviation, because I do not have real domain expertise, but it just stand out when assessing convergence of my model.

Do you then also consider the WAIC criterion to be not really useful? Also not when the main goal is making predictions?

Lastly, could you maybe explain a little bit more what you mean with this part?

you will also lose precision in subsequent predictive tasks.

I took this to mean that by allowing your data to have greater influence over this model (as opposed to imposing stronger priors) you will inevitably have better within-sample prediction as the data is having a greater say in your parameter estimates. This within-sample prediction accuracy can be at the expense of overfitting and poor out-of-sample performance.

Domain expertise can be surprisingly simple. Once you set reasonable units even mild uncomfortable about infinity can be surprisingly powerful information.

In general I do not find WAIC (or any other information criterion) or LOO to be helpful. If I’m making predictions I first want to determine what I want to predict and how I want to use those predictions before evaluating the predictions. For example, I might need only a rough sense of location and can ignore the tails of the predictive distribution.

Larger posterior uncertainty means wider predictive distributions means more consistency with the observed data because the wide distribution will overlap with most observations! But that isn’t super helpful because those uncertain predictions will be hard to utilize in practice.

In my application I am using point estimations of the posterior predictive distributions to calculate RMSE/MAE accuracy measures. These are considerably lower than in my model (with the large standard deviations in the levels) than using prior with shorter tails. Therefore, I conclude that using my approach is defensible. Do you have any opinion about that? Am I missing something?