Hierarchical models are often problematic, but assuming that any problems are due to hierarchal structure just leads people down unproductive rabbit holes. In practice one shouldn’t attempt any of the usual strategies for moderating the degeneracies inherent to hierarchical models (in particular normal hierarchal models) without first verifying that those degeneracies are the problem.
In this case running the fit doesn’t show any divergences nor E-FMI warnings, which indicates that there likely isn’t any hierarchical structure at fault here. This is consistent with the fact that with 50 observations in each factor level, and a relatively small measurement variability, the realized likelihood function should be narrow enough to inform all of the factor level parameters strongly enough that the centered parameterization assumed in the model is optimal. It’s straightforward to verify by removing the hierarchal priors entirely and running again – the same problems arise!
So what else could be going on? The effective sample size warnings should be ignored until all other diagnostics are clear, which means that we prioritize the Rhat warnings. Because these show up for all of the parameters it appears that the four Markov chains run by default in Stan aren’t overlapping much at all!
That could be due to metastability, for example metastability caused by multiple modes in the posterior distribution, or it could be due to slow exploration caused by poor adaptation. Because there are also maximum treedepth warnings the latter is more suspect.
Looking at the number of leapfrog steps across all four chains
steps <- do.call(rbind, get_sampler_params(mdl.traitonly, inc_warmup=FALSE))[,'n_leapfrog__']
counts <- as.data.frame(table(steps))
colnames(counts) <- c("Leapfrog Steps", "Counts")
print(counts, row.names=FALSE)
we can see that the sampler is utilizing very long numerical trajectories. This suggests either a very degenerate posterior density function or poor adaptation.
To investigate the adaption further we have to look at the adaptation Hamiltonian Monte Carlo configuration for each chain. In RStan we can look at the adapted step sizes with
sapply(1:4, function(c) get_sampler_params(mdl.traitonly, inc_warmup=FALSE)[[c]][,'stepsize__'][1])
which shows some variation between the chains.
We also want to look at the adapted inverse metric elements,
info <- c()
for (c in 1:4) {
chain_info <- get_adaptation_info(mdl.traitonly)[[c]]
info <- c(info, as.list(strsplit(sub("\n$", "", sub("^#.*\n# ", "", chain_info)), ", ")))
}
inv_metric_elem <- data.frame(info)
colnames(inv_metric_elem) <- sapply(1:4, function(c) paste("Chain", c))
inv_metric_elem
This reveals even stronger variations in the adaptation arrived at from each Markov chain.
This provides more evidence that Stan is having trouble adapting Hamiltonian Monte Carlo well enough, and each trajectory is exploring only the very local neighborhood around each initial point. This is consistent with the trace plots, for example
unpermuted_samples <- extract(mdl.traitonly, permute=FALSE)
par(mfrow=c(2, 2))
for (c in 1:4) {
plot(1:1000, unpermuted_samples[,c,2], type="l", lwd=1, col=c_dark,
main=paste("Chain", c),
xlab="Iteration", xlim=c(1, 1000),
ylab="mu_grand", ylim=c(0, 20))
}
Okay, so why is the adaptation ending up in such a bad sampler configuration? In this case I think that, perhaps ironically, it’s due to the glut of data! With so much data the posterior is extremely narrow, and with a population scale of 10 for the individual species and study parameters the narrow posterior is likely to concentrate outside of the [-2, 2] initialization window that Stan uses by default. This means that Stan is trying to initialize very far away from the posterior typical set which makes the adaptation problem really, really hard.
Simulating new data with a measurement variability of 10 instead of 1 provides supporting evidence for this. The larger variability results in wider likelihood functions, and hence wider posterior density functions, that more strongly overlap with the default initialization window.
Here we can initialize around the point estimates, for example
init_mu_grand <- mean(trt.dat$yTraiti)
init_muSp <- mean(trt.dat$yTraiti[trt.dat$species == n])) - init_mu_grand
init_muStudy <- trt.dat$yTraiti[trt.dat$study == n])) - init_mu_grand
which should drastically reduce the adaptation difficulty, at least if my hypothesis is correct. For more complicated observational models this prescient initialization might not be possible.