I am having trouble getting the most basic of hierarchical linear models to run without divergent transitions, and I am hoping someone can get me unstuck!
I have data on around 1,000 individuals, each with several features I want to predict (e.g. performance on 10 different tasks) and several predictor variables (e.g. height, age, sex). Each individual is observed only once, but there are multiple (around 10) features per individual, all of which are predicted from the same design matrix. Here is my attempt at a hierarchical linear model:
\left[\begin{matrix} y_{\text{feat}_1} \\ y_{\text{feat}_2} \\ \vdots \end{matrix}\right] \sim N\left(\left[\begin{matrix} X\beta_{\text{feat}_1} \\ X\beta_{\text{feat}_2} \\ \vdots \end{matrix}\right], \sigma\right) \\
\beta_{\text{feat}} \sim \mathrm{MVN}(\beta_{\text{ind}}, \Sigma)
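To spell out the non-centered parameterization used in the transformed parameters block below (L is the Cholesky factor of the correlation matrix, so \Sigma = \operatorname{diag}(\sigma_{\text{feat}})\,L L^\top \operatorname{diag}(\sigma_{\text{feat}}), and \odot is elementwise multiplication), the hierarchy is written in terms of standard-normal variables:
\beta_{\text{ind}} = \sigma_{\text{ind}} \odot z_{\text{ind}}, \quad z_{\text{ind}} \sim N(0, I) \\
\beta_{\text{feat}} = \beta_{\text{ind}} + \operatorname{diag}(\sigma_{\text{feat}})\, L\, z_{\text{feat}}, \quad z_{\text{feat}} \sim N(0, I)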
data {
  int<lower=0> n_individuals;
  int<lower=0> n_features;
  int<lower=0> n_predictors;
  matrix[n_individuals, n_predictors] x;
  // y is stacked feature by feature: the first n_individuals entries are
  // feature 1, the next n_individuals entries are feature 2, and so on
  vector[n_individuals * n_features] y;
}
parameters {
  real<lower=0> sigma;
  vector<lower=0>[n_predictors] individuals_sigma;
  vector<lower=0>[n_predictors] features_sigma;
  cholesky_factor_corr[n_predictors] features_L_Omega;
  vector[n_predictors] individuals_z;
  matrix[n_predictors, n_features] features_z;
}
transformed parameters {
  // non-centered parameterization
  vector[n_predictors] individuals_beta = individuals_sigma .* individuals_z;
  matrix[n_predictors, n_features] features_beta
      = rep_matrix(individuals_beta, n_features)
        + diag_pre_multiply(features_sigma, features_L_Omega) * features_z;
}
model {
  // Priors
  target += std_normal_lpdf(sigma);
  target += std_normal_lpdf(individuals_sigma);
  target += std_normal_lpdf(features_sigma);
  target += lkj_corr_cholesky_lpdf(features_L_Omega | 2);
  target += std_normal_lpdf(individuals_z);
  target += std_normal_lpdf(to_vector(features_z));
  // Likelihood
  {
    vector[n_individuals * n_features] mu;
    for (i in 1:n_features) {
      mu[(i - 1) * n_individuals + 1 : i * n_individuals] = x * features_beta[:, i];
    }
    target += normal_lpdf(y | mu, sigma);
  }
}
Running the model with null data drawn from a standard normal distribution results in divergent transitions for about 50% of the samples (although Rhat values are all less than 1.03).
import numpy
from cmdstanpy import CmdStanModel

n_predictors = 2
n_individuals = 1000
n_features = 10
data = {
    'n_predictors': n_predictors,
    'n_individuals': n_individuals,
    'n_features': n_features,
    # design matrix: an intercept column plus standard-normal predictors
    'x': numpy.concatenate(
        [numpy.ones((n_individuals, 1)),
         numpy.random.normal(size=(n_individuals, n_predictors - 1))],
        axis=1),
    # null data: y is pure standard-normal noise, unrelated to x
    'y': numpy.random.normal(size=n_individuals * n_features),
}
model = CmdStanModel(stan_file='model.stan')  # assuming the Stan program above is saved as model.stan
fit = model.sample(data=data)
print(fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
851 of 1000 (85.10%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Rank-normalized split effective sample size satisfactory for all parameters.
The following parameters had rank-normalized split R-hat greater than 1.01:
sigma, individuals_sigma[1], individuals_sigma[2], features_sigma[1], features_sigma[2], features_L_Omega[2,1], individuals_z[1], individuals_z[2], features_z[1,1], features_z[2,1], features_z[1,2], features_z[2,2], features_z[1,3], features_z[2,3], features_z[2,4], features_z[2,5], features_z[1,7], features_z[2,7], features_z[2,8], features_z[1,9], features_z[2,9], features_z[1,10], features_z[2,10], features_beta[1,10]
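I understand the diagnostics suggest increasing adapt delta, which in CmdStanPy would presumably look like the sketch below, but since the data are pure noise I suspect the real issue is the model or its parameterization rather than the step size.
# sketch: re-run with a higher target acceptance rate, as the diagnostics suggest
fit = model.sample(data=data, adapt_delta=0.99)
print(fit.diagnose())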
Have I made a novice mistake in implementing the model? Is this not the correct model for the data I’m describing? Is there a parameterization trick that would make the model robust to null data with weakly informative priors? Any help or insight is much appreciated!