Divergent transitions in linear hierarchical model with null data

I am having trouble getting the most basic of hierarchical linear models to work without divergent transitions, and I am hoping someone can get me unstuck!

I have data on around 1,000 individuals, each with some features I want to predict (e.g. performance on 10 different tasks) and some predictive variables (e.g. height, age, sex). Each individual is observed only once, but there are multiple (around 10) features per individual, all predicted by the same design matrix. Here is my attempt at a hierarchical linear model.

\left[\begin{matrix} y_{\text{feat}_1} \\ y_{\text{feat}_2} \\ \vdots \end{matrix}\right] \sim N\left(\left[\begin{matrix} X\beta_{\text{feat}_1} \\ X\beta_{\text{feat}_2} \\ \vdots \end{matrix}\right],\ \sigma\right), \qquad \beta_{\text{feat}} \sim \text{MVN}(\beta_{\text{ind}}, \Sigma)

data {
    int<lower=0> n_individuals;
    int<lower=0> n_features;
    int<lower=0> n_predictors;
    matrix[n_individuals, n_predictors] x;
    vector[n_individuals * n_features] y;
}
parameters {
    real<lower=0> sigma;
    vector<lower=0>[n_predictors] individuals_sigma;
    vector<lower=0>[n_predictors] features_sigma;
    cholesky_factor_corr[n_predictors] features_L_Omega;
    vector[n_predictors] individuals_z;
    matrix[n_predictors, n_features] features_z;
}
transformed parameters {
    // non-centered parameterization
    vector[n_predictors] individuals_beta = individuals_sigma .* individuals_z;
    matrix[n_predictors, n_features] features_beta
        = rep_matrix(individuals_beta, n_features)
        + diag_pre_multiply(features_sigma, features_L_Omega) * features_z;
}
model {
    // Priors
    target += std_normal_lpdf(sigma);  // half-normal, given <lower=0>
    target += std_normal_lpdf(individuals_sigma);
    target += std_normal_lpdf(features_sigma);
    target += lkj_corr_cholesky_lpdf(features_L_Omega | 2);
    target += std_normal_lpdf(individuals_z);
    target += std_normal_lpdf(to_vector(features_z));

    // Likelihood
    {
        vector[n_individuals * n_features] mu;
        for (i in 1:n_features) {
            // rows (i-1)*n_individuals+1 .. i*n_individuals hold feature i
            mu[(i-1)*n_individuals+1:i*n_individuals] = x * features_beta[:, i];
        }
        target += normal_lpdf(y | mu, sigma);
    }
}

Running the model with null data drawn from a standard normal distribution results in divergent transitions for roughly half or more of the samples (85% in the run shown below), although R-hat values are all below 1.03.

import numpy
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='model.stan')  # compile the Stan program above (path is illustrative)

n_predictors = 2
n_individuals = 1000
n_features = 10
data = {
    'n_predictors': n_predictors,
    'n_individuals': n_individuals,
    'n_features': n_features,
    # design matrix: an intercept column plus standard-normal predictors
    'x': numpy.concatenate(
        [numpy.ones((n_individuals, 1)),
         numpy.random.normal(size=(n_individuals, n_predictors - 1))],
        axis=1),
    # null outcomes: pure standard-normal noise, stacked feature-major
    'y': numpy.random.normal(size=n_individuals * n_features),
}
fit = model.sample(data=data)
print(fit.diagnose())

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
851 of 1000 (85.10%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma, individuals_sigma[1], individuals_sigma[2], features_sigma[1], features_sigma[2], features_L_Omega[2,1], individuals_z[1], individuals_z[2], features_z[1,1], features_z[2,1], features_z[1,2], features_z[2,2], features_z[1,3], features_z[2,3], features_z[2,4], features_z[2,5], features_z[1,7], features_z[2,7], features_z[2,8], features_z[1,9], features_z[2,9], features_z[1,10], features_z[2,10], features_beta[1,10]

Have I made a novice mistake in implementing the model? Is this not the correct model for the data I’m describing? Is there a parameterization trick that would make the model robust to null data with weakly informative priors? Any help or insight is much appreciated!

Have you looked at a pairs plot of the draws? The bayesplot::mcmc_pairs() function is very helpful for diagnosing sampling issues.
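
Since the fit above comes from CmdStanPy, a rough Python equivalent is ArviZ's plot_pair in place of bayesplot (a minimal sketch; variable names are taken from your model, and it assumes arviz is installed):

import arviz as az

# A minimal sketch: convert the CmdStanPy fit, then draw a pairs plot
# with divergent transitions highlighted on the scatter panels.
idata = az.from_cmdstanpy(posterior=fit)
az.plot_pair(
    idata,
    var_names=['sigma', 'individuals_sigma', 'individuals_z'],
    divergences=True,
)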

The pairs plots showed divergences, but they did not seem to follow any clear pattern. I tweaked the model to use a centered parameterization for individuals_beta. After this, the pairs plot showed a funnel degeneracy between the intercept term and individuals_sigma. I removed the individuals_sigma parameter from the model and replaced it with a standard normal prior on individuals_beta, and the divergent transitions went away (see the sketch after the TL;DR below).

TL;DR: don't try to estimate the scale or variance of the group-level means.
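
Concretely, the revised blocks looked roughly like this (a sketch reconstructed from the description above; the data block and likelihood are unchanged):

parameters {
    real<lower=0> sigma;
    vector[n_predictors] individuals_beta;  // now a plain parameter
    vector<lower=0>[n_predictors] features_sigma;
    cholesky_factor_corr[n_predictors] features_L_Omega;
    matrix[n_predictors, n_features] features_z;
}
transformed parameters {
    // features_beta stays non-centered around individuals_beta
    matrix[n_predictors, n_features] features_beta
        = rep_matrix(individuals_beta, n_features)
        + diag_pre_multiply(features_sigma, features_L_Omega) * features_z;
}
model {
    target += std_normal_lpdf(sigma);
    target += std_normal_lpdf(individuals_beta);  // fixed prior replaces individuals_sigma
    target += std_normal_lpdf(features_sigma);
    target += lkj_corr_cholesky_lpdf(features_L_Omega | 2);
    target += std_normal_lpdf(to_vector(features_z));
    // likelihood as before
}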

Thanks for posting back with the solution. The general rule is that you can't estimate something from one example. Or rather, you can, but then you just get one data point and a prior. So I assume you mean you had this:

\beta_k \sim \text{normal}(\mu, \sigma)

and then, rather than giving \mu, the group-level mean, a fixed hyperprior, you used something like:

\mu \sim \text{normal}(\nu, 1)

\nu \sim \text{exponential}(1)

This isn’t going to do anything for you because there’s only one \mu. In most of these cases you can just marginalize out the \nu and wind up again with a fixed prior on \mu.
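
To make that concrete with a case where the marginalization is closed-form (swapping the exponential for a normal hyperprior, purely for illustration):

\mu \mid \nu \sim \text{normal}(\nu, 1), \qquad \nu \sim \text{normal}(0, \tau) \quad\Longrightarrow\quad \mu \sim \text{normal}\left(0, \sqrt{1 + \tau^2}\right)

That is, the hierarchy collapses to a fixed (slightly wider) prior on \mu, which is all the extra level was ever going to give you.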