Divergent transitions in linear hierarchical model with null data

I am having trouble getting the most basic of hierarchical linear models to work without divergent transitions, and I am hoping someone can get me unstuck!

I have data on around 1,000 individuals, each with some features I want to predict (e.g. performance on 10 different tasks) and some predictive variables (e.g. height, age, sex). Each individual is observed only once, but there are multiple (around 10) features per individual, all predicted by the same design matrix. Here is my attempt at a hierarchical linear model.

\left[\begin{matrix} y_{\text{feat}_1} \\ y_{\text{feat}_2} \\ \vdots \end{matrix}\right] \sim N\left(\left[\begin{matrix} X\beta_{\text{feat}_1} \\ X\beta_{\text{feat}_2} \\ \vdots \end{matrix}\right],\ \sigma\right), \qquad \beta_{\text{feat}} \sim \text{MVN}(\beta_{\text{ind}}, \Sigma)

data {
    int<lower=0> n_individuals;
    int<lower=0> n_features;
    int<lower=0> n_predictors;
    matrix[n_individuals, n_predictors] x;
    vector[n_individuals * n_features] y;
}
parameters {
    real<lower=0> sigma;
    vector<lower=0>[n_predictors] individuals_sigma;
    vector<lower=0>[n_predictors] features_sigma;
    cholesky_factor_corr[n_predictors] features_L_Omega;
    vector[n_predictors] individuals_z;
    matrix[n_predictors, n_features] features_z;
}
transformed parameters {
    // non-centered parameterization
    vector[n_predictors] individuals_beta = individuals_sigma .* individuals_z;
    matrix[n_predictors, n_features] features_beta
        = rep_matrix(individuals_beta, n_features)
        + diag_pre_multiply(features_sigma, features_L_Omega) * features_z;
}
model {
    // Priors
    target += std_normal_lpdf(sigma);  // half-normal, given <lower=0>
    target += std_normal_lpdf(individuals_sigma);
    target += std_normal_lpdf(features_sigma);
    target += lkj_corr_cholesky_lpdf(features_L_Omega | 2);
    target += std_normal_lpdf(individuals_z);
    target += std_normal_lpdf(to_vector(features_z));

    // Likelihood
    {
        vector[n_individuals * n_features] mu;
        for (i in 1:n_features) {
            // rows (i-1)*n_individuals+1 .. i*n_individuals hold feature i
            mu[(i-1)*n_individuals+1:i*n_individuals] = x * features_beta[:, i];
        }
        target += normal_lpdf(y | mu, sigma);
    }
}

Running the model with null data drawn from a standard normal distribution results in divergent transitions for roughly half or more of the samples (85% in the run shown below), although R-hat values are all below 1.03.

import numpy
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='model.stan')  # compile the Stan program above (path is illustrative)

n_predictors = 2
n_individuals = 1000
n_features = 10
data = {
    'n_predictors': n_predictors,
    'n_individuals': n_individuals,
    'n_features': n_features,
    # design matrix: an intercept column plus standard-normal predictors
    'x': numpy.concatenate(
        [numpy.ones((n_individuals, 1)),
         numpy.random.normal(size=(n_individuals, n_predictors - 1))],
        axis=1),
    # null outcomes: pure standard-normal noise, stacked feature-major
    'y': numpy.random.normal(size=n_individuals * n_features),
}
fit = model.sample(data=data)
print(fit.diagnose())

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
851 of 1000 (85.10%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma, individuals_sigma[1], individuals_sigma[2], features_sigma[1], features_sigma[2], features_L_Omega[2,1], individuals_z[1], individuals_z[2], features_z[1,1], features_z[2,1], features_z[1,2], features_z[2,2], features_z[1,3], features_z[2,3], features_z[2,4], features_z[2,5], features_z[1,7], features_z[2,7], features_z[2,8], features_z[1,9], features_z[2,9], features_z[1,10], features_z[2,10], features_beta[1,10]

Have I made a novice mistake in implementing the model? Is this not the correct model for the data I’m describing? Is there a parameterization trick that would make the model robust to null data with weakly informative priors? Any help or insight is much appreciated!

Have you looked at a pairs plot of the draws? The bayesplot::mcmc_pairs() function is very helpful for diagnosing sampling issues.
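
Since the fit above comes from CmdStanPy, a rough Python equivalent is ArviZ's plot_pair in place of bayesplot (a minimal sketch; variable names are taken from your model, and it assumes arviz is installed):

import arviz as az

# A minimal sketch: convert the CmdStanPy fit, then draw a pairs plot
# with divergent transitions highlighted on the scatter panels.
idata = az.from_cmdstanpy(posterior=fit)
az.plot_pair(
    idata,
    var_names=['sigma', 'individuals_sigma', 'individuals_z'],
    divergences=True,
)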

The pairs plots showed divergences, but they did not seem to follow any clear pattern. I tweaked the model to use a centered parameterization for individuals_beta. After this, the pairs plot showed a funnel degeneracy between the intercept term and individuals_sigma. I removed the individuals_sigma parameter from the model and replaced it with a standard normal prior on individuals_beta, and the divergent transitions went away (see the sketch after the TL;DR below).

TL;DR: don't try to estimate the scale or variance of the group-level means.
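
Concretely, the revised blocks looked roughly like this (a sketch reconstructed from the description above; the data block and likelihood are unchanged):

parameters {
    real<lower=0> sigma;
    vector[n_predictors] individuals_beta;  // now a plain parameter
    vector<lower=0>[n_predictors] features_sigma;
    cholesky_factor_corr[n_predictors] features_L_Omega;
    matrix[n_predictors, n_features] features_z;
}
transformed parameters {
    // features_beta stays non-centered around individuals_beta
    matrix[n_predictors, n_features] features_beta
        = rep_matrix(individuals_beta, n_features)
        + diag_pre_multiply(features_sigma, features_L_Omega) * features_z;
}
model {
    target += std_normal_lpdf(sigma);
    target += std_normal_lpdf(individuals_beta);  // fixed prior replaces individuals_sigma
    target += std_normal_lpdf(features_sigma);
    target += lkj_corr_cholesky_lpdf(features_L_Omega | 2);
    target += std_normal_lpdf(to_vector(features_z));
    // likelihood as before
}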

Thanks for posting back with the solution. The general rule is that you can't estimate something from one example. Or rather, you can, but then you just get one data point and a prior. So I assume you mean you had this:

\beta_k \sim \text{normal}(\mu, \sigma)

and then, rather than giving \mu, the group-level mean, a fixed hyperprior, you used something like:

\mu \sim \text{normal}(\nu, 1)

\nu \sim \text{exponential}(1)

This isn’t going to do anything for you because there’s only one \mu. In most of these cases you can just marginalize out the \nu and wind up again with a fixed prior on \mu.
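
To make that concrete with a case where the marginalization is closed-form (swapping the exponential for a normal hyperprior, purely for illustration):

\mu \mid \nu \sim \text{normal}(\nu, 1), \qquad \nu \sim \text{normal}(0, \tau) \quad\Longrightarrow\quad \mu \sim \text{normal}\left(0, \sqrt{1 + \tau^2}\right)

That is, the hierarchy collapses to a fixed (slightly wider) prior on \mu, which is all the extra level was ever going to give you.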