Divergences when sampling from seemingly unrelated regressions

I have data from 200 study participants, belonging to two groups, who each completed a motor task twice, once with each hand. We derived 20 features from each task’s recording.
The purpose is, first, to identify which features differ between the two groups, and second, whether any features differ between the two hand replications.

The data are set up as one row per task (so 400 rows: 200 participants x 2 replicates) with 20 feature columns.
I want to model each feature as a function of group and hand, so I’m using the seemingly unrelated regressions example from the user’s guide, with the linear predictor reduced to a single group- and hand-specific intercept.
I’ve already added the onion initialisation, as it wasn’t sampling well without it (it kept throwing errors about rejected proposals due to a Cholesky factor of 0).

With the onion it now samples, but I get both the maximum treedepth warning (21% of transitions) and divergences (5%), and many of the correlation matrix parameters have extremely poor Rhat (235 parameters with Rhat > 1.1).
Are there any more optimisations I can use?
NB: I’ve had to hand-copy the model definition, as it’s run on a secure server, so there might be typos.

functions {
  array[] vector create_shapes(int K, real eta) {
    real alpha = eta + (K-2) / 2.0;
    array[2] vector[K-1] shape;
    
    shape[1, 1] = alpha;
    shape[2, 1] = alpha;
    for (k in 2:(K-1)) {
      alpha -= 0.5;
      shape[1,k] = k / 2.0;
      shape[2,k] = alpha;
    }

    return shape;
  }

  matrix lkj_onion(int K, row_vector l, vector R2, data array[] vector shape) {
    matrix[K,K] L = rep_matrix(0, K, K);
    {
      int start = 1;
      int end = 2;

      L[1,1] = 1.0;
      L[2, 1] = 2.0 * R2[1] - 1.0;
      L[2,2] = sqrt(1.0 - square(L[2,1]));
      for (k in 2:(K-1)) {
        int kp1 = k+1;
        row_vector[k] l_row = segment(l, start, k);
        real scale = sqrt(R2[k] / dot_self(l_row));
        L[kp1, 1:k] = l_row[1:k] * scale;
        L[kp1, kp1] = sqrt(1.0 - R2[k]);
        start = end + 1;  // move to the next row's segment of l
        end = start + k;  // the next row uses k + 1 elements: start .. start + k
      }
    }
    return L;
  }
}
data {
  real<lower=0> eta;
  int<lower=0> N_obs;
  int<lower=0> N_feats;
  int<lower=0> N_groups;
  int<lower=0> N_hands;
  array[N_obs] vector[N_feats] y;
  array[N_obs] int<lower=1, upper=N_groups> group;
  array[N_obs] int<lower=1, upper=N_hands> handid;
}
transformed data {
  array[2] vector[N_feats-1] shapes = create_shapes(N_feats, eta);
}
parameters {
  // Onion parameters
  row_vector[choose(N_feats, 2) -1] l;  // do NOT init with 0 for all elements
  vector<lower=0, upper=1>[N_feats-1] R2;  // first element is not really a R^2 but is on (0,1)

  array[N_groups, N_hands] vector[N_feats] alpha;
  vector<lower=0>[N_feats] sigma;
}
transformed parameters {
  // Onion LKJ, implies Lcorr ~ lkj(eta)
  cholesky_factor_corr[N_feats] Lcorr = lkj_onion(N_feats, l, R2, shapes);
}
model {
  for (i in 1:N_groups) {
    for (j in 1:N_hands) {
      alpha[i, j] ~ normal(0, 1);
    }
  }
  
  sigma ~ cauchy(0, 2.5);

  // Onion, implies Lcorr ~ lkj_corr_cholesky(eta)
  R2 ~ beta(shapes[1], shapes[2]);
  l ~ std_normal();

  array[N_obs] vector[N_feats] mu;
  for (i in 1:N_obs) {
    mu[i] = alpha[group[i], handid[i]];
  }
  y ~ multi_normal_cholesky(mu, diag_pre_multiply(sigma, Lcorr));
}
generated quantities {
  matrix[N_feats, N_feats] Rho;
  Rho = multiply_lower_tri_self_transpose(Lcorr);
}

Sorry this has taken so long to get to and that you had to copy this all over by hand!

I don’t see anything that can be easily optimized in that model. Have you tried a simpler model where you take independent errors across the regressions to see if that fits? I only suggest that because it can help diagnose problems elsewhere in the model that get obfuscated when things get complicated.
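
For concreteness, here’s a minimal sketch of that independent-errors check, reusing the alpha and sigma from your model; only the likelihood changes:

model {
  for (i in 1:N_obs) {
    // each feature gets its own independent normal error;
    // vectorized elementwise over the N_feats entries of y[i]
    y[i] ~ normal(alpha[group[i], handid[i]], sigma);
  }
}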

I would print the shapes that get generated to make sure the algorithm is correct. For example, the first result assigning shape[1, 1] to alpha looks suspect given that the other values are using 0.5 * k. You can also unroll the loops in create_shapes, but it doesn’t really matter for efficiency as it’ll only be done once. (You don’t need the .0 everywhere except when the other argument to a binary operator is an integer.)

I’d also test that the shapes getting generated are consistent with the R2 value. It’d be easier to just define shape1 and shape2 directly in transformed data rather than trying to pack into a 2D structure.
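
For example, a minimal sketch of that unpacked version, replacing the existing transformed data block and assuming the same recursion that create_shapes implements, with prints so you can check the values:

transformed data {
  // unpacked version of create_shapes: two plain vectors rather than
  // an array[2] of vectors
  vector[N_feats - 1] shape1;
  vector[N_feats - 1] shape2;
  {
    real a = eta + (N_feats - 2) / 2.0;
    shape1[1] = a;
    shape2[1] = a;
    for (k in 2:(N_feats - 1)) {
      a -= 0.5;
      shape1[k] = k / 2.0;
      shape2[k] = a;
    }
  }
  print("shape1 = ", shape1);  // printed once before sampling starts
  print("shape2 = ", shape2);
}

The prior then reads R2 ~ beta(shape1, shape2); with no 2D packing to get wrong.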

I’d try a tighter prior than a Cauchy on sigma.

No need to apologize at all! I was just hoping that if I’d done anything obviously wrong, someone would be able to notice.

I tried two simpler models: one with independent errors and one with a subset of the 20 features (I used 5), and both worked fine.

This makes me think the problem is due to the size of the covariance matrix: at 20 x 20 it has 20 x 19 / 2 = 190 free correlations (plus 20 scales) to estimate from only 400 observations (200 participants x 2 replicates).
Does that sound like the likely cause, in which case there’s little I can do?

I’ll have a look at your comments regarding the shapes, although I got this onion initialisation code from an older post on the forum which others said they had used successfully.

What would be a sensible tighter prior on sigma?

Yes, covariance matrices can be challenging to fit if you don’t have a lot of data. They usually have a lot of posterior uncertainty.

That’s the right experiment to do and the right conclusion. Often when the data size is small, the solution is regularization/priors of some kind. For example, since eta comes in as data here, raising it concentrates the LKJ prior near the identity matrix and shrinks the correlations toward zero.

Rather than sigma ~ cauchy(0, 2.5);, try a normal(0, 1), which acts as a half-normal given the lower bound of zero, if you think the scales are no bigger than about 3, or a normal with a different scale. The problem with the Cauchy is that when there’s not much data, it will lead to very extreme draws, which can then cause numerical issues.
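
Assuming sigma keeps its <lower=0> declaration, that’s a one-line change in the model block:

  // effectively a half-normal because of the lower bound on sigma
  sigma ~ normal(0, 1);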

I also find it easier to reason about the beta parameterized with mean (a probability) and total count, i.e., alpha / (alpha + beta) and (alpha + beta); we have that parameterization built in as beta_proportion.
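
A sketch of the equivalent prior on R2 in that parameterization, assuming shapes holds the (alpha, beta) pairs from create_shapes:

  // same distribution as R2 ~ beta(shapes[1], shapes[2]), written with
  // mean mu = a / (a + b) and total count kappa = a + b
  R2 ~ beta_proportion(shapes[1] ./ (shapes[1] + shapes[2]),
                       shapes[1] + shapes[2]);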