Divergent transitions with proportion covariate

Hi Stan community,

I have been working with a model that I know works well on other datasets (observed and simulated). I am running into divergent transitions when fitting the model with a covariate that is a proportion ranging from 0 to 1 with an irregular distribution (see figure below). Apart from the newly added covariate, the model and datasets are identical, so the divergences are likely caused by the slope of the proportion covariate or by how it interacts with some other part of the model. Given the irregular distribution of the covariate, it may be that Stan struggles to adapt a step size that can explore both the sparser and denser regions of the parameter space.

I fit the model on standardised data with ~5000 observations. I looked at the pairs plots but did not find any major irregularities. Next, I reparameterised the slope parameter as described in the Stan User's Guide (see model below). I also fitted a simplified version of the model in brms, which produced no divergent transitions. That could of course be because the simplified model lacks some complexity that interacts with the slope of the proportion covariate, but it suggests the problem can be remedied by some change in the model specification. Unfortunately I cannot use the brms model itself, because it does not allow some of the specifications I need. I am hopeful this can be resolved by adjusting the model specification, e.g. different priors or another way of reparameterising. Any recommendations I could try out would be much appreciated!
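For reference, the reparameterisation in the model below is the non-centred form

\psi = \mu_\psi + \sigma_\psi \, \psi_{raw}, \qquad \psi_{raw} \sim \text{normal}(0, 1),

which implies \psi \sim \text{normal}(\mu_\psi, \sigma_\psi) marginally.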

All the best,

Corné


data {
  // Number of clusters (an integer)
   int<lower=0> No; // number of observations 
   int<lower=0> Ni; //number of individuals
   int<lower=1> n_trials; // Number of trials
   int<lower=1> n_groups; // Number of groups
  // Clusters identifiers (an integer)
   int<lower=1> individual[No]; // individual ID for each repeated obs
   int<lower=1> opponent1[No];  // ID of the first opponent
   int<lower=1> opponent2[No];  // ID of the second opponent
   int<lower=1> trialID[No];  //  TrialID of each observation
   int<lower=1> groupID[No];  //  GroupID of each observation
 // Continuous variables
    vector[No] xjk;  // observed covariate (opponent trait)
    vector[No] z;  // phenotypic observations
 }
 parameters {
// Define parameters to estimate
  // Parameters for model of phenotype
     real mu_z; // intercept

     real mu_psi;
     real<lower=0> sigma_psi;
     real psi_raw;

   // Random effects
   matrix[Ni,3] zI; // standardised intercepts, slopes, and res_impact per individual
   vector<lower=0>[3] sigma_I; // sd of intercepts, slopes, res_impact
   cholesky_factor_corr[3] L;  // Cholesky factor to estimate correlations
   // Other random effects
   real<lower=0> sigma_trial;
   real<lower=0> sigma_group;
   vector[n_trials] zTrial_I;
   vector[n_groups] zGroup_I;
   // Residual dispersion
   real<lower=0> sigma_e;
 }
 transformed parameters{
    real psi = mu_psi + sigma_psi * psi_raw; // Reparameterise psi (non-centred)

    matrix[Ni,3] I; // Unscaled BLUPs: intercept, slope, and res_impact per individual
    I = zI * diag_pre_multiply(sigma_I, L)'; // get the unscaled values
    vector[n_trials] trial_I = zTrial_I * sigma_trial; // unscaled trial random effects
    vector[n_groups] group_I = zGroup_I * sigma_group; // unscaled group random effects
    vector[No] e_z; // predicted values for phenotype
    
// Estimate Mui, Psii and epsiloni
     e_z  = mu_z  +  I[individual,1] + (psi + I[individual,2]).*xjk + 
               I[opponent1,3] + I[opponent2,3] +
               trial_I[trialID] + group_I[groupID];
}
model {
   // Priors fixed effects
    mu_z ~ normal(0,1);
    psi_raw ~ normal(0,1);
    mu_psi    ~ normal(0, 1);
    sigma_psi ~ exponential(3);

   // Priors random effects
    to_vector(zI) ~ normal(0,1);
    to_vector(zTrial_I) ~ normal(0, 1);
    to_vector(zGroup_I) ~ normal(0, 1);
    to_vector(sigma_I) ~ exponential(3);
    sigma_trial ~ exponential(3);
    sigma_group ~ exponential(3);
    L ~ lkj_corr_cholesky(1.5);
    sigma_e ~ exponential(3);

   // Likelihood functions
    z ~ normal(e_z, sigma_e); // likelihood for the phenotypic equation
}

How many? If the number of divergent transitions is really small, the problem might be solved by increasing the adapt_delta argument. If increasing that doesn't solve it, then either reparameterization or more thought on the priors may be needed.

I am fitting the model with 5 chains, 2000 warmup and 3000 sampling iterations, and I get about 1500-2500 divergent transitions after warmup. With adapt_delta = 0.99 the number of divergences is reduced to ~500, but then many transitions exceed the maximum treedepth. I have also tried log-transforming the covariate, and binning it into 5 categories (i.e. 0-20, 20-40, and so on), but to no avail. The prior I currently have on the slope is normal(0,1), since my response variable is standardised; I also tried a tighter normal(0,0.5) prior, but that did not help much either. Is there a different prior or another method I could try, given that I have already reparameterised the slope?

With this information, I could already have told you that changing adapt_delta wouldn't help.

I didn't have time to look at the code more carefully, but maybe next week. Hopefully someone else has time to help before then.

I was able to get this model running with the same covariate for two other response variables of interest. This worked with the original, non-reparameterised model (pasted below). The response variable I originally posted about produces divergent transitions with both the reparameterised-slope model and the non-reparameterised one. The other two response variables produce 1000+ divergent transitions with the reparameterised-slope model, which likely indicates something is mis-specified in that model. Although my problem is solved for the time being (at least for the other two variables), I am still curious how to properly reparameterise a slope for future reference, since I followed the (admittedly outdated) example for a normally distributed parameter in the Stan User's Guide.

All the best,

Corné

data {
//Phenotype dataset
  // Number of clusters (an integer)
   int<lower=0> No; // number of observations (rows) for phenotypes
   int<lower=0> Ni; //number of individuals
   int<lower=1> n_trials; // Number of trials
   int<lower=1> n_groups; // Number of groups
  // Clusters identifiers (an integer)
   int<lower=1> individual[No]; // individual ID for each repeated obs
   int<lower=1> opponent1[No];  // ID of the first opponent
   int<lower=1> opponent2[No];  // ID of the second opponent
   int<lower=1> trialID[No];  //  TrialID of each observation
   int<lower=1> groupID[No];  //  GroupID of each observation
 // Continuous variables
    real xjk[No];  // observed covariate (opponent trait) measured with error
    real z[No];  // phenotypic observations
 }
 parameters {
// Define parameters to estimate
  // Parameters for model of phenotype
     real mu_z; //intercept
     real psi; // slope to partner phenotype

   // Random effects
   matrix[Ni,3] zI; //(intercepts and slopes, res_impact for each individual)
   vector<lower=0>[3] sigma_I; // sd  intercepts, slopes, res_impact
   cholesky_factor_corr[3] L;  // factor to estimate covariance int-slopes
  // Other random effects
  real<lower=0> sigma_trial;
  real<lower=0> sigma_group;
  vector[n_trials] zTrial_I;// standardised to speed up computation
  vector[n_groups] zGroup_I;
  // Residual dispersion
   real<lower=0> sigma_e;
 }
 transformed parameters{
    matrix[Ni,3] I; // Unscaled BLUPs: intercept, slope, and res_impact per individual
    I = zI * diag_pre_multiply(sigma_I, L)'; // get the unscaled values
    vector[n_trials] trial_I  =  zTrial_I * sigma_trial; // Get the unscaled values for random effect
    vector[n_groups] group_I  =  zGroup_I * sigma_group; 
    vector[No] e_z; // predicted values for phenotype
    // Estimate Mui, Psii and epsiloni
    for (o in 1:No) 
     e_z[o]  = mu_z  +  I[individual[o],1] + (psi + I[individual[o],2])*xjk[o] + 
               I[opponent1[o],3] + I[opponent2[o],3] +
               trial_I[trialID[o]] + group_I[groupID[o]];
}
model {
   // Priors fixed effects
    mu_z ~ normal(0,1);
    psi ~ normal(0,1);

   // Priors random effects
    to_vector(zI) ~ normal(0,1);
    to_vector(zTrial_I) ~ normal(0, 1);
    to_vector(zGroup_I) ~ normal(0, 1);
    to_vector(sigma_I) ~ exponential(3);
    sigma_trial ~ exponential(3);
    sigma_group ~ exponential(3);
    L ~ lkj_corr_cholesky(1.5);

    sigma_e ~ exponential(3);

   // Likelihood functions
    z ~ normal(e_z, sigma_e); // likelihood for the phenotypic equation
}
generated quantities {
  real<lower=0> var_mu = sigma_I[1]^2;
  real<lower=0> var_psi = sigma_I[2]^2;
  real<lower=0> var_epsilon = sigma_I[3]^2;
  real<lower=0> var_trial = sigma_trial^2;
  real<lower=0> var_group = sigma_group^2;
  real<lower=0> var_res = sigma_e^2;
  real<lower=0> var_P = var_mu + var_psi + (var_epsilon * 2) + var_group + var_trial + var_res;
  matrix[3, 3] Omega_P = L * L'; // correlation matrix recovered from its Cholesky factor
}
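For reference, the decomposition in generated quantities corresponds to

\sigma^2_P = \sigma^2_\mu + \sigma^2_\psi + 2\,\sigma^2_\epsilon + \sigma^2_{group} + \sigma^2_{trial} + \sigma^2_{res},

where the opponent res_impact variance \sigma^2_\epsilon is counted twice because each observation contains two opponent effects (I[opponent1,3] and I[opponent2,3]).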

I could be wrong, but I think the main problem with the model in your original post is the hierarchical prior on \psi.

The likelihood depends only on \psi, but in the model from your original post, \psi = \mu_\psi + \sigma_\psi \psi_{raw}. In the terminology I learned from Statistical Rethinking, this makes \psi a collider. I don't think that by itself guarantees divergences, but the normal hierarchical model has “degeneracies” (see this amazing article). In particular, here's a quote from @betanalpha:

In other words if we observe only a small number of contexts then the population parameters will exhibit a funnel degeneracy no matter how much data we have within each context. For finite data the problem is even worse: the funnel degeneracy can manifest even for large K if there isn’t enough data within each context to sufficiently inform the individual likelihood functions.

Since \psi enters the likelihood in only one place, you essentially have a single context.
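If that is the issue, the simplest fix is to drop the hierarchical layer and put the prior directly on \psi, which is exactly what your non-reparameterised model does. A minimal sketch (random effects stripped out for brevity; everything else in your model would stay the same):

data {
  int<lower=0> No;   // number of observations
  vector[No] xjk;    // proportion covariate
  vector[No] z;      // standardised response
}
parameters {
  real mu_z;              // intercept
  real psi;               // slope, with its prior placed directly on it
  real<lower=0> sigma_e;  // residual sd
}
model {
  mu_z ~ normal(0, 1);
  psi ~ normal(0, 1);     // replaces mu_psi + sigma_psi * psi_raw
  sigma_e ~ exponential(3);
  z ~ normal(mu_z + psi * xjk, sigma_e);
}

A hierarchical prior on a slope only pays off when there are several related slopes (contexts) to partially pool; with a single slope, \sigma_\psi is informed by just one draw and the funnel degeneracy appears.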