RStan divergence issue - assistance with experimental research

Hi,

The challenge: divergent transitions for nearly all posterior simulations.

Two main questions:

  1. Are divergent transitions less of a concern for my purposes?

I am estimating parameters for individual experimental subjects and then comparing these parameters (e.g., including them in regression models). I am more interested in the differences between people's parameters than in the posterior of each parameter being precisely simulated.

  2. In my code/model below, can you see any obvious issues I’ve missed that could be addressed to reduce divergent transitions?

I have already tried changing the hierarchical model (e.g., the types of distributions and priors), arriving at one that makes sense conceptually and has an Rhat near 1 across 4 chains (which I’m happy about). I have also made a few attempts at varying the iterations, adapt_delta, and init_r, plus a simulation (further systematic testing TBC); however, the divergences persist to date. Otherwise, there are no error messages for bulk ESS, but the tail ESS is too high.

Further background

I am simulating posteriors for parameters that feature in a choice model: I fit an equation that predicts the probability of choosing option A over option B (i.e., a binary choice).

My RStan model has three main parameters for each individual and six parameters that describe the group-level priors (mean and sd).

The choices are for

  • A: a lottery or gamble that pays $0 (low) or $X (high) with some probability, or
  • B: a sure $amount.

The three main parameters are: Predi, RewExp and noise. The first two are of main interest as they form the key parameters in my choice model:

utility(option A) =

p($0) * ($0^Predi / ($0^Predi + RewExp^Predi) ) +
p($X) * ($X^Predi / ($X^Predi + RewExp^Predi) )

utility(option B) =

p($sure) * ($sure^Predi / ($sure^Predi + RewExp^Predi) )

As a result, there is a high number of parameters: 6 group-level + 3 per subject; with 443 subjects, the total is 6 + 3 × 443 = 1,335.
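To make the choice rule concrete, here is a plain-Python transcription of the utility equations above together with the logistic choice step from the Stan model. The trial values (outcomes, probabilities, parameter settings) are hypothetical, chosen only for illustration.

```python
import math

def utility(outcomes, probs, predi, rew_exp):
    """Sum over outcomes of p * x^Predi / (x^Predi + RewExp^Predi)."""
    total = 0.0
    for x, p in zip(outcomes, probs):
        xp = x ** predi
        total += p * xp / (xp + rew_exp ** predi)
    return total

# Hypothetical trial: option A pays $0 or $20 with 50/50 odds; option B is a sure $8.
# Parameter values are illustrative only, not estimates.
predi, rew_exp, noise = 1.5, 10.0, 0.5
ua = utility([0.0, 20.0], [0.5, 0.5], predi, rew_exp)
ub = utility([8.0], [1.0], predi, rew_exp)

# Probability of choosing A via the logistic ("bernoulli_logit") step
p_choose_a = 1.0 / (1.0 + math.exp(-(ua - ub) / noise))
```

This mirrors the Stan model one-to-one: `utility` plays the role of the `ua`/`ub` accumulation, and the final line is the `bernoulli_logit((ua-ub)/noise[t])` likelihood for one trial.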

Start Stan code/model

// The variables below are created and filled by the data as lists 

data {
  int N;                 // total number of trials (across participants)
  int nsubj;             // number of subjects
  int choices[N];        // the binary choice vector
  real<lower=0> x1a[N];
  real<lower=0> x2a[N];
  real<lower=0> x1b[N];
  real<lower=0> p1a[N];
  real<lower=0> p2a[N];
  real<lower=0> p1b[N];
  int sid[N];            // subject id for each trial
}

// There are three main parameters: noise, Predi, and RewExp

parameters {
  // group-level/priors
  real<lower=0> meannoise;
  real<lower=0> sdnoise;
  real<lower=0.1> meanPredi;
  real<lower=0> sdPredi;
  real<lower=0.01> meanRewExp;
  real<lower=0> sdRewExp;

  
  // individual-level parameters
  real<lower=0.1> noise[nsubj];
  real<lower=0.1> Predi[nsubj];
  real<lower=0.01> RewExp[nsubj];
}

model {
  real ua; // utility of the option a
  real ub; // utility of the option b
  
  // Group-level parameter distributions
  meannoise ~ uniform(-2.30,1.61);
  sdnoise ~ uniform(0,1.13);
  meanPredi ~ uniform(-2.30,2.3);
  sdPredi ~ uniform(0,2.86);
  meanRewExp ~ uniform(-2.30,2.3);
  
  // Hierarchy - this links the group/prior parameters to the individual ones.
  noise ~ lognormal(meannoise, sdnoise);
  Predi ~ lognormal(meanPredi, sdPredi);
  RewExp ~ lognormal(meanRewExp, sdRewExp);

  for (i in 1:N) {
      // calculate the utility of the gamble and the alternative,
      // then estimate the likelihood of choosing option A over B in each trial
      int t = sid[i];

      ua = 0.0;
      ua += p1a[i]*((x1a[i]^Predi[t])/((x1a[i]^Predi[t])+(RewExp[t]^Predi[t])));
      ua += p2a[i]*((x2a[i]^Predi[t])/((x2a[i]^Predi[t])+(RewExp[t]^Predi[t])));

      ub = 0.0;
      ub += p1b[i]*((x1b[i]^Predi[t])/((x1b[i]^Predi[t])+(RewExp[t]^Predi[t])));

      choices[i] ~ bernoulli_logit((ua-ub)/noise[t]); // logistic choice rule
  }
}

End Stan code

// note

  • x1a and p1a are the low outcome and associated probability for option A;
  • x2a and p2a the high outcome and probability for option A; and
  • x1b and p1b are the sure outcome and its probability for option B.

Below I also include the R code I use for executing the Stan model above (called ‘ESVT_AS_2023_v7.stan’):

Start of R code

library(readr)
library(readxl)
library(tidyr)
library(ggplot2)
library(rmarkdown)
library(tidyverse)
library(writexl)
library(rstan)
library(bayesplot)

sample <- read_xlsx("<my excel data set>")

setwd("<my folder>")


### extract the variables from the "sample" 

N=nrow(sample)
nsubj=max(sample['sid'])
subjs=unlist(sample['sid'])
choices=unlist(sample['chosea'])
x1a=unlist(sample['x1a'])
x2a=unlist(sample['x2a'])
x1b=unlist(sample['x1b'])
p1a=unlist(sample['p1a'])
p2a=unlist(sample['p2a'])
p1b=unlist(sample['p1b'])
sid=unlist(sample['sid'])

VarList <- list(N=N, nsubj=nsubj, subjs=subjs, choices=choices, x1a=x1a, x2a=x2a,
                x1b=x1b, p1a=p1a, p2a=p2a, p1b=p1b, sid=sid)

ESVTfit_v7 <- stan(file='ESVT_AS_2023_v7.stan',
                   data=VarList,
                   warmup=5000,
                   iter=50000,
                   chains=4,
                   init_r=1,
                   verbose=FALSE)

print(ESVTfit_v7)
ESVTfit_v7_ss <- extract(ESVTfit_v7,permuted=TRUE)


write.table(as.data.frame(ESVTfit_v7_ss),file="ESVT_2023_v7_bhm.csv",quote=F,sep=",",row.names=F)

traceplot(ESVTfit_v7, pars = c("meannoise", "meanPredi", "meanRewExp")) ## to look at convergence of the mean parameters

End of R code

Your thoughts and assistance are much appreciated.

Cheers,

Alex

P.S. As a further aid in understanding the question, I would like to attach a fake, simulated data set of individuals and choices (the type of data set read in by ‘read_xlsx’ above), but I don’t seem to be able to.

EDIT by Aki: added backticks to show code in code blocks.

This is still a big concern for your purposes.

Where you supply uniform priors, you need to declare the associated upper bounds when you declare the parameters. Alternatively, use a prior whose right tail decays more gradually than the uniform’s. Your divergences are likely due to the Hamiltonian trajectories “falling off a cliff” when they sail off the edge of the uniform distribution.

Hi, @Alex_S and welcome to the Stan forums.

If there are divergent transitions, the draws are probably biased. If you only need a rough answer, it might not matter, but the problem there is that you usually don’t know how much bias there is without getting the correct answer.

ESS is never too high—the higher the better.

I’m not sure what happened to this, but you can use LaTeX in the forums—it’s unusual to try to write units into probability distributions.

I think there are two problems with your code. The first is this conceptual error:

The declaration of meannoise is not consistent with use. If you want it to be uniform in that range, it should be declared as

real<lower=-2.3, upper=1.61> meannoise;

Stan requires the model to have support (finite log density) for any value of the parameters satisfying their declared constraints.

The second problem is that you have a centered hierarchical model:

parameters {
  real<lower=0.1> noise[nsubj];
  real<lower=0> meannoise;
  real<lower=0> sdnoise;
  // ...
}

model {
  noise ~ lognormal(meannoise, sdnoise);
  // ...
}

This tends to induce challenging posterior geometry for sampling when there is not a ton of data. Instead, I would recommend trying the non-centered parameterization, which is a bit more involved for a lognormal:

parameters {
  vector[nsubj] log_noise_std;
  // ...
}

transformed parameters {
  vector[nsubj] noise = exp(meannoise + sdnoise * log_noise_std);
  // ...
}

model {
  log_noise_std ~ std_normal();
  // ...
}

This implies noise ~ lognormal(meannoise, sdnoise).
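As a quick numerical sanity check on that equivalence (a sketch only, not part of the model; `mu` and `sigma` are stand-ins for `meannoise` and `sdnoise`): draw standard-normal variates, shift, scale, and exponentiate, and the logs of the results should recover the lognormal’s parameters.

```python
import math
import random
import statistics

random.seed(42)
mu, sigma = 0.3, 0.8  # stand-ins for meannoise and sdnoise

# Non-centered draw: z ~ std_normal(), then noise = exp(mu + sigma * z)
noise_draws = [math.exp(mu + sigma * random.gauss(0.0, 1.0))
               for _ in range(200_000)]

# If noise ~ lognormal(mu, sigma), then log(noise) ~ normal(mu, sigma)
log_draws = [math.log(n) for n in noise_draws]
est_mu = statistics.fmean(log_draws)
est_sigma = statistics.stdev(log_draws)
```

The sampler never sees `noise` directly; it explores the well-behaved standard-normal `log_noise_std`, which is what improves the posterior geometry.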

You can make the Stan code in the model block much more efficient by (a) eliminating redundant calculations (e.g., x1a[i]^Predi[t] is calculated twice), and (b) vectorizing so there’s a single sampling statement. But I’d suggest getting it working first.

I would recommend putting that redundant calculation into a function for readability and consistency, e.g.,

functions {
  real foo(real a, real b, real c, real d, real e) {
    real pow_b_c = b^c;
    return a * pow_b_c / (pow_b_c + d^e);
  }
}

// e.g., ua += foo(p1a[i], x1a[i], Predi[t], RewExp[t], Predi[t]);

Thank you Bob, jsocolar, and Aki; these comments and adjustments are very helpful! I will implement them, run the model, and post an update. And thanks for catching that: I meant too low ESS. Otherwise, please keep the comments coming if you or others see anything else that may help.
Cheers,
Alex