RStan divergence issue - assistance with experimental research

Hi,

The challenge: divergent transitions for nearly all posterior simulations.

Two main questions:

  1. Are divergent transitions less of a concern for my purposes?

I am estimating parameters for individual experimental subjects and then comparing these parameters (e.g., including them in regression models). I am more interested in the differences between people's parameters than in the posterior of each parameter being precisely simulated.

  2. In my code/model below, can you see any obvious issues I’ve missed that could be addressed to reduce divergent transitions?

I have already tried changing the hierarchical model (e.g., the types of distributions and priors), arriving at one that makes sense conceptually and has an Rhat near 1 across 4 chains (which I’m happy about). I have also made a few attempts at varying the iterations, adapt_delta, and init_r, plus a simulation (further systematic testing TBC); however, the divergences persist to date. Otherwise, there are no error messages for bulk ESS, but the tail ESS is too high.

Further background

I am simulating posteriors for parameters that feature in a choice model: I fit an equation that predicts the probability of choosing option A over option B (i.e., a binary choice).

My RStan model has three main parameters for each individual and six parameters that describe the group-level priors (mean and sd).

The choices are for

  • A: a lottery or gamble that pays $0 (low) or $X (high) with some probability, or
  • B: a sure $amount.

The three main parameters are: Predi, RewExp and noise. The first two are of main interest as they form the key parameters in my choice model:

utility(option A) =

p($0) * ($0^Predi / ($0^Predi + RewExp^Predi) ) +
p($X) * ($X^Predi / ($X^Predi + RewExp^Predi) )

utility(option B) =

p($sure) * ($sure^Predi / ($sure^Predi + RewExp^Predi) )

As a result, there is a high number of parameters: 6 group-level + 3 per subject; with 443 subjects, the total is 6 + 3 × 443 = 1,335.
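To make the choice rule concrete, here is a plain-Python transcription of the utility equations above together with the logistic choice step from the Stan model. The trial values (outcomes, probabilities, parameter settings) are hypothetical, chosen only for illustration.

```python
import math

def utility(outcomes, probs, predi, rew_exp):
    """Sum over outcomes of p * x^Predi / (x^Predi + RewExp^Predi)."""
    total = 0.0
    for x, p in zip(outcomes, probs):
        xp = x ** predi
        total += p * xp / (xp + rew_exp ** predi)
    return total

# Hypothetical trial: option A pays $0 or $20 with 50/50 odds; option B is a sure $8.
# Parameter values are illustrative only, not estimates.
predi, rew_exp, noise = 1.5, 10.0, 0.5
ua = utility([0.0, 20.0], [0.5, 0.5], predi, rew_exp)
ub = utility([8.0], [1.0], predi, rew_exp)

# Probability of choosing A via the logistic ("bernoulli_logit") step
p_choose_a = 1.0 / (1.0 + math.exp(-(ua - ub) / noise))
```

This mirrors the Stan model one-to-one: `utility` plays the role of the `ua`/`ub` accumulation, and the final line is the `bernoulli_logit((ua-ub)/noise[t])` likelihood for one trial.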

Start Stan code/model

// The variables below are created and filled by the data as lists 

data {
  int N;                 // total number of trials (across participants)
  int nsubj;             // number of subjects
  int choices[N];        // the binary choice vector
  real<lower=0> x1a[N];
  real<lower=0> x2a[N];
  real<lower=0> x1b[N];
  real<lower=0> p1a[N];
  real<lower=0> p2a[N];
  real<lower=0> p1b[N];
  int sid[N];            // subject id for each trial
}

// There are three main parameters: noise, Predi, and RewExp

parameters {
  // group-level/priors
  real<lower=0> meannoise;
  real<lower=0> sdnoise;
  real<lower=0.1> meanPredi;
  real<lower=0> sdPredi;
  real<lower=0.01> meanRewExp;
  real<lower=0> sdRewExp;

  
  // individual-level parameters
  real<lower=0.1> noise[nsubj];
  real<lower=0.1> Predi[nsubj];
  real<lower=0.01> RewExp[nsubj];
}

model {
  real ua; // utility of the option a
  real ub; // utility of the option b
  
  // Group-level parameter distributions
  meannoise ~ uniform(-2.30,1.61);
  sdnoise ~ uniform(0,1.13);
  meanPredi ~ uniform(-2.30,2.3);
  sdPredi ~ uniform(0,2.86);
  meanRewExp ~ uniform(-2.30,2.3);
  
  // Hierarchy - this links the group/prior parameters to the individual ones.
  noise ~ lognormal(meannoise, sdnoise);
  Predi ~ lognormal(meanPredi, sdPredi);
  RewExp ~ lognormal(meanRewExp, sdRewExp);

  for (i in 1:N) {
      // calculate the utility of the gamble and the alternative,
      // then estimate the likelihood of choosing option A over B in each trial
      int t = sid[i];

      ua = 0.0;
      ua += p1a[i]*((x1a[i]^Predi[t])/((x1a[i]^Predi[t])+(RewExp[t]^Predi[t])));
      ua += p2a[i]*((x2a[i]^Predi[t])/((x2a[i]^Predi[t])+(RewExp[t]^Predi[t])));

      ub = 0.0;
      ub += p1b[i]*((x1b[i]^Predi[t])/((x1b[i]^Predi[t])+(RewExp[t]^Predi[t])));

      choices[i] ~ bernoulli_logit((ua-ub)/noise[t]); // logistic choice rule
  }
}

End Stan code

// note

  • x1a and p1a are the low outcome and associated probability for option A;
  • x2a and p2a the high outcome and probability for option A; and
  • x1b and p1b are the sure outcome and its probability for option B.

Below I also include the R code I use for executing the Stan model above (called ‘ESVT_AS_2023_v7.stan’):

Start of R code

library(readr)
library(readxl)
library(tidyr)
library(ggplot2)
library(rmarkdown)
library(tidyverse)
library(writexl)
library(rstan)
library(bayesplot)

sample <- read_xlsx("<my excel data set>")

setwd("<my folder>")


### extract the variables from the "sample" 

N=nrow(sample)
nsubj=max(sample['sid'])
subjs=unlist(sample['sid'])
choices=unlist(sample['chosea'])
x1a=unlist(sample['x1a'])
x2a=unlist(sample['x2a'])
x1b=unlist(sample['x1b'])
p1a=unlist(sample['p1a'])
p2a=unlist(sample['p2a'])
p1b=unlist(sample['p1b'])
sid=unlist(sample['sid'])

VarList <- list(N=N, nsubj=nsubj, subjs=subjs, choices=choices, x1a=x1a, x2a=x2a,
                x1b=x1b, p1a=p1a, p2a=p2a, p1b=p1b, sid=sid)

ESVTfit_v7 <- stan(file='ESVT_AS_2023_v7.stan',
                   data=VarList,
                   warmup=5000,
                   iter=50000,
                   chains=4,
                   init_r=1,
                   verbose=FALSE)

print(ESVTfit_v7)
ESVTfit_v7_ss <- extract(ESVTfit_v7,permuted=TRUE)


write.table(as.data.frame(ESVTfit_v7_ss),file="ESVT_2023_v7_bhm.csv",quote=F,sep=",",row.names=F)

traceplot(ESVTfit_v7, pars = c("meannoise", "meanPredi", "meanRewExp")) ## to look at convergence of the mean parameters

End of R code

Your thoughts and assistance are much appreciated.

Cheers,

Alex

P.S. As a further aid in understanding the question, I would like to attach a fake, simulated data set of individuals and choices (the type of data set read in by ‘read_xlsx’ above), but I don’t seem to be able to.

EDIT by Aki: added backticks to show code in code blocks.

This is still a big concern for your purposes.

Where you supply uniform priors, you need to declare the associated upper bounds when you declare the parameters. Alternatively, use a prior whose right tail decays more gradually than the uniform’s. Your divergences are likely due to the Hamiltonian trajectories “falling off a cliff” when they sail off the edge of the uniform distribution.

Hi, @Alex_S and welcome to the Stan forums.

If there are divergent transitions, the draws are probably biased. If you only need a rough answer, it might not matter, but the problem there is that you usually don’t know how much bias there is without getting the correct answer.

ESS is never too high—the higher the better.

I’m not sure what happened to this, but you can use LaTeX in the forums—it’s unusual to try to write units into probability distributions.

I think there are two problems with your code. The first is this conceptual error:

The declaration of meannoise is not consistent with use. If you want it to be uniform in that range, it should be declared as

real<lower=-2.3, upper=1.61> meannoise;

Stan requires the model to have support (finite log density) for any value of the parameters satisfying their declared constraints.

The second problem is that you have a centered hierarchical model:

parameters {
  real<lower=0.1> noise[nsubj];
  real<lower=0> meannoise;
  real<lower=0> sdnoise;
  // ...
}

model {
  noise ~ lognormal(meannoise, sdnoise);
  // ...
}

This tends to induce challenging posterior geometry for sampling when there is not a ton of data. Instead, I would recommend trying the non-centered parameterization, which is a bit more involved for a lognormal:

parameters {
  vector[nsubj] log_noise_std;
  // ...
}

transformed parameters {
  vector[nsubj] noise = exp(meannoise + sdnoise * log_noise_std);
  // ...
}

model {
  log_noise_std ~ std_normal();
  // ...
}

This implies noise ~ lognormal(meannoise, sdnoise).
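As a quick numerical sanity check on that equivalence (a sketch only, not part of the model; `mu` and `sigma` are stand-ins for `meannoise` and `sdnoise`): draw standard-normal variates, shift, scale, and exponentiate, and the logs of the results should recover the lognormal’s parameters.

```python
import math
import random
import statistics

random.seed(42)
mu, sigma = 0.3, 0.8  # stand-ins for meannoise and sdnoise

# Non-centered draw: z ~ std_normal(), then noise = exp(mu + sigma * z)
noise_draws = [math.exp(mu + sigma * random.gauss(0.0, 1.0))
               for _ in range(200_000)]

# If noise ~ lognormal(mu, sigma), then log(noise) ~ normal(mu, sigma)
log_draws = [math.log(n) for n in noise_draws]
est_mu = statistics.fmean(log_draws)
est_sigma = statistics.stdev(log_draws)
```

The sampler never sees `noise` directly; it explores the well-behaved standard-normal `log_noise_std`, which is what improves the posterior geometry.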

You can make the Stan code in the model block much more efficient by (a) eliminating redundant calculations (e.g., x1a[i]^Predi[t] is calculated twice), and (b) vectorizing so there’s a single sampling statement. But I’d suggest getting it working first.

I would recommend putting that redundant calculation into a function for readability and consistency, e.g.,

functions {
  real foo(real a, real b, real c, real d, real e) {
    real pow_b_c = b^c;
    return a * pow_b_c / (pow_b_c + d^e);
  }
}

// e.g., ua += foo(p1a[i], x1a[i], Predi[t], RewExp[t], Predi[t]);

Thank you Bob, jsocolar, and Aki; these comments and adjustments are very helpful! I will implement them, run the model, and post an update. And thanks for catching that: I meant too low ESS. Otherwise, please keep the comments coming if you or others see anything else that may help.
Cheers,
Alex