Low effective sample size after running Bayesian cognitive model in Stan

Hi all,

I fitted a Bayesian hierarchical reinforcement learning model to my behavioral data and there were no model-fitting problems. Specifically, I fitted this model using 4 chains, with 1,000 warm-up and 11,000 iterations per chain, and I had already employed both vectorization and non-centered parameterization in my Stan code. However, when I checked the model output, I noticed that for some parameters of interest the effective sample size is relatively low (given that each chain has 11,000 iterations), so I am not sure whether those posterior estimates are reliable.
(image: Rhat values and n_eff for all hyperparameters)
(images: traceplots for all hyperparameters)
(images: trankplots for all hyperparameters)
Does anyone see clues in the plots above that I have missed, or any other errors? Do you have any thoughts about what might be going on here? Thank you in advance for your patience.

Given that the Rhats are close to 1 and there are no other model-misfit indicators (divergences, low E-BFMI, treedepth warnings, LOO issues, etc.), once you have a hundred or so effective samples you needn't worry much about estimating even the tail properties of the posterior for a parameter. It is odd to see such low ESS/N_samples ratios, but not unheard of in my experience with complex models. But maybe experts like @andrewgelman and @avehtari might weigh in.
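As a rough rule of thumb, the Monte Carlo standard error of a posterior mean is about the posterior standard deviation divided by sqrt(ESS), so around 100 effective samples already pins a posterior mean down to roughly a tenth of a posterior standard deviation.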

Also, did you opt for such a high number of post-warmup samples only to bump up the ESS, or were other model diagnostics failing at lower numbers?


First, I don’t think you need 11,000 iterations per chain. Second, if you did need 11,000 iterations I think you’d want about that many warmup. What happened at the default of 1000 warmup and 1000 saved iterations? I’d guess that would be enough. Also, I recommend changing your print defaults to give just 1 or 2 decimal places!

Hi Mike,

Thank you very much for your reply; I really appreciate it. To be honest, I did indeed opt for such a high number of post-warmup samples only to bump up the ESS, but I am not sure what will happen if I run the model with only 1,000 warm-up and 2,000 iterations per chain (I used 4 chains).

I will re-run the model with 2,000 iterations and 1,000 warm-up and let you know what happens.

Thank you again

Hi Andrew,

Thank you very much for your reply as well; I really appreciate it. To be honest, I haven’t yet run my model with 2,000 iterations and 1,000 warm-up per chain. I will do it now and let you know what happens.

Thank you again!

Hi Andrew & Mike,

I re-ran my model at the default of 1,000 warmup and 1,000 saved iterations (4 chains), but I got a warning message saying that ‘Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable’. So I then fitted my model using 3,000 warmup and 3,000 saved iterations (4 chains), and this time there were no warning messages. The model outputs are below:
(image: Rhat values and n_eff)
(images: traceplot)
(images: trankplot)
The model diagnostics above look acceptable to me. What do you think, and do you have any further suggestions?

Best,
Jisoo Kim


Looks good now!

In terms of trusting the inference results, at least. It does look like the default adaptation is struggling, which might be informative for the folks working to improve the algorithms. To that end, I should have asked earlier: I assume your model is hierarchical, so have you tried both centered and non-centered parameterizations?
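If not, it may be worth a quick experiment. For a generic hierarchical effect, the centered version looks roughly like this (a minimal sketch with hypothetical names theta, mu, tau, not your actual model):

data {
  int<lower=1> N;            // number of subjects
}
parameters {
  real mu;                   // group-level location
  real<lower=0> tau;         // group-level scale
  vector[N] theta;           // subject-level effects
}
model {
  mu ~ normal(0, 1);
  tau ~ normal(0, 1);
  theta ~ normal(mu, tau);   // centered: theta is drawn directly on the group scale
}

The non-centered version instead declares vector[N] theta_raw with theta_raw ~ std_normal() and builds theta = mu + tau * theta_raw in transformed parameters. Which one adapts and samples better usually depends on how strongly the data inform the subject-level effects.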

Hi Mike,

You are right: I ran a Bayesian hierarchical reinforcement learning model. Below is my Stan code (in case you are interested); I employed both non-centered parameterization and vectorized code.

data {
  int<lower=1> N;
  int<lower=1> T;
  int<lower=1, upper=4> cue[N, T];
  int<lower=0, upper=1> action[N, T];
  real outcome[N, T];

}

transformed data {
  vector[4] initV; 
  initV = rep_vector(0.0, 4);
  
}

parameters {
  // in total 6 parameters of interest:
  // 1) xi     : lapse rate (mixes the go probability with chance)
  // 2) ep     : learning rate
  // 3) b      : go bias (added to the go action weight)
  // 4) pi     : Pavlovian bias (weights the stimulus value sv)
  // 5) rhoRew : reward sensitivity (scales rewarding outcomes)
  // 6) rhoPun : punishment sensitivity (scales punishing outcomes)
  
  vector[6] mu_pr; // declare as vectors for vectorizing
  vector<lower=0>[6] sigma;
  
  vector[N] xi_pr; // each element ~ N(0, 1)
  vector[N] ep_pr; 
  vector[N] b_pr; 
  vector[N] pi_pr; 
  vector[N] rhoRew_pr; 
  vector[N] rhoPun_pr; 
  
}


transformed parameters {
  vector<lower=0, upper=0.2>[N] xi; // bounded between [0, 0.2]
  vector<lower=0, upper=1>[N] ep; // bounded between [0, 1]
  vector[N] b; // unbounded, namely (-Inf, Inf)
  vector[N] pi; // unbounded, namely (-Inf, Inf)
  vector<lower=0>[N] rhoRew; // bounded as [0, Inf)
  vector<lower=0>[N] rhoPun; // bounded as [0, Inf)
  
  // vectorization and non-centered parameterization
  xi = Phi_approx(mu_pr[1] + sigma[1] * xi_pr)*0.2;
  ep = Phi_approx(mu_pr[2] + sigma[2] * ep_pr);
  b = mu_pr[3] + sigma[3] * b_pr;
  pi = mu_pr[4] + sigma[4] * pi_pr;
  rhoRew = log(1 + exp(mu_pr[5] + sigma[5] * rhoRew_pr));
  rhoPun = log(1 + exp(mu_pr[6] + sigma[6] * rhoPun_pr));
  
  
}


model {
  // priors for these 6 hyper-parameters
  mu_pr[1] ~ normal(0, 1.0); // xi
  mu_pr[2] ~ normal(0, 1.0); // ep
  mu_pr[3] ~ normal(0, 2.0); // b
  mu_pr[4] ~ normal(0, 2.0); // pi
  mu_pr[5] ~ normal(0, 1.0); // rhoRew
  mu_pr[6] ~ normal(0, 1.0); // rhoPun
  
  sigma[1] ~ normal(0, 0.2); // xi
  sigma[2] ~ normal(0, 0.2); // ep
  sigma[3] ~ cauchy(0, 1.0); // b
  sigma[4] ~ cauchy(0, 1.0); // pi
  sigma[5] ~ normal(0, 0.2); // rhoRew
  sigma[6] ~ normal(0, 0.2); // rhoPun
  
  
  // Matt trick
  xi_pr ~ normal(0, 1.0);
  ep_pr ~ normal(0, 1.0);
  b_pr ~ normal(0, 1.0);
  pi_pr ~ normal(0, 1.0);
  rhoRew_pr ~ normal(0, 1.0);
  rhoPun_pr ~ normal(0, 1.0);
  
  for (s in 1:N) {
    // 1) In total I have 4 training conditions.
    // 2) Under each condition subjects have two alternative action choices: go and no-go.
    // 3) So, for each action per condition, I want to know its action weight and Q-value.
    vector[4] actionWeight_go; 
    vector[4] actionWeight_nogo; 
    vector[4] qValue_go; 
    vector[4] qValue_nogo; 
    vector[4] sv; 
    vector[4] pGo; 
    
    actionWeight_go = initV;
    actionWeight_nogo = initV;
    qValue_go = initV; 
    qValue_nogo = initV;
    sv = initV; 
    
    for (t in 1:T) {
      actionWeight_go[cue[s, t]] = qValue_go[cue[s, t]] + b[s] + pi[s] * sv[cue[s, t]]; 
      actionWeight_nogo[cue[s, t]] = qValue_nogo[cue[s, t]]; 
      
      // the probability of making go responses
      pGo[cue[s,t]] = inv_logit(actionWeight_go[cue[s, t]] - actionWeight_nogo[cue[s,t]]);
      
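      // mix pGo toward chance: pGo = (1 - xi) * pGo + xi / 2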
      {
        pGo[cue[s, t]] *= (1 - xi[s]);
        pGo[cue[s, t]] += (xi[s]/2);
      }
      
      // Likelihood function
      action[s, t] ~ bernoulli(pGo[cue[s, t]]);
      
      // update the stimulus value
      if (cue[s, t] ==1) { // goToWin (0 or 1)
        if (outcome[s, t] == 1) {
          sv[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - sv[cue[s, t]]);
        } else if (outcome[s, t] ==0) {
           sv[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - sv[cue[s, t]]);
        }
      } else if (cue[s, t] == 2) { // goToAvoid (-1 or 0)
        if (outcome[s, t] == 0) {
          sv[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - sv[cue[s, t]]);
        } else if (outcome[s, t] == -1) {
           sv[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - sv[cue[s, t]]);
        }
      } else if (cue[s, t] ==3) { // nogoToWin (0 or 1)
        if (outcome[s, t] == 1) {
          sv[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - sv[cue[s, t]]);
        } else if (outcome[s, t] ==0) {
           sv[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - sv[cue[s, t]]);
        } 
      } else if (cue[s, t] == 4) { // nogoToAvoid (-1 or 0)
        if (outcome[s, t] == 0) {
          sv[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - sv[cue[s, t]]);
        } else if (outcome[s, t] == -1) {
           sv[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - sv[cue[s, t]]);
        }
      }
      
      // update the action values
      if (cue[s, t] == 1) { // goToWin (0 or 1)
        if (action[s, t] == 1 && outcome[s, t] == 1) {
          qValue_go[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 1 && outcome[s, t] == 0) {
          qValue_go[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 1) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoRew[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 0) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoPun[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        }
        
      } else if (cue[s, t] == 2) { // goToAvoid (-1 or 0)
        if (action[s, t] == 1 && outcome[s, t] == 0) {
          qValue_go[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 1 && outcome[s, t] == -1) {
          qValue_go[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 0) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoRew[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == -1) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoPun[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        }
  
      } else if (cue[s, t] == 3) { // nogoToWin (0 or 1)
        if (action[s, t] == 1 && outcome[s, t] == 1) {
          qValue_go[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 1 && outcome[s, t] == 0) {
          qValue_go[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 1) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoRew[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 0) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoPun[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        }
        
      } else if (cue[s, t] == 4) { // nogoToAvoid (-1 or 0)
        if (action[s, t] == 1 && outcome[s, t] == 0) {
          qValue_go[cue[s, t]] += ep[s] * (rhoRew[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 1 && outcome[s, t] == -1) {
           qValue_go[cue[s, t]] += ep[s] * (rhoPun[s] * outcome[s, t] - qValue_go[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == 0) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoRew[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        } else if (action[s, t] == 0 && outcome[s, t] == -1) {
          qValue_nogo[cue[s, t]] += ep[s] * (rhoPun[s] *outcome[s, t] - qValue_nogo[cue[s, t]]);
        }
      }
      
      
    } // end of the loop for the trial-level
    
  } // end of the loop for the subject-level
  
}


generated quantities {

  real<lower=0, upper=0.2> mu_xi;
  real<lower=0, upper=1> mu_ep;
  real mu_b;
  real mu_pi;
  real<lower=0> mu_rhoRew;
  real<lower=0> mu_rhoPun;
  
  mu_xi = Phi_approx(mu_pr[1])*0.2;
  mu_ep = Phi_approx(mu_pr[2]);
  mu_b = mu_pr[3];
  mu_pi = mu_pr[4];
  mu_rhoRew = log(1 + exp(mu_pr[5]));
  mu_rhoPun = log(1 + exp(mu_pr[6]));
  
}

To be honest, I haven’t tried the centered parameterization, because I assumed the non-centered parameterization would give more efficient sampling for a hierarchical model. I will try the centered parameterization soon (see the sketch below for what I have in mind) and let you know what happens. Also, if you have any suggestions for improving or optimizing the code, that would be great.
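For ep, for example, I think the centered version would look roughly like this (only the changed pieces are shown; the other parameters would be handled analogously):

parameters {
  // ... other parameters unchanged ...
  vector[N] ep_c;                       // replaces ep_pr; sampled on the group scale
}

transformed parameters {
  // ... other transformed parameters unchanged ...
  vector<lower=0, upper=1>[N] ep = Phi_approx(ep_c);
}

model {
  // ... priors on mu_pr and sigma unchanged ...
  ep_c ~ normal(mu_pr[2], sigma[2]);    // centered parameterization of ep
  // ... likelihood unchanged ...
}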

Best,
Kim


Hi Mike,

Just to keep you updated: I tried the centered parameterization. Specifically, I first fitted the model at the default of 1,000 warmup and 1,000 saved iterations (4 chains), and again I got the warning that the Bulk Effective Sample Size (ESS) is too low, indicating posterior means and medians may be unreliable. Next, I fitted the model using 3,000 warmup and 3,000 saved iterations (4 chains), and this time there were no warning messages. Most importantly, the posterior estimates for the parameters of interest are essentially the same under both the centered and non-centered parameterizations, and I did not notice any meaningful difference in sampling efficiency between them.

Thank you very much for your help; I really appreciate it.

Cheers,
Kim
