Model fitted well with 500 warmup iterations but failed to fit with 1,000

Dear Stan community,

I am trying to fit a Rescorla-Wagner model to my data.
I managed to fit the model using 1 chain, with 500 warmup and 2,000 total iterations.

Surprisingly, the same model fails to fit when I use 1 chain with 1,000 warmup and 2,000 total iterations.

Is this due to the rstan and Stan versions I am using?

rstan version: 2.26.13
Stan version: 2.26.1
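
For reference, the two calls look roughly like this (the file name and data list are placeholders for my actual setup):

library(rstan)

# the run that "worked"
fit_short <- stan(file = "rw_model.stan", data = stan_data,
                  chains = 1, warmup = 500, iter = 2000, seed = 123)

# the run that failed
fit_long <- stan(file = "rw_model.stan", data = stan_data,
                 chains = 1, warmup = 1000, iter = 2000, seed = 123)

(Note that in rstan iter includes warmup, so these runs keep 1,500 and 1,000 post-warmup draws, respectively.)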

If you get different results with a shorter warmup and a longer one, there is probably something wrong somewhere. At some point a longer warmup should become redundant: it will still yield “good” algorithm parameters that allow sampling of the posterior. I cannot think of a scenario where things are working as intended and a longer warmup breaks sampling.

The important question is what “managed to fit” means precisely here. Not getting divergent transitions is better than getting them, but you may still have a poor fit because, for example, the step size and mass-matrix parameters were not well optimized. I think something like this is a lot more likely than some weirdness where a shorter warmup works and a longer one doesn’t, but figuring out what’s actually happening requires more detail about the run.
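
As a first check, it helps to look at the sampler diagnostics explicitly rather than relying on the absence of errors. A minimal sketch in rstan, where fit is your stanfit object:

check_hmc_diagnostics(fit)           # divergences, treedepth, E-BFMI
fit_summary <- summary(fit)$summary
fit_summary[, c("n_eff", "Rhat")]    # effective sample size and convergence diagnostic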

Dear @caesoma,

Thank you very much for your reply. To be honest, I also find it weird that a shorter warmup performs better than a longer one.

I would like to provide more details, and it would be nice to have your professional suggestions.

I am fitting a reinforcement learning model to my data because I would like trial-by-trial model regressors. However, there are 3 things that should be noted:

  1. I have a lot of missing data (nearly 1/3 of trials), but I cannot remove those trials, because in my experiment the trial sequence is meaningful (see the indexing sketch below).
  2. My DV is skin conductance response (SCR), which can only take positive values.
  3. My model is hierarchical: participant-level parameters are drawn from population-level (hyperparameter) distributions.
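
For point 1, I pass the positions of observed and missing trials into Stan as index arrays. On the R side this is built roughly like this (scr stands in for my full-length SCR vector, with NA on missing trials):

ii_obs <- which(!is.na(scr))  # positions of observed trials
ii_mis <- which(is.na(scr))   # positions of missing trials
stan_data <- list(N_obs = length(ii_obs), N_mis = length(ii_mis),
                  ii_obs = ii_obs, ii_mis = ii_mis,
                  SCR_obs = scr[ii_obs])
# ... plus S, Tri, subj_trial, CS, and US in the same list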

Following is my Stan code:

data {
  int<lower=0> N_obs; // number of observed data points
  int<lower=0> N_mis; // number of missing data points 
  array[N_obs] int<lower=1, upper=N_obs + N_mis> ii_obs; // indices of the observed data
  array[N_mis] int<lower=1, upper=N_obs + N_mis> ii_mis; // indices of the missing data
  
  int<lower=1> S; // number of participants
  int<lower=1> Tri; // maximum number of trials per participant
  array[S] int<lower=1, upper=Tri> subj_trial; // number of trials for each subject
  
  array[S, Tri] int<lower=1, upper=2> CS; // CS per trial for each subject
  array[S, Tri] int<lower=0, upper=1> US; // US per trial for each subject
  array[N_obs] real<lower=0> SCR_obs; // observed SCR data
  
}

transformed data {
  int<lower=0> N_all = N_obs + N_mis;
  vector[2] init = rep_vector(0.5, 2); // initial Q-values for CS+ and CS-
}

parameters {
  // missing data
  array[N_mis] real<lower=0> SCR_mis; // missing SCR data
  
  // Hierarchical mu parameters (at the population-level)
  real mu_b0; // intercept
  real mu_b1; // slope
  real mu_sigma; // sigma
  real mu_kappa; // scaling parameter for updating the learning rate trial by trial
  real mu_eta; // learning rate

  
  // sigma parameters
  real<lower=0.001> sigma_b0; // from the hierarchical distribution of intercept
  real<lower=0.001> sigma_b1; // from the hierarchical distribution of slope
  real<lower=0.001> sigma_par; // from the hierarchical distribution of sigma
  real<lower=0.001> sigma_kappa; // for kappa
  real<lower=0.001> sigma_eta; // for the learning rate
   
  // Standard-normal variates for the non-centered parameterization (the “Matt trick”)
  array[S] real z_b0;
  array[S] real z_b1;
  array[S] real z_sigma;
  array[S] real z_kappa;
  array[S] real z_eta;
  
}

transformed parameters {
  // data
  array[N_all] real<lower=0> SCR; 
  SCR[ii_obs] = SCR_obs;
  SCR[ii_mis] = SCR_mis; 
  
  // participant-level parameters
  array[S] real b0;
  array[S] real b1;
  array[S] real<lower=0> sigma;
  array[S] real<lower=0, upper=1> kappa;
  array[S] real<lower=0, upper=1> eta;
  
  for (s in 1:S) {
    b0[s] = mu_b0 + z_b0[s]*sigma_b0;
    b1[s] = mu_b1 + z_b1[s]*sigma_b1;
    sigma[s] = exp(mu_sigma + z_sigma[s]*sigma_par);
    kappa[s] = Phi_approx(mu_kappa + z_kappa[s]*sigma_kappa);
    eta[s] = Phi_approx(mu_eta + z_eta[s]*sigma_eta);

  }

}

model {
  
  SCR_mis ~ gamma(2, 2);
  
  mu_b0 ~ normal(0, 2);
  mu_b1 ~ normal(0, 2);
  mu_sigma ~ normal(0, 2);
  mu_kappa ~ normal(0, 2);
  mu_eta ~ normal(0, 2);
  
  sigma_b0 ~ gamma(2, 2);
  sigma_b1 ~ gamma(2, 2);
  sigma_par ~ gamma(2, 2);
  sigma_kappa ~ gamma(2, 2);
  sigma_eta ~ gamma(2, 2);

  z_b0 ~ normal(0, 1);
  z_b1 ~ normal(0, 1);
  z_sigma ~ normal(0, 1);
  z_kappa ~ normal(0, 1);
  z_eta ~ normal(0, 1);
  
  /* trial-by-trial update */
  for (s in 1:S) {
    vector[2] q = init; // Q-values for CS+/CS-
    real eta_updated = eta[s]; // trial-varying learning rate, initialized at eta[s]
    int offset = sum(subj_trial[1:(s - 1)]); // trials of earlier subjects in the flattened SCR array

    for (t in 1:subj_trial[s]) {
      int row_number = offset + t; // current row, since data are stored trial by trial

      target += normal_lpdf(SCR[row_number] | b0[s] + b1[s] * eta_updated, sigma[s]);

      // surprise-driven update: |q - US| adjusts the learning rate,
      // which then updates the Q-value of the presented CS
      eta_updated = kappa[s] * abs(q[CS[s, t]] - US[s, t]) + (1 - kappa[s]) * eta_updated;
      q[CS[s, t]] += eta_updated * (US[s, t] - q[CS[s, t]]);
    }
    
  }
 
}
 

generated quantities {
  // population-level kappa mapped back to the (0, 1) scale
  real<lower=0, upper=1> transf_mu_kappa = Phi_approx(mu_kappa);
  
}

I actually have no idea what’s going wrong. I think there might be some issue with my Stan code that seriously hinders the model fit or the sampling efficiency, and I would very much appreciate any suggestions.

Best

This is true in expectation, but not chain by chain. See below.

How many runs with different random seeds did you try? There is a lot of randomness due to the random initialization (the main effect) and then the stochastic nature of MCMC during warmup.

I’d also suggest running more than one chain, which can help diagnose the kind of chain-to-chain variation you’re seeing, so that you don’t inadvertently accept biased answers. The problem is that the one chain that looks like it worked may be stuck in one part of the parameter space and only appear to be sampling the whole posterior.
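
For example, something along these lines (a sketch; the file name and data list are stand-ins):

fit <- stan(file = "rw_model.stan", data = stan_data,
            chains = 4, warmup = 1000, iter = 2000, seed = 2024)

# split-Rhat near 1 and overlapping traces suggest the chains agree;
# a single "working" chain can hide being stuck in one region
print(fit, pars = c("mu_b0", "mu_b1", "mu_kappa", "mu_eta"))
traceplot(fit, pars = "mu_eta", inc_warmup = TRUE)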


That’s of course true. Assuming the parameter space is not too large and the model isn’t too complex, you should get a “well-behaved” warmup, where after some number of iterations the sampler both settles on an “optimal” step size and reaches a region of parameter space that allows a reasonable estimate of the metric (i.e., mass matrix).
So the corollary is that a “good” warmup should produce both an optimal step size and a mass matrix that allow efficient exploration of the high-probability regions; in sum, you can look at those numbers and compare them between (parallel or sequential) chains.
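
In rstan those adapted quantities can be pulled out directly; a sketch, assuming fit holds a multi-chain run:

get_adaptation_info(fit)  # per-chain adapted step size and inverse-metric diagonal

sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) mean(x[, "stepsize__"]))  # post-warmup step size, per chain

Chains whose adapted step sizes or metric diagonals differ wildly are a sign that warmup has not converged to a common adaptation.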

I agree that’s a good model-agnostic sanity check on the results of warmup.

If the general advice doesn’t improve your results, you will have to go into the model to see what’s causing the issues, and whether they can be fixed. I’m not sure what S and Tri are, but it looks like they could be large, and with the hierarchical model you may be in the large-parameter-space/complex-model situation. Another good practice when you’re implementing a model for the first time (or when it doesn’t work right away as expected) is to implement a smaller version where you can do things like the following (a simulation sketch for point 3 follows the list):

  1. remove the hierarchy (e.g. fit only one or a few participants);
  2. subset the data;
  3. simulate data with known parameters, then fit the model;
  4. reduce the number of parameters or simplify the model somewhere.
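
For instance, point 3 could start from a single simulated participant that mirrors the update in your model block (parameter values are arbitrary; with these values the simulated SCRs stay positive, as your model requires):

set.seed(1)
Tri <- 60
CS  <- sample(1:2, Tri, replace = TRUE)        # CS+/CS- sequence
US  <- as.integer(CS == 1 & runif(Tri) < 0.5)  # e.g., a partially reinforced CS+

b0 <- 1; b1 <- 2; sigma <- 0.3; kappa <- 0.4; eta <- 0.5
q   <- c(0.5, 0.5)                             # same init as the Stan code
SCR <- numeric(Tri)
for (t in 1:Tri) {
  SCR[t] <- rnorm(1, b0 + b1 * eta, sigma)     # likelihood uses the current eta
  eta    <- kappa * abs(q[CS[t]] - US[t]) + (1 - kappa) * eta
  q[CS[t]] <- q[CS[t]] + eta * (US[t] - q[CS[t]])
}
# fit the non-hierarchical model to SCR and check that the known values are recovered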

It’s unlikely that a bug in the code alone would generate that behavior, but it’s certainly possible that an error gave the impression that the model was working when it wasn’t, and then it breaks more obviously under other conditions.