Inconsistent chain speed - does this give a clue about the problem?

optimization

#1

I am running a 12-chain MCMC models using the stan command in R on a server with 12 [virtual] cores, using all 12.

For one particular model I ran today, the initial iterations sped through, at about 10 iterations per second on each chain. That was much faster than I expected. The first few chains completed 500 iterations very quickly, within a few minutes. As they progressed through, they slowed down to a crawl to the point where it takes over a minute to get through a single iteration. To rephrase, it’s slowed from an initial ~600 iterations/minute to now less than 1 iteration/minute. There is still slow progress, so I know it hasn’t hung.

I have checked the server’s process monitor. I can see that the process has 100% CPU use. I upgraded its priority/niceness so that it has priority over most other processes, but this doesn’t seem to have made any difference.

The process I am referring to, whose priority I changed, is 21236. The other two processes pictured are from an earlier stan job, that happens to have the problem I’m describing for 21236.

My model itself contains several user-defined functions. There’s nothing obvious to me about the properties of the model design which would cause such a slowdown. I’ve run different versions of the model over the last 24 hours and have noticed similar behavior, but I haven’t noticed this before with models in the past.

One thing I’m wondering is whether these might be chains stranded in a bad area of estimation space. But I don’t know why that would make them slower - certainly not by this amount.

Has anyone seen anything like this, or can anyone identify some possible causes I could chase up?


#2

Do you know what you have max_treedepth / adapt_delta set to? If some of the chains ended up adapting with a very small step size, then said chains will need to take a larger number of steps / gradient evaluations per iteration (up to 2^{\text{max_treedepth}} steps, or thereabouts). What is your ballpark estimate of how much longer the slow chains have left to run? These things can really only be diagnosed once all the chains have finished running.


#3

It’ll be hours for those chains to finish.

Max treedepth is 15. adapt_delta is the default 0.8.

I did get a smaller 3-chain version to finish running! After that I got a warning:

There were 75 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.

And so I increased adapt_delta to 0.9 to try to fix. Not sure that made any difference.


#4

I’m going to try running with max_treedepth back down to default 10…


#5

Could be worse! I’ve definitely waited at least week for some models.

15 means there can be up to 32767 gradient evaluations per iteration, which can take quite some time.

If you are already getting divergent transitions, this is unlikely to solve your problem; there is little point running the model with lower values for the control parameters, as the output samples are likely to mostly consist of divergent samples.

The fact that some chains finished quickly whilst some didn’t suggests there are some difficult regions of parameter space that some chains just missed. I would look very carefully over the model / Stan code / simpler versions of the model. Can you post said Stan code (of the currently running model) here?


#6

I have about 150 subjects to fit in this model, and right now I’m just testing with 2. So I am hoping to see it complete relatively quickly with this number in order for processing the full dataset to be tractable!

I should note that I am running both fewer chains (3) and iterations (400, 200 warmup) just to test for now. Maybe that explains the “divergent transitions after warmup?” but I still need to work out why it was slow.

I am currently testing a partially non-centered parameterization (it’s a 3-level model, 1 level has been centered) as well.

Anyway - I’ll post the full code of the currently running model in the next post. It is long, which is why I hadn’t posted before.


#7
//these user-defined functions are slight adaptations of the code described
//in Annis, Miller, & Palmeri (2017)
functions{
     
     real lba_pdf(real t, real b, real A, real v, real s){
          //PDF of the LBA model
          
          real b_A_tv_ts;
          real b_tv_ts;
          real term_1;
          real term_2;
          real term_3;
          real term_4;
          real pdf;
          
          b_A_tv_ts = (b - A - t*v)/(t*s);
          b_tv_ts = (b - t*v)/(t*s);
          term_1 = v*Phi(b_A_tv_ts); //am testing alternative versions using Phi_approx here and everywhere else in place of Phi instead
          term_2 = s*exp(normal_lpdf(b_A_tv_ts | 0,1)); 
          term_3 = v*Phi(b_tv_ts);
          term_4 = s*exp(normal_lpdf(b_tv_ts | 0,1)); 
          pdf = (1/A)*(-term_1 + term_2 + term_3 - term_4);
          
          return pdf;
     }
     
     real lba_cdf(real t, real b, real A, real v, real s){
          //CDF of the LBA model
          
          real b_A_tv;
          real b_tv;
          real ts;
          real term_1;
          real term_2;
          real term_3;
          real term_4;
          real cdf;	
          
          b_A_tv = b - A - t*v;
          b_tv = b - t*v;
          ts = t*s;
          term_1 = b_A_tv/A * Phi(b_A_tv/ts);	
          term_2 = b_tv/A   * Phi(b_tv/ts);
          term_3 = ts/A     * exp(normal_lpdf(b_A_tv/ts | 0,1)); 
          term_4 = ts/A     * exp(normal_lpdf(b_tv/ts | 0,1)); 
          cdf = 1 + term_1 - term_2 + term_3 - term_4;
          
          return cdf;
          
     }
     
     real lba_log(real response_time, int response, real k, real A, vector v, real s, real tau){
          
          real t;
          real b;
          real cdf;
          real pdf;		
          real prob;
          real out;
          real prob_neg;

          b = A + k;
          // for (i in 1:rows(RT)){	
             t = response_time - tau;
             if(t > 0){			
                  cdf = 1;
                  
                  for(j in 1:num_elements(v)){
                       if(response == j){
                            pdf = lba_pdf(t, b, A, v[j], s);
                       }else{	
                            cdf = (1-lba_cdf(t, b, A, v[j], s)) * cdf;
                       }
                  }
                  prob_neg = 1;
                  for(j in 1:num_elements(v)){
                       prob_neg = Phi(-v[j]/s) * prob_neg;    
                  }
                  prob = pdf*cdf;		
                  prob = prob/(1-prob_neg);	
                  if(prob < 1e-10){
                       prob = 1e-10;				
                  }
                  
             }else{
                  prob = 1e-10;			
             }		

          out = log(prob);
          //print("lba lprob:",out)
          return out;		
     }
     
    vector lba_rng(real k, real A, vector v, real s, real tau){
          
          int get_pos_drift;	
          int no_pos_drift;
          int get_first_pos;
          vector[num_elements(v)] drift;
          int max_iter;
          int iter;
          real start[num_elements(v)];
          real ttf[num_elements(v)];
          int resp[num_elements(v)];
          real rt;
          vector[2] pred;
          real b;
          
          //try to get a positive drift rate
          get_pos_drift = 1;
          no_pos_drift = 0;
          max_iter = 1000;
          iter = 0;
          while(get_pos_drift){
               for(j in 1:num_elements(v)){
                    drift[j] = normal_rng(v[j],s);
                    if(drift[j] > 0){
                         get_pos_drift = 0;
                    }
               }
               iter = iter + 1;
               if(iter > max_iter){
                    get_pos_drift = 0;
                    no_pos_drift = 1;
               }	
          }
          //if both drift rates are <= 0
          //return an infinite response time
          if(no_pos_drift){
               pred[1] = -1;
               pred[2] = -1;
          }else{
               b = A + k;
               for(i in 1:num_elements(v)){
                    //start time of each accumulator	
                    start[i] = uniform_rng(0,A);
                    //finish times
                    ttf[i] = (b-start[i])/drift[i];
               }
               //rt is the fastest accumulator finish time	
               //if one is negative get the positive drift
               resp = sort_indices_asc(ttf);
               ttf = sort_asc(ttf);
               get_first_pos = 1;
               iter = 1;
               while(get_first_pos){
                    if(ttf[iter] > 0){
                         pred[1] = ttf[iter] + tau;
                         pred[2] = resp[iter]; 
                         get_first_pos = 0;
                    }
                    iter = iter + 1;
               }
          }
          return pred;	
     }
}

data{
     int<lower=2> NUM_CHOICES;
     real<lower=0> A;
     
     int<lower=1> NUM_SUBJECTS;
     int<lower=1> NUM_TRIALS;
     int<lower=1> NUM_RUNS;
     //int<lower=1> trial_subjid[NUM_TRIALS];
     int<lower=1> run_subjid[NUM_RUNS];
     int<lower=1> trial_runid[NUM_TRIALS];
     
     //matrix[NUM_TRIALS,2] RT;
     vector[NUM_TRIALS] response_time;
     int response[NUM_TRIALS];
     int required_choice[NUM_TRIALS];//the choice which would be reinforced on each round.
     int cue[NUM_TRIALS];
     
}
transformed data{
  real<lower=0> lba_sd = 1;
  int PARID_alpha = 1;
  int PARID_lba_k = 2;
  int PARID_lba_tau = 3;
  //from stan manual:
  //Before generating any samples, data variables are read in, 
  //then the transformed data variables are declared and the associated statements executed to define them. 
  //This means the statements in the transformed data block are only ever evaluated once.
  
  matrix[NUM_TRIALS,NUM_CHOICES] choice_outcomes;
  for (i in 1:NUM_TRIALS){
    //if the required choice was chosen then assign the value of 1 to it and distribute the value of -1 across all alternatives.
    //or if the required choice was not chosen, assign the value of -1 to it and distribute the value of 1 across all alternatives.
    for (j in 1:NUM_CHOICES){
      if(j==response[i]){
        choice_outcomes[i,j]=((response[i]==required_choice[i])-0.5)*2;
      }else{
        choice_outcomes[i,j]=-((response[i]==required_choice[i])-0.5)*2/(NUM_CHOICES-1);
      }
    }
  }
  
  print("matrix of choice outcomes generated from required_choice and response input data:")
  print(choice_outcomes);
  print(NUM_SUBJECTS);
  print(NUM_TRIALS);
  print(NUM_RUNS);
  print(trial_runid);
  
  
}

parameters {
  ////////////////////
  //GROUP LEVEL
  real subj_mu[3];
  real<lower=0> subj_sigma[3];
  
  
  ////////////////////
  //SUBJECT LEVEL
  real run_mu[NUM_SUBJECTS,3];
  real<lower=0> run_sigma[NUM_SUBJECTS,3];
  
  ////////////////////
  //RUN LEVEL
  real alpha_pr[NUM_RUNS];
  real k_pr[NUM_RUNS];
  real tau_pr[NUM_RUNS];
}

transformed parameters {
  real <lower=0,upper=1> alpha[NUM_RUNS];
  real<lower=0> k[NUM_RUNS];
  real<lower=0> tau[NUM_RUNS];
  
  alpha = inv_logit(alpha_pr);
  k = exp(k_pr);
  tau = exp(tau_pr);
}

model {
  real exp_val[NUM_RUNS,max(cue),NUM_CHOICES];
  real pred_err;
  real outcome;
  vector[NUM_CHOICES] v;
  //int curRunId=0;//just for printout diagnostic
  exp_val = rep_array(0,NUM_RUNS,max(cue),NUM_CHOICES);
  
  ////////////////////
  //GROUP LEVEL
  
  //priors for mean of subject params
  subj_mu[PARID_alpha] ~ normal(-3,3);
  subj_mu[PARID_lba_k] ~ normal(log(.5),1);
  subj_mu[PARID_lba_tau] ~ normal(log(.5),0.5);
  
  //priors for deviation of subject params from their mean.
  subj_sigma[PARID_alpha] ~ cauchy(0,5); 
  //these have lower prior SDs because our priors for them originally, from Palmeri et al., were lower.
  subj_sigma[PARID_lba_k] ~ cauchy(0,3); 
  subj_sigma[PARID_lba_tau] ~ cauchy(0,2);
  
  ////////////////////
  //RUN LEVEL
  for (s in 1:NUM_SUBJECTS){
    run_mu[s,] ~ normal(subj_mu,subj_sigma);//I think we can do this vectorized here.
  
    //we do NOT assume same run-level variance across subjects but we do assume a constant prior for these.
    //this simplifies the calculation although with sigma unconstrained across subjects it may make the estimation actually harder.
    //these might be too narrow, but I'm wary of having so much variance at every level!
    
    run_sigma[s,PARID_alpha] ~ cauchy(0,4); 
    run_sigma[s,PARID_lba_k] ~ cauchy(0,2); 
    run_sigma[s,PARID_lba_tau] ~ cauchy(0,1);
  }
  
  for (r in 1:NUM_RUNS){
    ////////////////////
    //SUBJECT LEVEL
    alpha_pr[r] ~ normal(run_mu[run_subjid[r],PARID_alpha],run_sigma[run_subjid[r],PARID_alpha]);//weak prior for no learning.  
    k_pr[r] ~ normal(run_mu[run_subjid[r],PARID_lba_k],run_sigma[run_subjid[r],PARID_lba_k]);
    tau_pr[r] ~ normal(run_mu[run_subjid[r],PARID_lba_tau],run_sigma[run_subjid[r],PARID_lba_tau]);
  }
  
  
  
  for (t in 1:NUM_TRIALS){//loop through timesteps.
    for(j in 1:NUM_CHOICES){
      v[j]=logit(exp_val[trial_runid[t],cue[t],j]/4.0+0.75);
      //our LBA model relies on getting values for *each choice* so we do need to model that.
      //if j was the reinforced choice and it was the response value,
      pred_err=choice_outcomes[t,j]-exp_val[trial_runid[t],cue[t],j]; 

      exp_val[trial_runid[t],cue[t],j] = exp_val[trial_runid[t],cue[t],j] + alpha[trial_runid[t]]*pred_err;
      //for occam's razor, I'm going to avoid any transformation from expected value to drift rate. we'll treat expected value as drift rate exactly!
      
    }
    //print("t=",t,"; ",exp_val[NUM_RUNS,cue[t],]);
    response_time[t] ~ lba(response[t],k[trial_runid[t]],A,v,lba_sd,tau[trial_runid[t]]);
  }
  
    
}




#8

This whole thing is a three-level model. I did test a two-level model and it seemed to have reasonable efficiency. What I might do is loop through running the two level model on each subject (across each subject’s runs) to get suitable posteriors for the group mean/deviation parameters.


#9

Some initial thoughts: (which may/may not be of any use!)

  • Perhaps swap the Cauchys for normals? The heavy tails might be doing something weird, especially for small numbers of subjects (that is, you might get divergent transitions when testing with 2 subjects because the group level posterior looks too much like the prior / too much probability mass in the tails). You might want to test with a few more subjects? I guess it depends how much that increases the run time
  • 200 warmup is perhaps a little low. When testing I tend to set warmup to 95% of the value of iter, because I really only care about whether or not NUTS can adapt to the posterior. Then once I’m convinced it can with n warmup iterations, I set the number of warmup iterations to :warmup = n + some_safety_factor , and then iter = warmup + k, where k is the number that makes the plots look good for the publication / report, or the maximum before I run out of memory (preferably the first)
  • That density function looks very funky, and I imagine the parameters that define it could end up being quite correlated in the posterior. I also imagine the automatically calculated derivative consists of many terms, which might be quite computationally expensive to evaluate (but that’s just a guess, and I’m not sure there’s anyway around it here!)
  • The classic BDA advice of rescaling your inputs, although most of your data is integers, which aren’t exactly amenable to rescaling.

If I think of anything else I’ll add it here, but this looks tricky!

This will probably get you the scales / approximate estimates of the individuals estimates, and probably the scale / approximate estimate of the group mean (if the two-level model fits okay), but if you’re using the hierarchical nature of the model / shrinkage to obtain reasonable estimates for some parameters then you might be outta luck.


#10

Just trying out your suggestions now (minus rescaling; the only real data input is the response times and those are modeled by the lba distribution which is not necessarily normal).

One more relevant factor: previously I observed that chain iterations started off fast and got slower. That is the general trend, but I do sometimes see a chain speed up again. I guess that lends credence to the idea that slow chains are exploring difficult regions of the parameter space. I’m not sure if it’s a good sign that the other chains have avoided those areas entirely!?


System time function for profiling
#11

Sorry to jump in so late here, but typically what this means one of two things, assuming it’s not due to something like CPU contention in a multi-threaded or multi-process setting.

If you’re lucky, it will only mean that you haven’t run long enough for adaptation to explore the posterior to estimate the parameter scales (the metric, in geometric terms) or find a good step size (the temporal quantization in diff eq solver terms). Or you haven’t targeted a high enough step size.

If not, it’ll mean you have some insurmountable geometry in the posterior, such as the funnel that we describe in the manual. The only way to solve these problems is to reparameterize to remove posterior correlations and varying curvature where possible.