Looking for an optimization for running multiple Stan models in parallel with OpenMP

Hi,

I am using OpenMP to run 10 Stan models with three chains each in parallel on four cores. I’m using 10-fold cross-validation, so each model corresponds to a fold and I’m running 30 chains. There’s a high variance in the amount of time a core takes to complete a chain (it’s ranged from three hours to three days). As it stands, each core runs every third chain, then stops when it completes chain #28, 29, or 30 and waits for the others to complete. Due to the variance in the amount of required time, I’m finding one core gets “stuck” on a model and I have to wait a day or two for it to finish behind the others. Is there a way to require a core that finishes its workload to step in and pick up another chain for the straggling core?
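One thing I did notice: the chains are dispatched with parallel::mclapply in the code below, and its documentation says the default mc.preschedule = TRUE divides the jobs among the cores up front, which sounds like the round-robin pattern I'm seeing. Here is a sketch of what I have in mind instead, where a free core immediately forks the next pending chain, though I haven't tested whether it behaves well with 30 Stan fits:

library(parallel)

results <- mclapply(
  1:30,                    # one element per chain (10 folds x 3 chains each)
  generate_sample_chain,   # the per-chain worker defined in the code below
  mc.preschedule = FALSE,  # dynamic scheduling: next free core takes next chain
  mc.cores = 4
)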

The relevant code is below, and more information about the model is here.

Thanks very much!

Best,

Richard

## RJC: From an R-bloggers tutorial on k-fold cross-validation in Stan, modified to use less memory all at once.
stan_kfold_memory_friendly <- function(file, save_file_stem, list_of_datas, chains, cores, iter, control, ...){
  library(rstan)     # stan_model(), sampling(), sflist2stanfit()
  library(parallel)  # mclapply()
  badRhat <- 1.1 # apparent leftover from the original tutorial (an Rhat cutoff); unused here
  n_fold <- length(list_of_datas)
  model <- stan_model(file = file)
  # Fit one chain of one fold, save the fit to disk, and free its memory:
  generate_sample_chain <- function(i){
    # Fold number for chain index i:
    k <- ceiling(i / chains)
    s <- sampling(model, data = list_of_datas[[k]], 
                  chains = 1, chain_id = i,
                  iter = iter,
                  control = control)
    save(s, file = paste0(save_file_stem, "i=", i, "_k=", k))
    rm(s)
    gc()
    return(NULL)
  }

  # First, run all chains in parallel; each worker saves its fit to disk and
  # returns NULL, so `sflist` here is just a list of NULLs:
  sflist <- 
    mclapply(1:(n_fold*chains), mc.cores = cores, 
             generate_sample_chain)
  # NOTE: RStudio will mask the output of the individual calls to sampling().
  #       If I want to see that output, run everything directly from the console!
  #       It will be a bit of a mess to read, but it provides finer-grained information
  #       about the progress of parallelized model fitting for really big models.
  
  # Then merge the chains of each fold into one stanfit per fold, saving each
  # to disk instead of returning them all in a list:
  for(k in 1:n_fold){
    sflist <- list()
    for(i in 1:chains){
      load(file = paste0(save_file_stem, "i=", (chains*(k-1)+i), "_k=", k))
      sflist[[i]] <- s
      rm(s)
      gc()
    }
    
    # Merge the `chains` single-chain fits for this fold
    merged_model <- sflist2stanfit(sflist)
    save(merged_model, file = paste0(save_file_stem, "k=", k))
    
    # Delete the intermediate single-chain files
    for(i in 1:chains){
      filename <- paste0(save_file_stem, "i=", (chains*(k-1)+i), "_k=", k)
      if(file.exists(filename)){
        file.remove(filename)
      }
    }
  }
  
  # At this point the directory referenced by save_file_stem should contain
  # n_fold stanfit files, each holding `chains` chains, none of them loaded
  # into memory.
  invisible(NULL)
}
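For reference, a call might look like the following; the file names, the fold-building helper, and the control settings are hypothetical placeholders, not from the post:

# Hypothetical usage: `build_fold_data()` stands in for however the ten
# training/holdout data lists are actually constructed.
folds <- lapply(1:10, build_fold_data)

stan_kfold_memory_friendly(
  file           = "model.stan",
  save_file_stem = "fits/model_",
  list_of_datas  = folds,
  chains         = 3,
  cores          = 4,
  iter           = 2000,
  control        = list(adapt_delta = 0.9)
)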

You could call with chains = 1 in parallel, but then you would have to re-assemble the chains yourself to do convergence diagnostics, etc.
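For instance, a minimal sketch of that re-assembly, assuming `single_chain_fits` is a list of one-chain stanfit objects (the name is illustrative):

library(rstan)

# Merging restores the usual cross-chain diagnostics: Rhat and n_eff are then
# computed across all chains rather than within a single one.
fit <- sflist2stanfit(single_chain_fits)
print(fit)  # summary table includes Rhat and n_eff per parameter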

More generally, if you are seeing run-time variance on the order of days, then your Markov chains are not mixing efficiently enough for you to do anything reliable with the draws anyway.
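As a rough check, here is a sketch (mine, not part of your workflow) that walks the merged fits your function saves to disk, using `save_file_stem` and `n_fold` as named there:

library(rstan)

for(k in 1:n_fold){
  load(paste0(save_file_stem, "k=", k))  # restores `merged_model` for fold k
  print(get_elapsed_time(merged_model))  # warmup/sampling seconds per chain
  rhats <- summary(merged_model)$summary[, "Rhat"]
  cat("fold", k, "max Rhat:", max(rhats, na.rm = TRUE), "\n")
}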

Thank you very much. I'll definitely take another look at the convergence diagnostics for each of the folds to see if I've missed something. (I really should have done that before posting; I skipped that step because the diagnostic plots all looked healthy when I was not using k-fold cross-validation, which may have been a rookie mistake.) I thought running 30 models with chains = 1 in parallel is exactly what this code is already doing, though:

  # First, run all chains in parallel; each worker saves its fit to disk and
  # returns NULL, so `sflist` here is just a list of NULLs:
  sflist <- 
    mclapply(1:(n_fold*chains), mc.cores = cores, 
             generate_sample_chain)

where each call to generate_sample_chain calls:

    s <- sampling(model, data = list_of_datas[[k]], 
                  chains = 1, chain_id = i,
                  iter = iter,
                  control = control)

Then I go ahead and re-assemble the chains:

  # Then merge the chains of each fold into one stanfit per fold, saving each
  # to disk instead of returning them all in a list:
  for(k in 1:n_fold){
    sflist <- list()
    for(i in 1:chains){
      load(file = paste0(save_file_stem, "i=", (chains*(k-1)+i), "_k=", k))
      sflist[[i]] <- s
      rm(s)
      gc()
    }
    
    # Merge the `chains` single-chain fits for this fold
    merged_model <- sflist2stanfit(sflist)
    save(merged_model, file = paste0(save_file_stem, "k=", k))
    
    # Delete the intermediate single-chain files
    for(i in 1:chains){
      filename <- paste0(save_file_stem, "i=", (chains*(k-1)+i), "_k=", k)
      if(file.exists(filename)){
        file.remove(filename)
      }
    }

Am I doing something else wrong, or is this just a limitation with OpenMP?

Thanks,

Richard