Hi,
I am using OpenMP to run 10 Stan models with three chains each in parallel on four cores. I’m using 10-fold cross-validation, so each model corresponds to a fold and I’m running 30 chains. There’s a high variance in the amount of time a core takes to complete a chain (it’s ranged from three hours to three days). As it stands, each core runs every third chain, then stops when it completes chain #28, 29, or 30 and waits for the others to complete. Due to the variance in the amount of required time, I’m finding one core gets “stuck” on a model and I have to wait a day or two for it to finish behind the others. Is there a way to require a core that finishes its workload to step in and pick up another chain for the straggling core?
The relevant code is below, and more information about the model is here.
Thanks very much!
Best,
Richard
## RJC: From R-bloggers tutorial on k-fold cross-validation in Stan, modified to use less memory all at once.
stan_kfold_memory_friendly <- function(file, save_file_stem, list_of_datas, chains, cores,iter,control,...){
library(pbmcapply)
badRhat <- 1.1 # don't know why we need this?
n_fold <- length(list_of_datas)
model <- stan_model(file=file)
generate_sample_chain <- function(i){
# Fold number:
k <- ceiling(i / chains)
s <- sampling(model, data = list_of_datas[[k]],
chains = 1, chain_id = i,
iter=iter,
control=control)
save(s, file=paste(save_file_stem,"i=",i,"_k=",k,sep=""))
rm(s)
gc()
return(NULL)
}
# First parallelize all chains:
sflist <-
mclapply(1:(n_fold*chains), mc.cores = cores,
generate_sample_chain)
# NOTE: RStudio will mask the output of the individual calls to Stan().
# If I want to see that output, run everything directly from the console!
# It will be a bit of a mess to read, but provides finer-grain information
# about the progress of parallelized model fitting for really big models.
# Then merge the K * chains to create K stanfits, but save them to disk instead of returning in a list:
for(k in 1:n_fold){
sflist <- list()
for(i in 1:chains){
load(file=paste(save_file_stem,"i=",(chains*(k-1)+i),"_k=",k,sep=""))
sflist[[i]] <- s
rm(s)
gc()
}
inchains <- 1:chains
# Merge `chains` of each fold
merged_model <- sflist2stanfit(sflist[inchains])
save(merged_model, file=paste(save_file_stem,"k=",k,sep=""))
# Delete the intermediate files I just created
for(i in 1:chains){
filename <- paste(save_file_stem,"i=",(chains*(k-1)+i),"_k=",k,sep="")
if(file.exists(filename)){
file.remove(filename)
}
}
}
# At this point the directory referenced to by file_stem should contain K stan_fit files,
# each containing n_chains chains. We don't have those loaded into memory.
return()
}