Model Performance Tracking - Real-time convergence/divergence information for cmdstanr

I have been struggling with convergence challenges in a fairly advanced model, which I have mentioned in another thread. The aim of this topic is to bring up real-time tracking of a CmdStan model from R.

I have, in many cases, noticed that for a new model almost all the transitions are divergent, for one reason or another. My models can run for over an hour, even at their most optimized, and still result in a failed fit overall. It would be great to be able to observe the sampler's performance (ShinyStan-style, I suggest) in real time, or updated every few iterations. I could then kill failing models early and correct them, speeding up development time.

I had thought that TensorBoard might be usable for this, but working out how is beyond me at this stage.

I have been thinking about this for a few days now, as I don't think I am the only one with this issue. I thought there might be some example code out there for loading the CmdStan log file intermittently and processing it into an object to be plotted, either with a refresh button or refreshing at a specified rate. The advantage would be increased productivity and learning about which models work, and which don't, with difficult data sets.

The model development phase can be very difficult for real-world data in many cases, and in retrospect a tool like this would save significant time, in my opinion.

I assume that there is already code out there that performs this task to some extent, but I just can’t find it. Maybe someone can point me in the right direction?

4 Likes

Hi Cynon,

This is related to the discussion here: During-sampling diagnostics (feature request & design discussion) · Issue #425 · stan-dev/cmdstanr · GitHub

It seems that there is more interest in this issue than the lack of responses there would suggest :)

Right now you cannot do this with cmdstanr, but you could if we offered a non-blocking way of calling $sample() (see Background/asynchronous sampling (feature request & design discussion) · Issue #424 · stan-dev/cmdstanr · GitHub). I think that could be achieved fairly easily: the blocking behaviour was added on top, while the more native mode of what cmdstanr actually does (running CmdStan executables) is non-blocking.
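
In the meantime, one rough way to approximate non-blocking sampling from the R side is to launch the run in a background R process. A minimal sketch, assuming the callr package is installed (this is just a workaround, not the proposed cmdstanr feature, and cmdstanr_example() stands in for your own model$sample() call):

library(callr)

folder <- "folder_for_new_files"
if (!dir.exists(folder)) dir.create(folder)

# launch sampling in a separate R process so the current session stays free
bg <- callr::r_bg(
  function(out_dir) {
    library(cmdstanr)
    fit <- cmdstanr_example(output_dir = out_dir, iter_sampling = 500000)
    fit$output_files()  # returned to the parent process once the run finishes
  },
  args = list(out_dir = folder)
)

bg$is_alive()     # TRUE while sampling is still running
# bg$get_result() # the CSV paths, once the run has finished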

After we add that, all we would have to do is read in the CSV files every X seconds or minutes.
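
A hedged sketch of what that periodic reading could look like once the CSVs are being written (divergent__ and stepsize__ are the standard CmdStan sampler diagnostic columns; "folder_for_new_files" is an assumed output folder):

library(cmdstanr)

# summarise whatever has been written to the chain CSVs so far;
# CmdStan prefixes metadata lines with "#", so comment.char = "#" skips them
monitor_csvs <- function(files) {
  for (f in files) {
    draws <- tryCatch(
      suppressWarnings(utils::read.csv(f, comment.char = "#")),
      error = function(e) NULL  # e.g. a partially written last line
    )
    if (is.null(draws) || nrow(draws) == 0) next
    cat(sprintf("%s: %d draws, %d divergent, current stepsize %.3g\n",
                basename(f), nrow(draws), sum(draws$divergent__),
                utils::tail(draws$stepsize__, 1)))
  }
}

files <- list.files("folder_for_new_files", pattern = "\\.csv$", full.names = TRUE)

# poll every 30 seconds; interrupt the loop (or kill the model) once you have seen enough
repeat {
  monitor_csvs(files)
  Sys.sleep(30)
}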

cc-ing @mike-lawrence, the author of the issue

6 Likes

Agreed, this would be super useful, especially for models that are slow to sample with HMC.

For the immediate future, a simple solution that might help in your workflow is to manually load the CSV files in a fresh R session and check whatever you need to check.

1 Like

Unfortunately, the CSV file contents only become available after the run is complete, as far as I can see. If you have a way of getting the CSV file to be appended to during the run, that would be great, but I am told this would require modification to the base code of CmdStan.

Under rstan this was possible: the log file was continuously written, allowing for real-time convergence/performance monitoring. This was really useful when you didn't know why the model was running slowly.
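
For reference, this is roughly what that looked like, presumably via rstan's sample_file argument (the model here is just a minimal stand-in):

library(rstan)

# sample_file makes rstan write draws out to per-chain CSV files during the run,
# so they can be inspected from another session mid-sampling
code <- "parameters { real mu; } model { mu ~ normal(0, 1); }"
fit <- stan(
  model_code = code,
  chains = 4,
  iter = 2000,
  sample_file = "rstan_samples.csv"
)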

You can do the following temporarily, which is what @jsocolar had in mind:

Run your sampling in an R session. You need to specify the output folder:

library(cmdstanr)

folder <- "folder_for_new_files"
if (!dir.exists(folder)) {
  dir.create(folder)
}

# some long example sampling
fit <- cmdstanr_example(output_dir = folder, iter_sampling = 500000)

and then in another session inspect the CSV files:

setwd(...)  # set to the directory that contains the output folder
folder <- "folder_for_new_files"

# list the chain CSVs that CmdStan is writing
files <- file.path(folder, list.files(path = folder, pattern = "\\.csv$"))
print(files)

# though inspecting them separately might be less error-prone,
# as the CSV files can have differing numbers of iterations
current_fit <- as_cmdstan_fit(files)
5 Likes

Depending on the output that you’re interested in inspecting, you might additionally want to use save_warmup = T in your call to my_cmdstanr_model$sample().
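
For example, something along these lines (my_cmdstanr_model and my_data are placeholders):

# save_warmup = TRUE means warmup iterations are written to the CSVs too,
# so there is something to look at before the sampling phase starts
fit <- my_cmdstanr_model$sample(
  data = my_data,
  chains = 4,
  parallel_chains = 4,
  save_warmup = TRUE,
  output_dir = "folder_for_new_files"
)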

Also note that even if no output_dir is set, you can probably track down the output files in the default temporary directory if needed. The R function tempdir() is helpful here.
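
For instance:

# CSVs written without an explicit output_dir end up under tempdir()
csvs <- list.files(tempdir(), pattern = "\\.csv$", full.names = TRUE)
file.info(csvs)[, c("size", "mtime")]  # sizes and last-modified times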

3 Likes

The CSV files are only available once the chain finishes, so there is no way to track progress or stop a chain partway if it has failed.

We need to do this while the chain is running.

The CSV files are definitely created at the beginning of model fitting and updated every time the model completes an iteration that is ultimately saved (i.e. every iteration if save_warmup = T and you aren’t thinning). Perhaps the problem is that the CSVs are being saved in a non-obvious location on your system? Are you on a regular personal computer? If so, what happens if you follow @rok_cesnovar’s code above?
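
One quick way to check whether the files are actually being appended to on your system is to watch their sizes over a minute or so (the folder name is assumed to match the earlier example):

files <- list.files("folder_for_new_files", pattern = "\\.csv$", full.names = TRUE)

# if the chains are writing as they go, these sizes should increase between calls
sizes_before <- file.info(files)$size
Sys.sleep(60)
sizes_after <- file.info(files)$size
data.frame(file = basename(files), before = sizes_before, after = sizes_after)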

This would be great if it were true. You are right that the CSVs are created at the beginning of the run, but in my case they remain empty until the chain has finished all iterations.

In my case I am tracking down the temporary files only, so no save_warmup or output_dir.

I am excited to hear there might be a solution here. I will try it tomorrow.

Bear in mind that with the default save_warmup = F, the CSVs will remain empty until warmup terminates (because nothing is getting saved). Some models spend most of their time in warmup and then move through the sampling phase relatively quickly (e.g. if good tuning of the dynamic HMC sampler results in much faster post-warmup iterations), so it can be hard to catch them in the sampling phase.

Also bear in mind that some divergences during warmup are expected, so you need a different heuristic than occasional divergences for detecting an obviously unhealthy posterior geometry during warmup. In my experience, a good heuristic during warmup is that after a few hundred iterations I’d hope to see a complete lack of divergences below some given step-size, and ideally this step-size would be associated with tree-depths smaller than the max_treedepth and with acceptance probabilities consistently smaller than adapt_delta.
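
A rough sketch of that kind of check on a partially written CSV (the file path is a placeholder; divergent__, treedepth__, accept_stat__ and stepsize__ are the standard CmdStan sampler diagnostic columns):

# read whatever has been written so far; "#" lines are CmdStan metadata
draws <- utils::read.csv("folder_for_new_files/chain_1.csv", comment.char = "#")

# divergence rate, mean tree depth and mean acceptance probability,
# binned by rounded step size: roughly what the heuristic above looks at
diag_cols <- data.frame(
  divergent = draws$divergent__,
  treedepth = draws$treedepth__,
  accept    = draws$accept_stat__
)
aggregate(diag_cols, by = list(stepsize = signif(draws$stepsize__, 2)), FUN = mean)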

3 Likes

I figured a couple of things out :)

CmdStan is storing iterations to the CSV file progressively as you suggest, but only for the first chain!! :o

In my case I am not using any special settings, and am running on Windows. I just look up the file in my temp directory “C:\Users\Cynon\AppData\Local\Temp\Rtmp0q7HZt”.

I tested a single chain, and the CSV file is being written out as you indicate, but when I test 4 chains the CSV files for the other chains remain blank. AHA! The files for the other chains do get written to eventually; I can't tell how many times before the run ends.

Weird huh?

library(brms)

# toy intercept-only model with many iterations so the run takes a while
model_test <- brm(
  bf(y ~ 1),
  data = data.frame(y = rnorm(1e3, 0, 1)),
  chains = 4,
  iter = 1e6,
  warmup = 500,
  refresh = 1e2,
  thin = 100,
  inits = 0,
  backend = "cmdstanr",
  threads = threading(16),
  control = list(max_treedepth = 20)
)

This at least offers a partial solution: you can assess the first chain for performance. On the Windows version, anyway.

Ah, if you want the chains to run in parallel, you need to also specify parallel_chains = 4.

3 Likes

Oops, parallel_chains=4 is for if you’re using cmdstanr; for brms you need cores=4.
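
So, for the brms example above, the parallel version would be something like this (keeping the other arguments as they were):

library(brms)

# cores = 4 runs the four chains in parallel, so all four chain CSVs
# should be written to as sampling progresses
model_test <- brm(
  bf(y ~ 1),
  data = data.frame(y = rnorm(1e3, 0, 1)),
  chains = 4,
  cores = 4,
  iter = 1e6,
  warmup = 500,
  refresh = 1e2,
  thin = 100,
  inits = 0,
  backend = "cmdstanr",
  threads = threading(16),
  control = list(max_treedepth = 20)
)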

4 Likes

You're right. I'm just digging a hole for myself. In my haste I had neglected to turn on the concurrent chains… I'll check again tomorrow whether the behaviour is corrected for all chains. I imagine it will be.

Then it will just come down to user error on my part: I was breaking CmdStan's ability to keep writing to the files because I opened them directly rather than opening a copy (which works fine, I think).