Combine rstan::read_stan_csv's with different #s of samples?

Before I wrote something myself, I thought I’d check if anyone has already written a function to combine the results of rstan::read_stan_csv for in-progress chains that are likely to have different numbers of samples each. If not, should I just shorten each chain to the length of the shortest one, or is there maybe a more nuanced approach that doesn’t toss potentially useful information?

1 Like

Here’s the naive approach, where all the chains are simply shortened to the length of the shortest:

combine_samples = function(sflist){
  min_iter = min(unlist(lapply(sflist,function(x){attr(x,'sim')$iter})))
  chop = function(x,min_iter){
    attr(x,'sim')$iter = min_iter
    attr(x,'sim')$n_save = min_iter
    z = attr(x,'sim')$samples[[1]]
    z = z[1:min_iter,]
    attributes(z)$sampler_params = attributes(z)$sampler_params[1:min_iter,]
    attr(x,'sim')$samples[[1]] = z
  sflist = lapply(sflist,chop,min_iter=min_iter)
1 Like

We’ve been talking about doing ragged R-hat and n_eff calculations because we want to start thinking about runs based on times rather than number of iterations.

Nothing intrinsically hard about this—if you look at the formulas, it’s pretty easy to see how to compute with unequal sizes. But it immediately brings up the issue of whether to weight by draw or weight by chain or something in between (like say sqrt of length of chain).

I’d like to recommend that read_stan_csv have some options for reading in-progress csv files. In particular:

For example:

samps <- read_stan_csv(list.files(".",pattern=“foo.StanSamps_.*csv”))
Warning messages:
1: In FUN(X[[i]], …) : line with “Elapsed Time” not found
2: In read_stan_csv(list.files(".", pattern = “foo.StanSamps_1.*csv”)) :
the number of iterations after warmup found (-44) does not match iter/warmup/thin from CSV comments (20)

Stan model ‘foo.StanSamps’ does not contain samples.

I’d like to suggest that there be an option “inprogress=TRUE” where the reader goes through and reads the files even if they don’t match etc, and even if they don’t have the comments at the bottom, and then truncates each run to the length of the shortest one, and prints a warning message something like:

“Read samples from %d in progress chains. All chains truncated to %d total samples. Use these results for debugging and monitoring purposes only” or the like

It is hard enough to debug complex models without the utility functions refusing to even look at the data ;-)

1 Like

YES… this kind of thing is more broadly held up because it’ll be so much easier to do once the writers are spitting out a less insane format.

A couple of people have packages to deal with the .csv output live already (me included) so if you just want a hack there are a few available.

1 Like

Have you guys considered the option to log samples to a database? maybe create a table called stan_samples_modelname with an appropriate schema and column names, and then just slam the samples in?

You could even maybe write a wrapper program that listens to a file system fifo and interacts with the DB so that DB related errors don’t crash the computation, or something.

I wrote a writer for that (still somewhere on my github…) but even with a threadpool writing in long format it ran into database performance issues. It was pretty robust though (it was a little sad when it would finish writing samples a few hours after the model finished running…) I’ve since learned more about it and I think it might work if you wrote out Postgres arrays rather than full key-value for each sample because the overhead on the DB is per-row insert.

1 Like

Most likely if you’re going to write in long format, you want to start a transaction, then slam each of the parameters in, then end the transaction. That would have way less overhead I think. If you just write one at a time without the explicit transaction, the database makes a transaction for each row.

I think you don’t understand how thorough I was. :)

I got write rates (in terms of rows) as high as any I saw cited anywhere on commodity hardware. If you have striped raid on multiple SSD’s you could do it, but that’s not what we’re looking for. I’m pretty sure if you wrote arrays an unpacked afterwards it would be worth trying.

What about writing in wide format? make a table specifically for your particular model with the right columns etc. Each row then contains thousands of columns. That was what I was originally thinking of anyway.

I think I tried that as well but the C++ postgres library I used didn’t have great performance when it came to creating really wide prepared queries so it didn’t get me much. OTOH using the array type is pretty natural and basically the same thing with much less overhead…

I was thinking perhaps something not postgres specific. I see both mysql/mariadb and postgres support some kind of JSON formatted column type. I imagine you could use a json format in with a sqlite TEXT column as well. Having different database options lets people use existing databases and not have to set up a database just for Stan. I could imagine a generic table like:

Model, Chain, SampleID, DateTime, SampleJSON

and then use a JSON parser to unpack the parameter vector and parameter names all encoded as some kind of JSON object…

Yeah, that sort of thing could work, I’m just warning that it’s non-trivial not just on first principles but because many databases are optimized for consistency and the ones that aren’t don’t necessarily make the data any easier to handle than writing out a binary file.

This is, in part, because the warmup iterations don’t mean anything. The first iteration after a window transition will typically diverge. The estimated variances are weird. And it is not even a Markov Chain yet.

If you are thinking of reading the CSV files after the warmup period, that is possible. But at that point, you have already waited about two-thirds or more of the wall time.

They do mean something. They tell you how well the sampler is doing at exploring the space. There’s a distinct pattern to poor exploration vs. effective exploration. It’s possible to go from one state to the other but often the first sign you should re-parameterize comes after the first dozen or so iterations.

The first window is at least a Markov Chain, but it doesn’t tell you anything about how much the performance is going to change when it starts adapting. What I more meant was that if we read a CSV with warmup iterations into a stanfit object, we can’t stop the user from doing invalid things with it like calculating means, Rhats, effective sample sizes, etc.

I understand your concern about people doing invalid things and why you might want to keep rstan the way it is. Personally I think users who do that must not know anything about MCMC in which case… well, what can you do except recommend a book? I’m more concerned about users who are likely to push the envelope in terms of what models Stan can reliably be used for and in that use case seeing whether the initial exploration is doing well or not can be critical to testing a new model every 15 minutes instead of a new model every few hours.

+1 to a new model every 15 minutes instead of every few hours. I’m tempted to create a wrapper for cmdstan that does monitoring and graphical output of diagnostics to a pdf file along the way. Maybe randomly selects some number of parameters and graphs their traceplots, shows the step-size through sampling time, shows divergences, just generally gives you a lot of information about how the sampling is going.

I don’t think making users keep their seatbelts fastened is incompatible with letting them look out the window. I’d very much like to have something to monitor progress before sampling is finished. I’m not sure what I want to see by default—really what I’d want is live ShinyStan, but that’s asking a lot. We’ve looked into things like online R-hat (can’t really do split R-hat online efficiently for the same reasons that 50% quantiles are hard to compute online), but as Ben points out, that only makes sense once you get to sampling. Before then, we’d want to monitor aspects of adaptation like step size, mass matrix, whether multiple chains are converging to the same mass matrix, etc.