I’m building a large model (millions of rows, tens of thousands of parameters) and calling the summary method after the model is done fitting takes 30-60 minutes.
Technically, I’m doing this in cmdstanr, so I’ll show the simple code below, but I’ve had similar experiences in rstan.
mod <- cmdstan_model("...")
fit <- mod$sample(data = ...)
mod_summary <- fit$summary() # this takes surprisingly long
Is it the diagnostics that are taking a while? I can’t imagine the means and quantiles would take very long, even with the size of the model. Is there anything I can do to speed it up? Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?
Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?
Yes, you should be able to pass in the args "mean" and "median".
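For example, something like this should work (a sketch, assuming a fitted cmdstanr object named fit; extra arguments to $summary() are forwarded to posterior::summarize_draws()):

```r
# Only compute the posterior mean and median, skipping the
# quantiles and convergence diagnostics the default summary adds.
mod_summary <- fit$summary(variables = NULL, "mean", "median")
```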
You could try timing how long calling the draws function takes, then immediately call summarize_draws. If it's an I/O bottleneck, the former will take the 30-60 minutes.
And if you're only interested in select variables, the draws function lets you specify which variables to read in. This should speed up I/O and use far less memory, at which point you can use methods in the posterior package directly.
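For instance (a sketch; "alpha" and "beta" are placeholder variable names, not from your actual model):

```r
library(posterior)

# Read only the variables you actually need from the CSVs
draws <- fit$draws(variables = c("alpha", "beta"))

# Then summarize with the posterior package directly
summarize_draws(draws, "mean", "median")
```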
The model is 4 chains, 1000 warm-up, 1000 post warm-up.
Unfortunately I need means and medians for basically every variable, but I will try passing in those args.
Do you have any insight into why the draws function would be slow? Is it not simply reading the CSVs created during the fit? Am I just underestimating the data size because I've never actually loaded something that was 100k x 4000?
FWIW, I usually use the fread function from the data.table package on the in-progress CSVs if I'm interested in the samples before the model is done fitting, and while it's not "fast," I think it's a good bit quicker than $draws.
Internally the draws method calls the read_cmdstan_csv() function, which currently uses the vroom package to read in the CSV. It's faster than the standard read.csv(), but yeah, I think it's slower than fread(). There's been some related discussion here:
As a workaround, if fread is faster, then I think you can:
1. Load up the individual chains with fread.
2. Glue them together with posterior::bind_draws(df1, df2, along = "chain") (so posterior knows they are different chains).
3. Pass that into posterior::summarize_draws() to do the summarizing (or do custom summarizing, or whatever).
This makes me think it might be handy to have a thinning option on our reading functions, but I don't know how that would work. If the analysis functions start getting choked up, you could do some thinning after step 1 above (that will make your ESS estimates wonky, though presumably you don't need those for your analysis).
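A rough sketch of that workaround (hypothetical file names; assumes a Unix-like system where fread can shell out to grep to strip CmdStan's comment lines):

```r
library(data.table)
library(posterior)

# Hypothetical paths to the per-chain CmdStan output CSVs
files <- c("output-1.csv", "output-2.csv", "output-3.csv", "output-4.csv")

# 1. Read each chain with fread, dropping lines starting with "#"
#    (CmdStan writes adaptation and timing info as "#" comment lines)
chains <- lapply(files, function(f) {
  as_draws_df(fread(cmd = paste0("grep -v '^#' ", f)))
})

# 2. Bind along the chain dimension so posterior knows these are
#    separate chains
draws <- do.call(bind_draws, c(chains, list(along = "chain")))

# 3. Summarize (optionally thin_draws() first, at the cost of
#    wonky ESS estimates)
mod_summary <- summarize_draws(draws, "mean", "median")
```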
Reviving this thread because I'm now experiencing some long waits for the summary method on an optimize object. Unless I'm mistaken, there aren't giant CSVs to be read in since we're optimizing and not sampling. Any clue why this method is also slow, or ways I might speed it up? @jonah @mitzimorris
Why call summary on an optimize object?
The optimize method returns a single vector containing the penalized MLE (i.e., the posterior mode).
It looks like CmdStanR hands this single draw off to the posterior package's summarize_draws method. That's all I know; I hope one of the R devs can help out here.
How long does it take to get a response via mle?
Both summary and mle read in the CSV file. If summary is taking way longer than mle, this needs to be investigated. (Just opened an issue on this: https://github.com/stan-dev/cmdstanr/issues/315)
Yeah, this is the same issue as https://github.com/stan-dev/cmdstanr/issues/299
It seems that vroom, or how we use vroom, has issues with that many parameters ("has issues" = runs very slowly).
Thanks, and nice recommendation. Here is what I tried: it reduced the draw time from 50s (the model itself only took 23s to fit, with 4 chains, 1000 warm-up and 1000 post-warm-up samples) to 3s if I just use fit$draws.