Summary method slow for large models

I’m building a large model (millions of rows, tens of thousands of parameters) and calling the summary method after the model is done fitting takes 30-60 minutes.

Technically, I’m doing this in cmdstanr, so I’ll show the simple code below, but I’ve had similar experiences in rstan.

mod <- cmdstan_model("...")
fit <- mod$sample(data = ...)
mod_summary <- fit$summary() # this takes surprisingly long

Is it the diagnostics that are taking a while? I can’t imagine the means and quantiles would take very long, even with the size of the model. Is there anything I can do to speed it up? Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?

2 Likes

how many chains are you running?

do you want to get means/medians for all variables or just some of them?

cmdstanr feeds the assembled sample to the posterior package’s summarize_draws function (see “Summaries of draws objects — draws_summary” in the posterior docs)

Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?

yes - you should be able to pass in the args “mean” and “median”
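for example, something like this (a sketch - fit$summary() forwards extra arguments to posterior’s summarise_draws, so naming only the measures you want should skip the expensive diagnostics):

# compute only posterior means and medians for all variables
mod_summary <- fit$summary(variables = NULL, "mean", "median")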

you could try timing how long calling the draws function takes, then immediately call summarize_draws - if it’s an I/O bottleneck, the former will account for the 30-60 minutes.
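e.g., a rough sketch:

system.time(draws <- fit$draws())                    # I/O: reads the CSVs into memory
system.time(s <- posterior::summarise_draws(draws))  # pure computation on the draws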

and if you’re only interested in select variables, the draws function lets you specify which variables to read in; this should speed up I/O and use far less memory - at which point, you can use methods in the posterior package directly.
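e.g. (a sketch - the variable names here are hypothetical):

draws <- fit$draws(variables = c("alpha", "beta"))   # reads only these columns
posterior::summarise_draws(draws, "mean", "median")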

2 Likes

I was just messing with a 7k-parameter model. $draws() was pretty slow until I did this.

I think this only loads the beta variable and only computes the requested summaries on it:

library(posterior)  # summarise_draws()
library(magrittr)   # %>%
fit$draws("beta") %>%
  summarise_draws(median, mad, rhat, ess_bulk)

summarise_draws comes from the posterior package (https://github.com/stan-dev/posterior)

3 Likes

The model is 4 chains with 1000 warm-up and 1000 post-warm-up iterations each.
Unfortunately, I need means and medians for basically every variable, but I will try passing in those args.

Do you have any insight into why the draws function would be slow – is it not simply reading the .csvs created during the fit? Am I just underestimating the data size because I’ve never actually loaded something that was 100k x 4000?

FWIW, I usually use the fread function from the data.table package on the unfinished csvs when I’m interested in the samples before the model is done fitting. While it’s not “fast,” I think it’s a good bit quicker than $draws().

2 Likes

Internally, the draws method calls the read_cmdstan_csv() function, which currently uses the vroom package to read in the csv. It’s faster than the standard read.csv(), but yeah, I think it’s slower than fread(). There’s been some related discussion here:

2 Likes

multiply again by 8 to get the number of bytes: 100k parameters x 4000 draws x 8 bytes per double - that’s 3.2 GB

1 Like

As a workaround, if fread is faster, then I think you can do the following (a sketch follows the list):

  1. Load up the individual chains with fread
  2. Glue them together with posterior::bind_draws(df1, df2, along = "chain") (so posterior can know they are different chains)
  3. Pass that into posterior::summarize_draws() to do the summarizing (or do custom summarizing or whatever)
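Here’s a rough sketch of those steps (untested; assumes the posterior and data.table packages, and note the column names will use Stan’s dotted CSV notation, e.g. beta.1 rather than beta[1]):

library(data.table)
library(posterior)
# 1. read each chain's CSV with fread, skipping Stan's '#' comment lines
chains <- lapply(fit$output_files(), function(f)
  as_draws_df(fread(cmd = paste0("grep -v '^#' ", f))))
# 2. glue the chains together so posterior knows they are separate chains
draws <- do.call(bind_draws, c(chains, list(along = "chain")))
# 3. summarize (or do custom summarizing)
summarize_draws(draws, "mean", "median")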

This makes me think it might be handy to have a thin option on our reading functions, but I don’t know how that would work. If the analysis functions start getting choked up, you could do some thinning after step 1 above (that will make your ESS estimates wonky, but presumably you don’t need those for your analysis).
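e.g. with posterior’s thin_draws (a sketch):

draws_thin <- posterior::thin_draws(draws, thin = 10)  # keep every 10th draw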

3 Likes

reviving this thread because I’m now experiencing some long waits for the summary method on an optimize object. Unless I’m mistaken, there aren’t giant csvs to be read in since we’re optimizing and not sampling – any clue why this method is also slow or ways I might speed it up?
@jonah @mitzimorris

hi Walker,

is the optimize method itself reasonably fast?

why call summary on an optimize object?
the optimize method returns a single vector containing the penalized MLE (i.e., the posterior mode).

it looks like CmdStanR hands this single draw off to the posterior package’s summarize_draws method - that’s all I know - I hope one of the R devs can help out here.

Yes, the optimize method takes only a few minutes (large model, 300K parameters).

I was using the summary method because the intro to cmdstanr does so:

fit_mle <- mod$optimize(data = data_list, seed = 123)
fit_mle$summary()

Although it appears I can just call $mle() to get a vector of the solution.
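e.g. (a sketch; “sigma” is just a hypothetical parameter name):

theta_hat <- fit_mle$mle()  # named numeric vector of point estimates
theta_hat["sigma"]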

how long does it take to get a response via mle?
both summary and mle read in the csv file. if summary is taking way longer than mle, this needs to be investigated. (just opened an issue on this: https://github.com/stan-dev/cmdstanr/issues/315)

1 Like

Yeah, this is the same issue as https://github.com/stan-dev/cmdstanr/issues/299
It seems that vroom, or how we use vroom, has issues with that many parameters (has issues = runs very slowly).

Working on replacing it with fread very soon.

6 Likes

Thanks, and nice recommendation. Here is what I tried. It reduced the time to load the draws from 50s with fit$draws() to 3s (the model itself only took 23s to fit with 4 chains, 1000 warm-up and 1000 sampling iterations):

library(data.table)  # fread()
library(dplyr)       # bind_rows(), select(), %>%
cmdstanfiles <- list()
# read each chain's CSV, letting grep strip Stan's '#' comment lines
for (f in fit$output_files()) {
  cmdstanfiles[[f]] <- fread(cmd = paste0("grep -v '^#' ", f))
}
# stack the chains row-wise and keep only the columns for this variable
var_of_interest <- cmdstanfiles %>% bind_rows() %>%
  dplyr::select(starts_with("var_of_interest")) %>% as.matrix()
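If you then want posterior-style summaries of that matrix, one option is something like this (a sketch - as_draws_matrix treats all rows as a single chain, so chain-aware diagnostics like rhat won’t be meaningful here):

draws <- posterior::as_draws_matrix(var_of_interest)
posterior::summarise_draws(draws, "mean", "median")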