Summary method slow for large models

I’m building a large model (millions of rows, tens of thousands of parameters) and calling the summary method after the model is done fitting takes 30-60 minutes.

Technically, I’m doing this in cmdstanr, so I’ll show the simple code below, but I’ve had similar experiences in rstan.

mod <- cmdstan_model("...")
fit <- mod$sample(data = ...)
mod_summary <- fit$summary() # this takes surprisingly long

Is it the diagnostics that are taking a while? I can’t imagine the means and quantiles would take very long, even with the size of the model. Is there anything I can do to speed it up? Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?

2 Likes

how many chains are you running?

do you want to get means/medians for all variables or just some of them?

cmdstanr feeds the assembled sample to the posterior package’s summarize_draws function (see “Summaries of draws objects — draws_summary” in the posterior docs)

Can I opt out of some of the features of the summary if (for now) I’m only interested in the posterior means/medians?

yes - you should be able to pass in the args “mean” and “median”
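for example, something like this (a sketch - fit$summary() forwards extra arguments to posterior’s summarise_draws, so naming only the measures you want should skip the expensive diagnostics):

# compute only posterior means and medians for all variables
mod_summary <- fit$summary(variables = NULL, "mean", "median")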

you could try timing how long calling the draws function takes, then immediately call summarize_draws - if it’s an I/O bottleneck, the former will account for the 30-60 minutes.
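e.g., a rough sketch:

system.time(draws <- fit$draws())                    # I/O: reads the CSVs into memory
system.time(s <- posterior::summarise_draws(draws))  # pure computation on the draws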

and if you’re only interested in select variables, the draws function lets you specify which variables to read in; this should speed up I/O and use far less memory - at which point, you can use methods in the posterior package directly.
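e.g. (a sketch - the variable names here are hypothetical):

draws <- fit$draws(variables = c("alpha", "beta"))   # reads only these columns
posterior::summarise_draws(draws, "mean", "median")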

2 Likes

I was just messing with a 7k-parameter model. $draws() was pretty slow until I did this.

I think this only loads the beta variable and only computes the requested summaries on it:

library(posterior)  # summarise_draws()
library(magrittr)   # %>%
fit$draws("beta") %>%
  summarise_draws(median, mad, rhat, ess_bulk)

summarise_draws comes from the posterior package (https://github.com/stan-dev/posterior)

3 Likes

The model is 4 chains with 1000 warm-up and 1000 post-warm-up iterations each.
Unfortunately, I need means and medians for basically every variable, but I will try passing in those args.

Do you have any insight into why the draws function would be slow – is it not simply reading the .csvs created during the fit? Am I just underestimating the data size because I’ve never actually loaded something that was 100k x 4000?

FWIW, I usually use the fread function from the data.table package on the unfinished csvs when I’m interested in the samples before the model is done fitting. While it’s not “fast,” I think it’s a good bit quicker than $draws().

2 Likes

Internally, the draws method calls the read_cmdstan_csv() function, which currently uses the vroom package to read in the csv. It’s faster than the standard read.csv(), but yeah, I think it’s slower than fread(). There’s been some related discussion here:

2 Likes

multiply again by 8 to get the number of bytes: 100k parameters x 4000 draws x 8 bytes per double - that’s 3.2 GB

1 Like

As a workaround, if fread is faster, then I think you can do the following (a sketch follows the list):

  1. Load up the individual chains with fread
  2. Glue them together with posterior::bind_draws(df1, df2, along = "chain") (so posterior can know they are different chains)
  3. Pass that into posterior::summarize_draws() to do the summarizing (or do custom summarizing or whatever)
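Here’s a rough sketch of those steps (untested; assumes the posterior and data.table packages, and note the column names will use Stan’s dotted CSV notation, e.g. beta.1 rather than beta[1]):

library(data.table)
library(posterior)
# 1. read each chain's CSV with fread, skipping Stan's '#' comment lines
chains <- lapply(fit$output_files(), function(f)
  as_draws_df(fread(cmd = paste0("grep -v '^#' ", f))))
# 2. glue the chains together so posterior knows they are separate chains
draws <- do.call(bind_draws, c(chains, list(along = "chain")))
# 3. summarize (or do custom summarizing)
summarize_draws(draws, "mean", "median")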

This makes me think it might be handy to have a thin option on our reading functions, but I don’t know how that would work. If the analysis functions start getting choked up, you could do some thinning after step 1 above (that will make your ESS estimates wonky, but presumably you don’t need those for your analysis).
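e.g. with posterior’s thin_draws (a sketch):

draws_thin <- posterior::thin_draws(draws, thin = 10)  # keep every 10th draw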

3 Likes

reviving this thread because I’m now experiencing some long waits for the summary method on an optimize object. Unless I’m mistaken, there aren’t giant csvs to be read in since we’re optimizing and not sampling – any clue why this method is also slow or ways I might speed it up?
@jonah @mitzimorris

hi Walker,

is the optimize method itself reasonably fast?

why call summary on an optimize object?
the optimize method returns a single vector containing the penalized MLE (i.e., the posterior mode).

it looks like CmdStanR hands this single draw off to the posterior package’s summarize_draws method - that’s all I know - I hope one of the R devs can help out here.

Yes, the optimize method takes only a few minutes (large model, 300K parameters).

I was using the summary method because the intro to cmdstanr does so:

fit_mle <- mod$optimize(data = data_list, seed = 123)
fit_mle$summary()

Although it appears I can just call $mle() to get a vector of the solution.
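e.g. (a sketch; “sigma” is just a hypothetical parameter name):

theta_hat <- fit_mle$mle()  # named numeric vector of point estimates
theta_hat["sigma"]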

how long does it take to get a response via mle?
both summary and mle read in the csv file. if summary is taking way longer than mle, this needs to be investigated. (just opened an issue on this: https://github.com/stan-dev/cmdstanr/issues/315)

1 Like

Yeah, this is the same issue as https://github.com/stan-dev/cmdstanr/issues/299
It seems that vroom, or how we use vroom, has issues with that many parameters (has issues = runs very slowly).

Working on replacing it with fread very soon.

6 Likes

Thanks, and nice recommendation. Here is what I tried. It reduced the time to load the draws from 50s with fit$draws() to 3s (the model itself only took 23s to fit with 4 chains, 1000 warm-up and 1000 sampling iterations):

library(data.table)  # fread()
library(dplyr)       # bind_rows(), select(), %>%
cmdstanfiles <- list()
# read each chain's CSV, letting grep strip Stan's '#' comment lines
for (f in fit$output_files()) {
  cmdstanfiles[[f]] <- fread(cmd = paste0("grep -v '^#' ", f))
}
# stack the chains row-wise and keep only the columns for this variable
var_of_interest <- cmdstanfiles %>% bind_rows() %>%
  dplyr::select(starts_with("var_of_interest")) %>% as.matrix()
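If you then want posterior-style summaries of that matrix, one option is something like this (a sketch - as_draws_matrix treats all rows as a single chain, so chain-aware diagnostics like rhat won’t be meaningful here):

draws <- posterior::as_draws_matrix(var_of_interest)
posterior::summarise_draws(draws, "mean", "median")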