Hello everybody.

I noticed that for big models cmdstanr seemed a lot slower at summarizing than rstan, so I ran a test on a dummy model with `N = 100`, 4 chains, and 4000 warmup and 4000 sampling iterations, and saved the results in .csv files.

```
data {
  int<lower=0> N;
}
parameters {
  vector[N] big_mat[N];
}
model {
  for (n in 1:N)
    big_mat[n] ~ std_normal();
}
```
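For completeness, this is roughly how I generated the CSV files (the model file name and output directory are placeholders):

```r
library(cmdstanr)

# Compile the dummy model (file name is a placeholder)
mod <- cmdstan_model("big_mat.stan")

# 4 chains, 4000 warmup + 4000 sampling iterations, N = 100
fit <- mod$sample(
  data = list(N = 100),
  chains = 4,
  iter_warmup = 4000,
  iter_sampling = 4000
)

# Copy the CmdStan output CSVs into the working directory
fit$save_output_files(dir = ".")
```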

It seems that the `posterior::summarise_draws` function is very slow.

I ran the following comparisons:

```
## Read-ins
> system.time({stanfit_by_csvs <- read_stan_csv(grep("\\.csv", dir(), value = TRUE))})
user system elapsed
70.012 0.583 70.583
> system.time({cmdstan_fit_by_csvs <- as_cmdstan_fit(grep("\\.csv", dir(), value = TRUE))})
user system elapsed
14.876 3.111 15.497
> system.time({read_cmdstan_csv_input <- read_cmdstan_csv(grep("\\.csv", dir(), value = TRUE))})
user system elapsed
15.340 2.615 14.551
```

Rstan is slower to read in the files and create a fit object, but for the summary it's much faster:

```
## Summaries
> system.time({rstan_summary <- rstan::summary(stanfit_by_csvs)})
user system elapsed
56.724 0.134 58.782
> system.time({cmdstanr_summary <- cmdstan_fit_by_csvs$summary()})
user system elapsed
232.343 0.158 232.512
> system.time({posterior_summarise_draws <- posterior::summarise_draws(read_cmdstan_csv_input$post_warmup_draws)})
user system elapsed
231.480 0.064 231.503
```

I made a "manual" version and it's a bit faster:

```
> system.time({manual_summary <- posterior::as_draws_df(read_cmdstan_csv_input$post_warmup_draws) %>%
+   pivot_longer(-c(.iteration, .chain, .draw), names_to = "variable") %>%
+   group_by(variable) %>%
+   summarize(
+     across(
+       value,
+       list(
+         mean = mean,
+         median = median,
+         sd = sd,
+         mad = mad,
+         q5 = function(x) quantile(x, 0.05),
+         q95 = function(x) quantile(x, 0.95),
+         rhat = posterior::rhat,
+         ess_bulk = posterior::ess_bulk,
+         ess_tail = posterior::ess_tail
+       ),
+       .names = "{.fn}"
+     )
+   )})
user system elapsed
159.625 1.732 161.333
```
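One thing I noticed while writing the manual version: inside `summarize()`, each measure only sees `value` as a plain numeric vector, so `posterior::rhat`/`ess_bulk`/`ess_tail` presumably can't recover the chain layout. A chain-aware variant (just a sketch, not timed) would work on the draws array directly, one iterations × chains matrix per variable:

```r
library(posterior)

# read_cmdstan_csv() returns the post-warmup draws as a draws_array
# with dimensions iterations x chains x variables.
draws <- read_cmdstan_csv_input$post_warmup_draws

# Apply ess_bulk() to each variable's iterations x chains matrix,
# so the chain structure is preserved.
ess_chain_aware <- apply(draws, 3, posterior::ess_bulk)
```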

So I'm wondering:

- What's going on?
- How can rstan be so much faster?
- Also: when rstan calculates the summaries, it (to me) magically stores the result somewhere and all subsequent calculations (including diagnostics) are very fast. How does that work, and why can't cmdstanr do that?
- Is there a way to speed up the manual cmdstanr version (for example, when I don't want to work with rstan for some reason)?
- Basically, what is the fastest way to get the summary statistics for a set of parameters of interest AND ALSO the sampler diagnostics?
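On the last point, the only workaround I can think of is restricting the set of measures, since I suspect most of the time goes into `rhat`/`ess_bulk`/`ess_tail`. A sketch of what I mean (I haven't timed this carefully; both `$summary()` and `summarise_draws()` should forward extra arguments as measure functions):

```r
library(cmdstanr)
library(posterior)

# Only compute cheap measures; extra arguments are passed on
# to posterior::summarise_draws() as the measures to compute.
cheap <- cmdstan_fit_by_csvs$summary(NULL, "mean", "median", "sd")

# Same idea directly on the draws object, with custom quantiles:
cheap2 <- summarise_draws(
  read_cmdstan_csv_input$post_warmup_draws,
  mean, sd,
  ~quantile(.x, probs = c(0.05, 0.95))
)

# Sampler diagnostics (divergences, treedepth, E-BFMI) are a
# separate call, if your cmdstanr version has it:
diag <- cmdstan_fit_by_csvs$diagnostic_summary()
```

(I believe newer posterior versions also have a `.cores` argument to `summarise_draws()`, which might help here too.)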

**One more strange thing:**

All versions seem to differ in the calculated ESS (I didn't include the `cmdstanr_summary` version here since it's the same as the `posterior_summarise_draws` one). What is going on there??

```
> manual_summary %>% arrange(match(variable, posterior_summarise_draws$variable)) %>% select(variable, ess_bulk, ess_tail) %>% print(n = 5)
# A tibble: 10,001 × 3
variable ess_bulk ess_tail
<chr> <dbl> <dbl>
1 lp__ 4523. 6340.
2 big_mat[1,1] 38982. 10783.
3 big_mat[2,1] 37777. 10699.
4 big_mat[3,1] 36204. 10589.
5 big_mat[4,1] 36468. 10457.
# … with 9,996 more rows
> posterior_summarise_draws %>% select(variable, ess_bulk, ess_tail) %>% print(n = 5)
# A tibble: 10,001 × 3
variable ess_bulk ess_tail
<chr> <dbl> <dbl>
1 lp__ 4534. 6347.
2 big_mat[1,1] 39649. 10829.
3 big_mat[2,1] 38682. 10710.
4 big_mat[3,1] 36529. 10628.
5 big_mat[4,1] 36782. 10448.
# … with 9,996 more rows
> rstan_summary$summary %>% as_tibble(rownames = "variable") %>% arrange(match(variable, posterior_summarise_draws$variable)) %>% select(variable, n_eff) %>% print(n = 5)
# A tibble: 10,001 × 2
variable n_eff
<chr> <dbl>
1 lp__ 4516.
2 big_mat[1,1] 39265.
3 big_mat[2,1] 38295.
4 big_mat[3,1] 36844.
5 big_mat[4,1] 36501.
# … with 9,996 more rows
```