Cmdstan_diagnose() is slow with log_lik included in the model

zcai · March 6, 2022, 11:01pm

Hi there,

I am using cmdstan_diagnose() to show the diagnostic results. However, it takes much longer to print when I calculate log_lik in the model than when I don't. I might be wrong, but I guess this is because there are many data points, each with its own log_lik element, and cmdstan_diagnose() is also “diagnosing” log_lik. My gut feeling is that we should not consider log_lik in cmdstan_diagnose(), since its values are generated from the posterior draws. Could you please teach me a simple way to exclude log_lik from cmdstan_diagnose()? It seems there is no such option in the cmdstan_diagnose() function. Should I manually remove log_lik from the fit results before running it? I believe the same applies to cmdstan_summary().

Thank you very much.

jonah · March 10, 2022, 10:53pm

Yeah that’s probably the reason it’s so slow.

Yeah that’s right. cmdstan_diagnose() and cmdstan_summary() are just calling underlying methods from CmdStan itself, which doesn’t provide a way to select variables. However, if you use fit$summary() (which uses the posterior package) then you can specify which variables to summarize. For example:

fit$summary(variables = c("alpha", "beta"))

or to include everything except log_lik:

exclude_log_lik <- grep("log_lik", fit$metadata()$model_params, value = TRUE, invert = TRUE)
fit$summary(variables = exclude_log_lik)
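
As a hedged aside, the pattern "log_lik" matches any parameter name that merely contains that string. A minimal sketch of a stricter filter follows; the helper name `drop_variable` and the example parameter names are hypothetical, and `fit` is assumed to be an existing CmdStanR fit object.

```r
# Hypothetical helper: drop a variable and all of its indexed elements
# (e.g. "log_lik", "log_lik[1]", "log_lik[2,3]") from a vector of parameter names.
drop_variable <- function(params, name) {
  # Anchor the pattern so that only `name` itself, optionally followed by
  # an index in square brackets, is matched.
  pattern <- paste0("^", name, "(\\[.*\\])?$")
  grep(pattern, params, value = TRUE, invert = TRUE)
}

# Made-up parameter names, for illustration only
params <- c("lp__", "alpha", "beta[1]", "beta[2]",
            "log_lik[1]", "log_lik[2]", "my_log_lik_scale")
kept <- drop_variable(params, "log_lik")
print(kept)

# With a real fit object (assumed to exist), this would be used as:
# keep <- drop_variable(fit$metadata()$model_params, "log_lik")
# fit$summary(variables = keep)
```

Here "my_log_lik_scale" is kept because it is a different variable, whereas the unanchored grep would have dropped it too.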

This will give you posterior summary statistics, R-hat, and effective sample sizes, but not divergence and treedepth warnings. Those are coming in this pull request:

github.com/stan-dev/cmdstanr

New method summarizing sampler diagnostics and warnings

stan-dev:master ← stan-dev:expose-diagnostics

opened 06:07PM - 04 Nov 21 UTC

jgabry

+561 -117

#### Submission Checklist

- [x] Run unit tests
- [x] Declare copyright holde…r and agree to license (see below)

#### Summary

* Closes #205
* Builds off of PR #500 from @jsocolar (so merging this one would in effect merge that one too).

Opening a draft PR to discuss introducing a new method to summarize the sampler diagnostics and regenerate the warning messages. The returned values are vectors of diagnostics per chain (e.g. divergences per chain).

```r
# note: the warning messages will change once we have the website to point people to
> fit$diagnose_sampler()
Warning: 89 of 4000 (2.0%) transitions ended with a divergence.
This may indicate insufficient exploration of the posterior distribution.
Possible remedies include:
  * Increasing adapt_delta closer to 1 (default is 0.8)
  * Reparameterizing the model (e.g. using a non-centered parameterization)
  * Using informative or weakly informative prior distributions

$num_divergent
[1] 34 24 11 20

$num_max_treedepth
[1] 0 0 0 0

$ebfmi
[1] 0.4277230 0.3856897 0.3388729 0.3541327
```

### Things I'd like feedback on

Tagging some people who have expressed interest in this: @rok-cesnovar @jsocolar @martinmodrak @avehtari

#### The method name

Options:

* `diagnose_sampler()`
* `diagnostic_summary()`
* `check_diagnostics()`
* other suggestions?

#### Should it include R-hat, ESS?

Does it make sense to include both together here? On the one hand it's nice to have all diagnostics together. On the other hand there are important differences:

* The HMC/NUTS diagnostics are specific to the Markov chains not the individual parameters.
* R-hat and ESS are parameter-specific diagnostics.
* Calculating all of the R-hats and ESS values therefore takes a lot longer than calculating these HMC/NUTS diagnostics.

Right now they're separate: the HMC/NUTS diagnostics are handled by this method and the R-hat and ESS diagnostics are handled by `fit$summary()`.

Also note: eventually the posterior package may provide functionality for diagnostics, but this method will be useful until then (or potentially could make use of what posterior provides eventually).

#### Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): **Columbia University**

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

which will be merged soon but is already usable if you want to try it.

zcai · March 16, 2022, 5:23pm

Hi @jonah, thanks for your suggestions. May I ask if you have any suggestions for ignoring log_lik when running the diagnostics? Should I use your method to export the posterior draws without log_lik and then apply a user-defined diagnostic function? But then, as you mentioned, I would lose the divergence info. Do you think it would be possible to add a new feature that allows something like

exclude_log_lik <- grep("log_lik", fit$metadata()$model_params, value = TRUE, invert = TRUE)
fit$cmdstan_diagnose(variables = exclude_log_lik)

In rstan, we have stan_diag() (RStan diagnostic plots), so I could probably remove log_lik from the fitting result before applying this function. However, it requires an rstan fit object, and converting CmdStan results with a large number of parameters to an rstan fit object usually takes some time.

Thanks,
ZC

jonah · March 17, 2022, 8:44pm

We can’t do this in CmdStanR without a change in CmdStan because cmdstan_diagnose() is just calling CmdStan’s diagnose utility. But we just merged a pull request on the master branch of CmdStanR that adds a method for getting other diagnostics like divergences. So if you install CmdStanR from GitHub with

devtools::install_github("stan-dev/cmdstanr")

you can use this:

# this will tell you if you have divergence, treedepth or E-BFMI issues and shouldn't be
# affected by a large number of log_lik elements (these are per-chain sampler diagnostics,
# not per-variable diagnostics like R-hat and ESS)
# see ?diagnostic_summary for details
fit$diagnostic_summary()

And then you can use fit$summary() to get the R-hat and ESS values like I mentioned above.
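
The two steps can be combined in a small wrapper. This is only a sketch under stated assumptions: `check_fit` is a hypothetical name, `fit` is assumed to be a CmdStanR MCMC fit object from a version that provides `$diagnostic_summary()`, and the strings "rhat", "ess_bulk" and "ess_tail" are assumed to resolve to the corresponding posterior package functions when passed through `fit$summary()`.

```r
# Hypothetical wrapper around the two calls discussed above.
check_fit <- function(fit, exclude = "log_lik") {
  # Per-chain sampler diagnostics (divergences, treedepth, E-BFMI);
  # their cost does not depend on how many variables the model has.
  sampler <- fit$diagnostic_summary()

  # Per-variable convergence diagnostics, skipping every parameter whose
  # name matches `exclude`.
  keep <- grep(exclude, fit$metadata()$model_params, value = TRUE, invert = TRUE)
  convergence <- fit$summary(variables = keep, "rhat", "ess_bulk", "ess_tail")

  list(sampler = sampler, convergence = convergence)
}

# Usage, assuming `fit` already exists:
# res <- check_fit(fit)
# res$sampler$num_divergent
# res$convergence
```

Restricting the summary to the three convergence measures also avoids computing means, quantiles and so on when only the diagnostics are of interest.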

zcai · March 17, 2022, 8:51pm

Sounds great. Thanks a lot.
