Fewer generated quantities than parameter samples?

nerutenbeck · November 4, 2020, 4:09pm

This is a follow-up to Faster / better loading of sampler diagnostics in cmdstanr?.

I’ve successfully split out the generated quantities section from the model fitting, but now I’m wondering if it is possible to generate fewer samples via the $generated_quantities() call than iter_sampling originally supplied in the call to $sample(). The specific use case is that I might want more iterations to achieve convergence, but actually need fewer generated quantities in application. From looking at the source code, it doesn’t seem that there’s a way to do it, but on the off chance there is… If not, may I humbly submit a feature request?

Thanks again!

rok_cesnovar · November 4, 2020, 6:11pm

In order to do that you need to write the draws to CSV, which you can do with the following function:

draws_to_csv <- function (draws) {
  # make dummy sampler diagnostics, needed because cmdstan requires those columns
  sampler_diagnostic_names <- c("accept_stat__", "stepsize__", "treedepth__", "n_leapfrog__", "divergent__", "energy__")
  n <- posterior::niterations(draws)
  n_chains <- posterior::nchains(draws)
  dummy_sampler_diag <- rep(0, n * length(sampler_diagnostic_names) * n_chains)
  dim(dummy_sampler_diag) <- c(n, n_chains, length(sampler_diagnostic_names))
  dummy_sampler_diag <- posterior::as_draws_array(dummy_sampler_diag)
  posterior::variables(dummy_sampler_diag) <- sampler_diagnostic_names
  
  # the columns must also be in order lp__, sampler_diagnostics, parameters
  variables <- posterior::variables(draws)
  draws <- posterior::subset_draws(
    posterior::bind_draws(draws, dummy_sampler_diag, along = "variable"),
    variable = c("lp__", sampler_diagnostic_names, variables[variables != "lp__"])
  )
  variables <- posterior::variables(draws)
  chains <- posterior::chain_ids(draws)
  # we generate file names for temporary CSV files
  paths <- cmdstanr:::generate_file_names(basename = "fittedParams", ids = chains)
  paths <- file.path(tempdir(), paths)
  chain <- 1
  for (path in paths) {
    # write iterations (required by cmdstan) and variables names
    write(
      paste0("# num_samples = ", n, "\n", paste0(cmdstanr:::unrepair_variable_names(variables), collapse = ",")),
      file = path,
      append = FALSE
    )
    utils::write.table(
      posterior::subset_draws(draws, chain = chain),
      sep = ",",
      file = path,
      col.names = FALSE,
      row.names = FALSE,
      append = TRUE
    )
    chain <- chain + 1
  }
  paths
}

Example of how to use:

library(cmdstanr)

stan_file_bernoulli <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
data_bernoulli <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.data.json")

stan_ppc_file_bernoulli <- "~/Desktop/cmdstanr/tests/testthat/resources/stan/bernoulli_ppc.stan"
data_ppc_bernoulli <- "~/Desktop/cmdstanr/tests/testthat/resources/data/bernoulli_ppc.data.json"

mod <- cmdstan_model(stan_file_bernoulli)
fit <- mod$sample(data_bernoulli, chains = 4, iter_sampling = 2000)


draws <- posterior::subset_draws(fit$draws(), iteration = 1:10)

mod_ppc <- cmdstan_model(stan_ppc_file_bernoulli)
fit_gq <- mod_ppc$generate_quantities(fitted_params = draws_to_csv(draws), data_ppc_bernoulli)
print(dim(fit_gq$draws())) #should be 10 4 11

All these gymnastics with the draws are required because cmdstan is very peculiar on how the CSV must be organized. But @mitzimorris is working on that (https://github.com/stan-dev/cmdstan/issues/943) so that should be much simpler soon.

rok_cesnovar · November 4, 2020, 6:19pm

Also thanks for the feedback!

Please open an issue if you feel fittedParams in $generate_quantities() should support the draws array. Based on your post I think it makes sense to support this.

jonah · November 11, 2020, 12:43am

Yeah I agree! @nerutenbeck If you want to open an issue with a feature request that would be great.

nerutenbeck · November 11, 2020, 1:16am

I opened an issue here: https://github.com/stan-dev/cmdstanr/issues/342

Thanks @rok_cesnovar and @jonah!

nerutenbeck · November 11, 2020, 1:01pm

As @mitzimorris pointed out, what I really need to do is remember that you control the number of iterations to achieve convergence with iter_warmup and there is no need to go overkill on iter_sampling. So specifying iter_sampling carefully plus use of posterior::thin_draws()will get the behavior I want.

Topic		Replies	Views
Question regarding generated quantities matrix output from cmdstanr / posterior Interfaces cmdstanr	4	503	November 6, 2020
Faster / better loading of sampler diagnostics in cmdstanr? Interfaces cmdstanr	4	524	November 3, 2020
Standalone generated quantities usage help : cmdstanr CmdStan techniques , fitting-issues , algorithms	3	989	April 23, 2020
Adding standalone generated quantities option to cmdstan (and rstan and pystan) Interfaces	9	1210	December 15, 2017
Generated Quantities returns error "Mismatch between model and fitted_parameters csv file" CmdStan cmdstanr	13	1714	September 6, 2020

Fewer generated quantities than parameter samples?

Related topics