Various observations on rstan, cmdstanr, pystan, and cmdstanpy after teaching with all four in parallel

This may sound pedantic, but I mention it only because you say the rationale for iter_* is that it’s the number of iterations… it’s not. It’s the number of accepted proposals. There are additional iterations (not counting leapfrog steps) whose proposals are rejected. It’s like a while loop instead of a for loop.
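
To make the two readings concrete, here is an illustrative R sketch; transition, propose, and accept are hypothetical placeholders, and the rest of the thread works out which reading matches what Stan actually does:

# Reading A: iter counts transitions, accepted or not (a for loop);
# a rejected proposal stores the unchanged state as a draw.
run_fixed_iterations <- function(iter, transition, init) {
  draws <- numeric(iter)
  state <- init
  for (i in seq_len(iter)) {
    state <- transition(state)  # may return the state unchanged on rejection
    draws[i] <- state
  }
  draws
}

# Reading B (the claim above): keep iterating until iter proposals
# have been accepted (a while loop); only accepted proposals are stored.
run_until_accepted <- function(iter, propose, accept, init) {
  draws <- numeric(iter)
  state <- init
  n_accepted <- 0
  while (n_accepted < iter) {
    proposal <- propose(state)
    if (accept(state, proposal)) {
      state <- proposal
      n_accepted <- n_accepted + 1
      draws[n_accepted] <- state
    }
  }
  draws
}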

No, this is not true. See:

library(rstan)
#> Loading required package: StanHeaders
#> rstan (Version 2.26.0.9000, GitRev: 2e1f913d3ca3)
#> For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores()).
#> To avoid recompilation of unchanged Stan programs, we recommend calling
#> rstan_options(auto_write = TRUE)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:rstan':
#> 
#>     extract
res <- stan(
  model_code = 'data {} parameters{real x;} model{x ~ std_normal();}',
  chains = 1,
  control = list(adapt_delta = 0.5)
)
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 9e-06 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 0.007 seconds (Warm-up)
#> Chain 1:                0.006 seconds (Sampling)
#> Chain 1:                0.013 seconds (Total)
#> Chain 1:
#> Warning: There were 78 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
#> Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#bulk-ess
#> Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
#> Running the chains for more iterations may help. See
#> https://mc-stan.org/misc/warnings.html#tail-ess
samples <- rstan::extract(res, pars = 'x', permuted = FALSE)

# if every one of these draws were an 'accepted' proposal, the unique
# count below would be 1000, since each accepted proposal necessarily
# differs from the previous point
samples %>%
  magrittr::extract(1 : 1000, ,) %>% 
  round(digits = 8) %>% # floating point error will do funny things otherwise
  unique() %>% 
  length()
#> [1] 230

Created on 2021-10-07 by the reprex package (v2.0.1)

adapt_delta is the target acceptance rate, so you can make length(unique(samples)) land anywhere between 1 and 1000 by lowering its value.
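
For example (a sketch, not part of the reprex above, reusing the same one-parameter model): lowering adapt_delta should shrink the unique count while the number of stored draws stays fixed at 1000.

count_unique <- function(fit) {
  draws <- rstan::extract(fit, pars = 'x', permuted = FALSE)
  length(unique(round(draws[1:1000, 1, 1], digits = 8)))
}

model_code <- 'data {} parameters{real x;} model{x ~ std_normal();}'
fit_low  <- stan(model_code = model_code, chains = 1,
                 control = list(adapt_delta = 0.1))
fit_high <- stan(model_code = model_code, chains = 1,
                 control = list(adapt_delta = 0.99))
count_unique(fit_low)   # far fewer than 1000 unique values
count_unique(fit_high)  # close to 1000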

I had to convince myself that this was also true for cmdstan:

library(cmdstanr)
#> This is cmdstanr version 0.4.0.9000
#> - Online documentation and vignettes at mc-stan.org/cmdstanr
#> - CmdStan path set to: /Users/amanderson/.cmdstanr/cmdstan-2.26.0
#> - Use set_cmdstan_path() to change the path
#> 
#> A newer version of CmdStan is available. See ?install_cmdstan() to install it.
#> To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.
library(magrittr)

cmdstan_model <- cmdstan_model(
  write_stan_file('data {} parameters{real x;} model{x ~ std_normal();}')
)

run_model <- cmdstan_model$sample(
  chains = 1,
  iter_warmup = 200,
  iter_sampling = 1000,
  adapt_delta = 0.5
)
#> Running MCMC with 1 chain...
#> 
#> Chain 1 Iteration:    1 / 1200 [  0%]  (Warmup) 
#> Chain 1 Iteration:  100 / 1200 [  8%]  (Warmup) 
#> Chain 1 Iteration:  200 / 1200 [ 16%]  (Warmup) 
#> Chain 1 Iteration:  201 / 1200 [ 16%]  (Sampling) 
#> Chain 1 Iteration:  300 / 1200 [ 25%]  (Sampling) 
#> Chain 1 Iteration:  400 / 1200 [ 33%]  (Sampling) 
#> Chain 1 Iteration:  500 / 1200 [ 41%]  (Sampling) 
#> Chain 1 Iteration:  600 / 1200 [ 50%]  (Sampling) 
#> Chain 1 Iteration:  700 / 1200 [ 58%]  (Sampling) 
#> Chain 1 Iteration:  800 / 1200 [ 66%]  (Sampling) 
#> Chain 1 Iteration:  900 / 1200 [ 75%]  (Sampling) 
#> Chain 1 Iteration: 1000 / 1200 [ 83%]  (Sampling) 
#> Chain 1 Iteration: 1100 / 1200 [ 91%]  (Sampling) 
#> Chain 1 Iteration: 1200 / 1200 [100%]  (Sampling) 
#> Chain 1 finished in 0.0 seconds.
samples_cmdstan <- run_model$draws('x')
samples_cmdstan %>%
  magrittr::extract(1 : 1000, ,) %>% 
  round(digits = 8) %>% # floating point error will do funny things otherwise
  unique() %>% 
  length()
#> [1] 589

Created on 2021-10-07 by the reprex package (v2.0.1)
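
As a cross-check (a sketch, not part of the reprex): a rejected proposal keeps the state bit-identical, so consecutive exact repeats can be counted directly, without the rounding step.

x <- as.vector(run_model$draws('x'))  # 1 chain, 1 variable -> plain vector
sum(diff(x) == 0)         # iterations that kept the previous state
1000 - sum(diff(x) == 0)  # should essentially match the unique count above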

I see, the difference is in whether we count the unique draws or all draws.

Here’s the description of the algorithm (note the use of the term “iteration”), from 15.1 Hamiltonian Monte Carlo | Stan Reference Manual:

The Hamiltonian Monte Carlo algorithm starts at a specified initial set of parameters θ; in Stan, this value is either user-specified or generated randomly. Then, for a given number of iterations, a new momentum vector is sampled and the current value of the parameter θ is updated using the leapfrog integrator with discretization time ϵ and number of steps L according to the Hamiltonian dynamics. Then a Metropolis acceptance step is applied, and a decision is made whether to update to the new state (θ∗, ρ∗) or keep the existing state.

Divergences are those iterations where the decision is made to keep the existing state.

The thin parameter controls the rate at which these updates are reported in the output Stan CSV file. By default, every update is reported; thinning is not encouraged (cf. How not to compare the speed of Stan to something else | Statistical Modeling, Causal Inference, and Social Science).
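
For instance (a sketch reusing the cmdstan_model object compiled in the reprex above): thin = 5 keeps every 5th post-warmup update, so 1000 sampling iterations yield 200 stored draws.

fit_thinned <- cmdstan_model$sample(
  chains = 1,
  iter_warmup = 200,
  iter_sampling = 1000,
  thin = 5
)
dim(fit_thinned$draws('x'))  # 200 x 1 x 1: draws, chains, variables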

For each reported update, the CSV output contains the updated values of all model parameters, transformed parameters, and generated quantities variables, as well as the sampler parameters:

    'lp__',  'accept_stat__',  'stepsize__',  'treedepth__',  'n_leapfrog__',  'divergent__',  'energy__',

I would say that the unique draws in the resulting sample are those where divergent__ == 0 (false). If several draws happen to have the same value for a parameter, that’s either an artifact of the fitted model or a lack of numerical resolution. But it depends on what you mean by “unique”.

Given this, I don’t think it’s correct to say that iterations are the number of accepted proposals. Iterations are the sampler transitions, which are either an update based on an accepted proposal, or no update (a divergent transition).
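
Schematically (an illustrative R sketch of a generic Metropolis transition, not Stan’s actual HMC implementation):

metropolis_transition <- function(state, log_density, propose) {
  proposal <- propose(state)
  log_ratio <- log_density(proposal) - log_density(state)
  if (log(runif(1)) < log_ratio) proposal else state  # the iteration counts either way
}

Either branch produces exactly one stored draw, which is why the iteration count is fixed in advance rather than depending on the acceptance rate.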

Finally, I very much appreciate this discussion, as it nicely illustrates the problem of choosing names in the API which are concise, correct, and helpful to the user.

An iteration can result in no change to the state without being declared ‘divergent’: the Metropolis step can reject a proposal, but the difference between the Hamiltonian at the current value and the Hamiltonian at the ‘proposed value’ may not be large enough to flag the trajectory as divergent (footnote 24 in 15.5 Divergent transitions | Stan Reference Manual suggests that the difference can be as large as 10^3 before an iteration/trajectory is declared divergent).
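
One can check this on the rstan fit res from earlier in the thread (a sketch; exact equality is safe here because a rejected proposal keeps the state bit-identical):

sp <- get_sampler_params(res, inc_warmup = FALSE)[[1]]
x  <- rstan::extract(res, pars = 'x', permuted = FALSE)[, 1, 1]
kept <- diff(x) == 0                    # iteration repeated the previous state
sum(kept & sp[-1, 'divergent__'] == 0)  # rejected, yet not flagged divergent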

Likewise, I’m reasonably certain that the sampler can change the state even if the trajectory is divergent. If the trajectory diverges immediately, the probability of this happening is very low, but if the trajectory lingers for a while in a region of non-negligible posterior density before diverging, there’s a decent chance that the sampler will accept an update from it. That is, the Metropolis step is applied as usual, regardless of whether the trajectory is flagged as divergent.
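
And the converse check (a sketch on the same rstan fit res): divergent iterations in which the state nevertheless changed.

sp <- get_sampler_params(res, inc_warmup = FALSE)[[1]]
x  <- rstan::extract(res, pars = 'x', permuted = FALSE)[, 1, 1]
moved <- diff(x) != 0                   # iteration updated the state
sum(moved & sp[-1, 'divergent__'] == 1) # divergent, yet the state still moved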
