Saving & reusing adaptation in cmdstanr

@mitzimorris could you elaborate on why this isn’t a bug? I don’t understand your abbreviated comment closing my bug report. If it were simply about the first sampling iteration, using the initial values, then what am I doing wrong in setting the initial values and wouldn’t you expect just the first few iterations to be divergent? I observe a very high proportion of samples going divergent in the second data & model example I provided.

2 Likes

first off, apologies for the too terse message. I tried to find a better explanation than what I offered. Perhaps the discussion here would be useful: Current state of checkpointing in Stan

this issue comes up alot and it is indeed something we need to figure out.

Hm, as I understand it checkpointing is about achieving capture/reinstatement of all rng states, which if achieved would permit yield identical samples between a run with warmup and sampling together and two runs, one with warmup and one with sampling. However, absent such checkpointing, I expected that capture/reinstatement of at least the inv_metric and step_size, plus use of the final warmup draw as inits, would at least place the sampling run in the typical set with HMC parameters that should yield not-identical-but-roughly-equivalent performance in sampling, as measured by the rate of encountering divergences, ESS, Rhat, etc. But what I observe, particularly with the second data/model I posted on the issue, is that while the combined warmup+sampling runs almost never encounter divergences, the split warmup-then-sampling runs nearly always do, and when they do they have divergences for the majority of samples.

Where is my understanding faulty?

the sampler takes the initial parameter values and tries to use them, but if for some reason
those inits fail to meet constraints, it will try to use other values, therefore it starts in a bad place.
as you don’t do any further adaptation, it continues to have problems.
perhaps @bbbales2 has more insight as to what’s going on in these examples.

This seems like a bug to me. I looked at the scripts earlier and I definitely think it should work and it seems to have worked a couple times earlier in this thread even (here and here). Something going on. Hopefully @mike-lawrence and @ahartikainen figure it out!

But the inits are coming from the final draw of the warmup, so aren’t those guaranteed to meet the model contraints?

if it sometimes works and sometimes doesn’t, then it sounds like it’s hit a difficult model/data combo. there’s a loss of precision when you dump out the draw and then read it back in again. parameter values very close to one are particularly problematic.

for the problematic example, has the chain properly converged during the warmup phase?

I’m using sig_figs=18 during the warmup run

There is a cholesky_factor_corr parameter that, when I look at the summary from a run with warmup-and-sampling done together, has a constant value of 1 in the [1,1] entry, but when I look at the inits and test if they are precisely equal to one, the init value for that entry is.

I think so? If it weren’t, I’d expect more issues with the runs where I do warmup and sampling together. Also, I still observe the same behaviour when I bump the warmup iterations up to 1e4.

Here’s a gist containing the stan file, the code to generate data and the code to loop over seeds, generating new data and attempting to sample in the two ways for each seed.

Data and model are a standard hierarchical model where there are 10 subjects each observed with common gaussian measurement noise 10 times in each of 2 conditions, and the subjects’ intercept and condition effect are modelled as multivariate normal.

I just ran it for 100 seeds and where only about 5% of traditional runs encounter divergences, 96% of of the two-stage runs had them, and lots of them:

image

In R I used cmdstanr::write_stan_json()

Tried this and I get the same error (and just updated the cmdstanpy bug report to reflect this).

@ahartikainen you can ignore at least the part of this thread where I had trouble using cmdstanpy; turned out I was still using an old version and the latest version runs my examples fine. Starting a loop to confirm that the core issue of increased divergences manifests in cmdstanpy as it does with cmdstanr…

Yup, same behaviour in cmdstanpy as cmdstanr, so seems to be something core to cmdstan itself.

Ooof, this is very compelling. Definitely want to figure out what is happening here before the 2.26 release.

1 Like

Any insights so far? I’m happy to help explore but would need a little guidance on what to try.

Ooo, thanks for the reminder. I’ve got a window while some stuff builds. Will go check it out now.

1 Like

So @bbbales2 solved this by discovering that (1) I wasn’t composing the inits properly and (2) cmdstan was silently ignoring the inits because they weren’t in the right format and instead using random inits.

I think that the following (super kludgey, possibly-over-tidyversed) code should do the init composition better, handling single value parameters, vector parameters, and matrices. It should handle single and 2 dimensional arrays too, but I haven’t tried those. N-dimensional arrays are probably a fairly straightforward generalization, but I’ll leave that for another time.

get_inits = function(chain_id){
  warmup_draws = warmup$draws(inc_warmup=T)
  final_warmup_value = warmup_draws[iter_warmup,chain_id,2:(dim(warmup_draws)[3])]
  (
    final_warmup_value
    %>% tibble::as_tibble(
      .name_repair = function(names){
        dimnames(final_warmup_value)$variable
      }
    )
    %>% tidyr::pivot_longer(cols=dplyr::everything())
    %>% tidyr::separate(
      name
      , into = c('variable','index')
      , sep = "\\["
      , fill = 'right'
    )
    %>% dplyr::group_split(variable)
    %>% purrr::map(
      .f = function(x){
        (
          x
          %>% dplyr::mutate(
            index = stringr::str_replace(index,']','')
          )
          %>% tidyr::separate(
            index
            , into = c('first','second')
            , sep = ","
            , fill = 'right'
            , convert = T
          )
          %>% dplyr::arrange(first,second)
          %>% (function(x){
            out = list()
            if(all(is.na(x$second))){
              out[[1]] = x$value
            }else{
              out[[1]] = matrix(
                x$value
                , nrow = max(x$first)
                , ncol = max(x$second)
              )
            }
            names(out) = x$variable[1]
            return(out)
          })
        )         
      }
    )
    %>% unlist(recursive=F)
    %>% return()
  )
}

#example usage:
warmup = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed

  , iter_warmup = iter_warmup
  , save_warmup = T #for inits
  , sig_figs = 18

  , iter_sampling = 0
)

samples = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed+1

  , iter_warmup = 0
  , adapt_engaged = FALSE
  , inv_metric = warmup$inv_metric(matrix=F)
  , step_size = warmup$metadata()$step_size_adaptation

  , iter_sampling = iter_sample
  , init = get_inits
)

1 Like

the reason this happens is the intersection of the following:

  1. underlyingly, both data inits and parameter inits are stan::io::varcontext objects.

  2. inits files can contain variables that aren’t in the model.
    we need this because a part of the workflow is model expansion and model comparison, and it’s convenient to let all versions of a model run off the same data and inits files.

  3. unlike the data init file, which must contain definitions for all variables in the data block, the params init file doesn’t have to, therefore no complaints from CmdStan about missing parameter inits, similarly, no complaints about parameter inits for variable names that don’t match variables in the model.

further thoughts: missing functionality - the interfaces don’t have an operation to instantiate the model given the data and then initialize the parameters given the constraints and any supplied inits.

such an operation would detect missing data from the inits file. it could also report on the initial parameter values. thus one could easily validate the inits. but AFAIK, in C++ the resulting instantiated object can’t be saved such that you could just start adaptation and sampling from that point, so we don’t do this.

1 Like

Here is extract function for CmdStanPy. This returns output as dictionary containing ndim numpy arrays. We could also add option to remove diagnostics.

from cmdstanpy import CmdStanMCMC
from collections import defaultdict
import numpy as np
from typing import Optional

def extract(fit: CmdStanMCMC, parameters: Optional[list]=None, inc_warmup: bool=False):
    """Extract ndim dictionary from CmdStanPy fit.
    
    Parameters
    ----------
    fit: CmdStanMCMC
    parameters: list
    inc_warmup: bool
    
    Returns
    -------
    dict    
    """
    if parameters is None:
        parameters = fit.column_names
    
    if inc_warmup and not fit._save_warmup:
        inc_warmup = False
    
    data = fit.draws(inc_warmup=inc_warmup)
    draws, chains, *_ = data.shape
    column_groups = defaultdict(list)
    column_locs = defaultdict(list)
    # iterate flat column names
    for i, col in enumerate(fit.column_names):
        col_base, *col_tail = col.replace("]", "").replace("[", ",").split(",")
        if len(col_tail):
            # gather nD array locations
            column_groups[col_base].append(tuple(map(int, col_tail)))
        # gather raw data locations for each parameter
        column_locs[col_base].append(i)
    dims = {}
    for colname, col_dims in column_groups.items():
        # gather parameter dimensions (assumes dense arrays)
        dims[colname] = tuple(np.array(col_dims).max(0))
    
    sample = {}
    valid_base_cols = []
    # get list of parameters for extraction (basename) X.1.2 --> X
    for col in parameters:
        base_col = col.split("[")[0].split(".")[0]
        if base_col not in valid_base_cols:
            valid_base_cols.append(base_col)
    
    for key in valid_base_cols:
        ndim = dims.get(key, None)
        shape_location = column_groups.get(key, None)
        if ndim is not None:
            sample[key] = np.full((draws, chains, *ndim), np.nan)
        if shape_location is None:
            (i,) = column_locs[key]
            sample[key] = data[..., i]
        else:
            for i, shape_loc in zip(column_locs[key], shape_location):
                # location to insert extracted array
                shape_loc = tuple([Ellipsis] + [j - 1 for j in shape_loc])
                sample[key][shape_loc] = data[..., i]
    return sample

edit. inspiration from arviz/io_cmdstanpy.py at master · arviz-devs/arviz · GitHub

5 Likes

just filed this feature proposal on CmdStan: utility to check inits · Issue #967 · stan-dev/cmdstan · GitHub

feedback welcome

2 Likes