Saving & reusing adaptation in cmdstanr

I’m using sig_figs=18 during the warmup run

There is a cholesky_factor_corr parameter that, when I look at the summary from a run with warmup-and-sampling done together, has a constant value of 1 in the [1,1] entry, but when I look at the inits and test if they are precisely equal to one, the init value for that entry is.

I think so? If it weren’t, I’d expect more issues with the runs where I do warmup and sampling together. Also, I still observe the same behaviour when I bump the warmup iterations up to 1e4.

Here’s a gist containing the stan file, the code to generate data and the code to loop over seeds, generating new data and attempting to sample in the two ways for each seed.

Data and model are a standard hierarchical model where there are 10 subjects each observed with common gaussian measurement noise 10 times in each of 2 conditions, and the subjects’ intercept and condition effect are modelled as multivariate normal.

I just ran it for 100 seeds and where only about 5% of traditional runs encounter divergences, 96% of of the two-stage runs had them, and lots of them:

image

In R I used cmdstanr::write_stan_json()

Tried this and I get the same error (and just updated the cmdstanpy bug report to reflect this).

@ahartikainen you can ignore at least the part of this thread where I had trouble using cmdstanpy; turned out I was still using an old version and the latest version runs my examples fine. Starting a loop to confirm that the core issue of increased divergences manifests in cmdstanpy as it does with cmdstanr…

Yup, same behaviour in cmdstanpy as cmdstanr, so seems to be something core to cmdstan itself.

Ooof, this is very compelling. Definitely want to figure out what is happening here before the 2.26 release.

1 Like

Any insights so far? I’m happy to help explore but would need a little guidance on what to try.

Ooo, thanks for the reminder. I’ve got a window while some stuff builds. Will go check it out now.

1 Like

So @bbbales2 solved this by discovering that (1) I wasn’t composing the inits properly and (2) cmdstan was silently ignoring the inits because they weren’t in the right format and instead using random inits.

I think that the following (super kludgey, possibly-over-tidyversed) code should do the init composition better, handling single value parameters, vector parameters, and matrices. It should handle single and 2 dimensional arrays too, but I haven’t tried those. N-dimensional arrays are probably a fairly straightforward generalization, but I’ll leave that for another time.

get_inits = function(chain_id){
  warmup_draws = warmup$draws(inc_warmup=T)
  final_warmup_value = warmup_draws[iter_warmup,chain_id,2:(dim(warmup_draws)[3])]
  (
    final_warmup_value
    %>% tibble::as_tibble(
      .name_repair = function(names){
        dimnames(final_warmup_value)$variable
      }
    )
    %>% tidyr::pivot_longer(cols=dplyr::everything())
    %>% tidyr::separate(
      name
      , into = c('variable','index')
      , sep = "\\["
      , fill = 'right'
    )
    %>% dplyr::group_split(variable)
    %>% purrr::map(
      .f = function(x){
        (
          x
          %>% dplyr::mutate(
            index = stringr::str_replace(index,']','')
          )
          %>% tidyr::separate(
            index
            , into = c('first','second')
            , sep = ","
            , fill = 'right'
            , convert = T
          )
          %>% dplyr::arrange(first,second)
          %>% (function(x){
            out = list()
            if(all(is.na(x$second))){
              out[[1]] = x$value
            }else{
              out[[1]] = matrix(
                x$value
                , nrow = max(x$first)
                , ncol = max(x$second)
              )
            }
            names(out) = x$variable[1]
            return(out)
          })
        )         
      }
    )
    %>% unlist(recursive=F)
    %>% return()
  )
}

#example usage:
warmup = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed

  , iter_warmup = iter_warmup
  , save_warmup = T #for inits
  , sig_figs = 18

  , iter_sampling = 0
)

samples = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed+1

  , iter_warmup = 0
  , adapt_engaged = FALSE
  , inv_metric = warmup$inv_metric(matrix=F)
  , step_size = warmup$metadata()$step_size_adaptation

  , iter_sampling = iter_sample
  , init = get_inits
)

2 Likes

the reason this happens is the intersection of the following:

  1. underlyingly, both data inits and parameter inits are stan::io::varcontext objects.

  2. inits files can contain variables that aren’t in the model.
    we need this because a part of the workflow is model expansion and model comparison, and it’s convenient to let all versions of a model run off the same data and inits files.

  3. unlike the data init file, which must contain definitions for all variables in the data block, the params init file doesn’t have to, therefore no complaints from CmdStan about missing parameter inits, similarly, no complaints about parameter inits for variable names that don’t match variables in the model.

further thoughts: missing functionality - the interfaces don’t have an operation to instantiate the model given the data and then initialize the parameters given the constraints and any supplied inits.

such an operation would detect missing data from the inits file. it could also report on the initial parameter values. thus one could easily validate the inits. but AFAIK, in C++ the resulting instantiated object can’t be saved such that you could just start adaptation and sampling from that point, so we don’t do this.

2 Likes

Here is extract function for CmdStanPy. This returns output as dictionary containing ndim numpy arrays. We could also add option to remove diagnostics.

from cmdstanpy import CmdStanMCMC
from collections import defaultdict
import numpy as np
from typing import Optional

def extract(fit: CmdStanMCMC, parameters: Optional[list]=None, inc_warmup: bool=False):
    """Extract ndim dictionary from CmdStanPy fit.
    
    Parameters
    ----------
    fit: CmdStanMCMC
    parameters: list
    inc_warmup: bool
    
    Returns
    -------
    dict    
    """
    if parameters is None:
        parameters = fit.column_names
    
    if inc_warmup and not fit._save_warmup:
        inc_warmup = False
    
    data = fit.draws(inc_warmup=inc_warmup)
    draws, chains, *_ = data.shape
    column_groups = defaultdict(list)
    column_locs = defaultdict(list)
    # iterate flat column names
    for i, col in enumerate(fit.column_names):
        col_base, *col_tail = col.replace("]", "").replace("[", ",").split(",")
        if len(col_tail):
            # gather nD array locations
            column_groups[col_base].append(tuple(map(int, col_tail)))
        # gather raw data locations for each parameter
        column_locs[col_base].append(i)
    dims = {}
    for colname, col_dims in column_groups.items():
        # gather parameter dimensions (assumes dense arrays)
        dims[colname] = tuple(np.array(col_dims).max(0))
    
    sample = {}
    valid_base_cols = []
    # get list of parameters for extraction (basename) X.1.2 --> X
    for col in parameters:
        base_col = col.split("[")[0].split(".")[0]
        if base_col not in valid_base_cols:
            valid_base_cols.append(base_col)
    
    for key in valid_base_cols:
        ndim = dims.get(key, None)
        shape_location = column_groups.get(key, None)
        if ndim is not None:
            sample[key] = np.full((draws, chains, *ndim), np.nan)
        if shape_location is None:
            (i,) = column_locs[key]
            sample[key] = data[..., i]
        else:
            for i, shape_loc in zip(column_locs[key], shape_location):
                # location to insert extracted array
                shape_loc = tuple([Ellipsis] + [j - 1 for j in shape_loc])
                sample[key][shape_loc] = data[..., i]
    return sample

edit. inspiration from arviz/io_cmdstanpy.py at master · arviz-devs/arviz · GitHub

6 Likes

just filed this feature proposal on CmdStan: utility to check inits · Issue #967 · stan-dev/cmdstan · GitHub

feedback welcome

2 Likes

I’ve been trying the provided funtion. My intention is to continue sampling one of my models every night, while doing other stuff on the same computer during the day.
I run into the following problem:
Two days ago I run the model, then saved the output as an .RData file. Today I wanted to continue sampling from it. Problem: It complains that the .csv file is missing. It seems the CmdStanFit object in R doesn’t actually contain all the model results, but some are only in that csv file. Since the fit was from a differen r-session two day earlier that file was deleted (its a temporary file). The help of the the sample method of cmdstanr states that the .csv is deleted when the R-object representing the sampling results is garbage collected (which aparrently includes when the r-session is closed):
Possible Solution: use the parameter “output_dir” when calling the sampler to make the .csv permanent.

Just a quick followup: this solution did work, but does of course require manually managing when to delete the file, since these can eat a lot of memory over time.

I think this: How does CmdStanR work? • cmdstanr
should have everything you need.

So use

fit$save_object(file = path)

as stated in the vignette, this will ensure that everything has been read into R before saving the object using saveRDS(). So all the $summary() and the likes will work after restoring the RDS file.

I’ve been using this function, including in a fairly complicated model.
You wrote you didn’t test whether 2d-arrays work correctly. I found they do not, the elements are not sorted correctly (adding byrow=TRUE in the call to matrix() fixes it, or alternatively removing the line starting with “arrange” probably does too).

I’ve taken the liberty to:

  • fix the bug
  • wrap your function into a closure to make it pure,
  • add support for arrays of any dimensionalty (including arrays of vectors or matrices)
  • add support for initialising from previous sampling runs, nut just from warmup runs
  • some minor refactoring.

Here’s the code:

createStanInitsPreviousRun=
    # function to initialize one stan run (via cmdstanr) from the endpoint of another
    # returns a function that returns inits.
    # this function is discussed on stan forums here: https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/51
    # known bugs:
    #   errors (complaining about missing argument) when used inside a dplyr pipeline (e.g. "fit %>% createStanInitsPreviousRun()") because non standard evaluation somehow breaks the creation of the promise because aparrently the arguments dont get evaulated at the right moment and so the data cant be captured in the environment.
    
    # # useage examples:
    # 
    # # initial pure warmup run
    # lastRunResults=mod$sample(
    #     data=data_for_stan
    #     ,chains=parallel_chains
    #     ,parallel_chains=parallel_chains
    #     # initialisation and adaptation settings
    #     ,iter_warmup=iter_warmup # run automatic initialisation and adaptation
    #     # sampling settings
    #     ,iter_sampling=0
    #     # settings to allow chains to be initialized from this one
    #     ,save_warmup=T # for getting inits from a pure warmup run
    #     ,sig_figs=18 # in crease output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
    #     ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
    # )
    # 
    # # followup pure sampling runs
    # lastRunResults=mod$sample(
    #     data=data_for_stan
    #     ,chains=parallel_chains
    #     ,parallel_chains=parallel_chains
    #     # initialisation and adaptation settings
    #     ,iter_warmup=0 # use prespecified adaptation and inits
    #     ,adapt_engaged=FALSE # use prespecified adaptation and inits
    #     ,inv_metric=lastRunResults$inv_metric(matrix=F)
    #     ,step_size=lastRunResults$metadata()$step_size_adaptation
    #     ,init=createInitsFunction(lastRunResults)
    #     # sampling settings
    #     ,iter_sampling=iter_sample # difference
    #     # settings to allow chains to be initialized from this one
    #     ,save_warmup=T # needs to be set to TRUE even if no warmup is executed, because otherwise the function that extracts draws from the run complains since it is set to include warmup draws.
    #     ,sig_figs=18 # increase output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
    #     ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
    # )
function(
    lastRunResults # cmdstanr fit which was fitted with include_warmup=T (this is required even if the number of warmup draws was zero, because of an API limitation)
)
{
    initsFunction = function(chain_id){ # modified from https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/44
        #   note: this function is not pure on its own. It captures lastRunResults from the enclosing environment, forming a pure closure.
        lastRun_draws = lastRunResults$draws(inc_warmup=T) # note that this row errors when used on a model which wasnt run with save_warmup=TRUE.
        final_warmup_value = lastRun_draws[dim(lastRun_draws)[1],chain_id,2:(dim(lastRun_draws)[3])]
        final_warmup_value %>% 
            tibble::as_tibble(
                .name_repair = function(names){
                    dimnames(final_warmup_value)$variable
                }
            ) %>% 
            tidyr::pivot_longer(cols=dplyr::everything()) %>% 
            tidyr::separate(
                name
                , into = c('variable','index')
                , sep = "\\["
                , fill = 'right'
            ) %>% 
            dplyr::mutate(
                index = 
                    stringr::str_replace(index,']','') %>% 
                    {ifelse(!is.na(.),.,"")}
            ) %>% 
            dplyr::group_split(variable) %>% 
            purrr::map(
                .f = function(x){
                    out= 
                        x %>%
                        dplyr::mutate(
                            index=strsplit(index,",",fixed=TRUE)
                            ,maxIndexes=
                                do.call(rbind,index) %>%
                                {array(as.numeric(.),dim=dim(.))} %>%
                                plyr::aaply(2,max) %>%
                                list()
                        ) %>%
                        {
                            array(
                                .$value
                                ,dim=
                                    if(length(.$maxIndexes[[1]])>0){
                                        .$maxIndexes[[1]]
                                    } else {
                                        1
                                    }
                            )
                        } %>% 
                        list()
                    names(out) = x$variable[1]
                    out
                }
            ) %>% 
            unlist(recursive=F)
    }
    force(lastRunResults) # force evaluation of lastRunResults (i could've just written "lastRunResults=lastRunResults") to change its promise to an actual value that can get captured by the closure (since R has lazy evaluation objects dont get passed by reference or by copy, but by passing a promise to the object, if the object changes before the promise is evaluated then the promise returns the changed object instead of the "promised" one. Thus in that case the closure captures the changed object instead of the one originally passed (by promise) to its environment.)
    initsFunction # return the closure
}

Please note that this implementation throws an error when used as an elemen of a pipe (see comments under “known bugs” in the code.

Edit:
removed some unnecessary parantheses in the code.
clarified one comment in code
removed mention of Mike-Lawrence in the code because I didn’t ask for his permission to include him and the post clearly references his contribution.

4 Likes

Would this method work to generate inputs to cmdstan? I have a model that runs with cmdstanr, but it was suggested I try running it with cmdstan. I’d like to use my cmdstanr results as a starting point.

I tried using inv_metric() from cmdstanr and converting it to json using toJSON, but cmdstan isn’t recognizing it. I get

Cannot get inverse Euclidean metric from input file.
Caught exception: 
variable does not exist; processing stage=read diag inv metric; variable name=inv_metric; base type=vector_d

using

./my_model random seed=76543 \
           data file=my_data_V3.json \
           output file=output_${i}.csv refresh=100 \
           sample num_warmup=1000 \
           algorithm=hmc engine=nuts max_depth=10 \
           metric=diag_e metric_file=my_metric1.json \
           stepsize=1 \
           adapt delta=0.9
           num_chains=4
	       num_threads=32

here are examples of JSON files for the metric - this should work in CmdStan, CmdStanR, and CmdStanPy

1 Like

Thanks! That seems to solve that problem. I still have the underlying model problems to solve but that’s off this topic.