Saving & reusing adaptation in cmdstanr

bbbales2 · January 9, 2021, 2:54pm

Ooof, this is very compelling. Definitely want to figure out what is happening here before the 2.26 release.

mike-lawrence · January 12, 2021, 1:58pm

Any insights so far? I’m happy to help explore but would need a little guidance on what to try.

bbbales2 · January 12, 2021, 2:43pm

Ooo, thanks for the reminder. I’ve got a window while some stuff builds. Will go check it out now.

mike-lawrence · January 13, 2021, 10:24pm

So @bbbales2 solved this by discovering that (1) I wasn’t composing the inits properly and (2) cmdstan was silently ignoring the inits because they weren’t in the right format and instead using random inits.

I think that the following (super kludgey, possibly-over-tidyversed) code should do the init composition better, handling single value parameters, vector parameters, and matrices. It should handle single and 2 dimensional arrays too, but I haven’t tried those. N-dimensional arrays are probably a fairly straightforward generalization, but I’ll leave that for another time.

get_inits = function(chain_id){
  warmup_draws = warmup$draws(inc_warmup=T)
  final_warmup_value = warmup_draws[iter_warmup,chain_id,2:(dim(warmup_draws)[3])]
  (
    final_warmup_value
    %>% tibble::as_tibble(
      .name_repair = function(names){
        dimnames(final_warmup_value)$variable
      }
    )
    %>% tidyr::pivot_longer(cols=dplyr::everything())
    %>% tidyr::separate(
      name
      , into = c('variable','index')
      , sep = "\\["
      , fill = 'right'
    )
    %>% dplyr::group_split(variable)
    %>% purrr::map(
      .f = function(x){
        (
          x
          %>% dplyr::mutate(
            index = stringr::str_replace(index,']','')
          )
          %>% tidyr::separate(
            index
            , into = c('first','second')
            , sep = ","
            , fill = 'right'
            , convert = T
          )
          %>% dplyr::arrange(first,second)
          %>% (function(x){
            out = list()
            if(all(is.na(x$second))){
              out[[1]] = x$value
            }else{
              out[[1]] = matrix(
                x$value
                , nrow = max(x$first)
                , ncol = max(x$second)
              )
            }
            names(out) = x$variable[1]
            return(out)
          })
        )         
      }
    )
    %>% unlist(recursive=F)
    %>% return()
  )
}

#example usage:
warmup = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed

  , iter_warmup = iter_warmup
  , save_warmup = T #for inits
  , sig_figs = 18

  , iter_sampling = 0
)

samples = mod$sample(
  data = data_for_stan
  , chains = parallel_chains
  , parallel_chains = parallel_chains

  , seed = seed+1

  , iter_warmup = 0
  , adapt_engaged = FALSE
  , inv_metric = warmup$inv_metric(matrix=F)
  , step_size = warmup$metadata()$step_size_adaptation

  , iter_sampling = iter_sample
  , init = get_inits
)

mitzimorris · January 14, 2021, 1:32am

the reason this happens is the intersection of the following:

underlyingly, both data inits and parameter inits are stan::io::varcontext objects.
inits files can contain variables that aren’t in the model.
we need this because a part of the workflow is model expansion and model comparison, and it’s convenient to let all versions of a model run off the same data and inits files.
unlike the data init file, which must contain definitions for all variables in the data block, the params init file doesn’t have to, therefore no complaints from CmdStan about missing parameter inits, similarly, no complaints about parameter inits for variable names that don’t match variables in the model.

further thoughts: missing functionality - the interfaces don’t have an operation to instantiate the model given the data and then initialize the parameters given the constraints and any supplied inits.

such an operation would detect missing data from the inits file. it could also report on the initial parameter values. thus one could easily validate the inits. but AFAIK, in C++ the resulting instantiated object can’t be saved such that you could just start adaptation and sampling from that point, so we don’t do this.

ahartikainen · January 14, 2021, 8:31am

Here is extract function for CmdStanPy. This returns output as dictionary containing ndim numpy arrays. We could also add option to remove diagnostics.

from cmdstanpy import CmdStanMCMC
from collections import defaultdict
import numpy as np
from typing import Optional

def extract(fit: CmdStanMCMC, parameters: Optional[list]=None, inc_warmup: bool=False):
    """Extract ndim dictionary from CmdStanPy fit.
    
    Parameters
    ----------
    fit: CmdStanMCMC
    parameters: list
    inc_warmup: bool
    
    Returns
    -------
    dict    
    """
    if parameters is None:
        parameters = fit.column_names
    
    if inc_warmup and not fit._save_warmup:
        inc_warmup = False
    
    data = fit.draws(inc_warmup=inc_warmup)
    draws, chains, *_ = data.shape
    column_groups = defaultdict(list)
    column_locs = defaultdict(list)
    # iterate flat column names
    for i, col in enumerate(fit.column_names):
        col_base, *col_tail = col.replace("]", "").replace("[", ",").split(",")
        if len(col_tail):
            # gather nD array locations
            column_groups[col_base].append(tuple(map(int, col_tail)))
        # gather raw data locations for each parameter
        column_locs[col_base].append(i)
    dims = {}
    for colname, col_dims in column_groups.items():
        # gather parameter dimensions (assumes dense arrays)
        dims[colname] = tuple(np.array(col_dims).max(0))
    
    sample = {}
    valid_base_cols = []
    # get list of parameters for extraction (basename) X.1.2 --> X
    for col in parameters:
        base_col = col.split("[")[0].split(".")[0]
        if base_col not in valid_base_cols:
            valid_base_cols.append(base_col)
    
    for key in valid_base_cols:
        ndim = dims.get(key, None)
        shape_location = column_groups.get(key, None)
        if ndim is not None:
            sample[key] = np.full((draws, chains, *ndim), np.nan)
        if shape_location is None:
            (i,) = column_locs[key]
            sample[key] = data[..., i]
        else:
            for i, shape_loc in zip(column_locs[key], shape_location):
                # location to insert extracted array
                shape_loc = tuple([Ellipsis] + [j - 1 for j in shape_loc])
                sample[key][shape_loc] = data[..., i]
    return sample

edit. inspiration from arviz/io_cmdstanpy.py at master · arviz-devs/arviz · GitHub

mitzimorris · January 14, 2021, 3:35pm

just filed this feature proposal on CmdStan: utility to check inits · Issue #967 · stan-dev/cmdstan · GitHub

feedback welcome

Raoul-Kima · December 4, 2021, 11:19am

I’ve been trying the provided funtion. My intention is to continue sampling one of my models every night, while doing other stuff on the same computer during the day.
I run into the following problem:
Two days ago I run the model, then saved the output as an .RData file. Today I wanted to continue sampling from it. Problem: It complains that the .csv file is missing. It seems the CmdStanFit object in R doesn’t actually contain all the model results, but some are only in that csv file. Since the fit was from a differen r-session two day earlier that file was deleted (its a temporary file). The help of the the sample method of cmdstanr states that the .csv is deleted when the R-object representing the sampling results is garbage collected (which aparrently includes when the r-session is closed):
Possible Solution: use the parameter “output_dir” when calling the sampler to make the .csv permanent.

Raoul-Kima · February 4, 2022, 8:26am

Just a quick followup: this solution did work, but does of course require manually managing when to delete the file, since these can eat a lot of memory over time.

rok_cesnovar · February 4, 2022, 9:12am

I think this: How does CmdStanR work? • cmdstanr
should have everything you need.

So use

fit$save_object(file = path)

as stated in the vignette, this will ensure that everything has been read into R before saving the object using saveRDS(). So all the $summary() and the likes will work after restoring the RDS file.

Raoul-Kima · February 17, 2022, 9:38am

I’ve been using this function, including in a fairly complicated model.
You wrote you didn’t test whether 2d-arrays work correctly. I found they do not, the elements are not sorted correctly (adding byrow=TRUE in the call to matrix() fixes it, or alternatively removing the line starting with “arrange” probably does too).

I’ve taken the liberty to:

fix the bug
wrap your function into a closure to make it pure,
add support for arrays of any dimensionalty (including arrays of vectors or matrices)
add support for initialising from previous sampling runs, nut just from warmup runs
some minor refactoring.

Here’s the code:

createStanInitsPreviousRun=
    # function to initialize one stan run (via cmdstanr) from the endpoint of another
    # returns a function that returns inits.
    # this function is discussed on stan forums here: https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/51
    # known bugs:
    #   errors (complaining about missing argument) when used inside a dplyr pipeline (e.g. "fit %>% createStanInitsPreviousRun()") because non standard evaluation somehow breaks the creation of the promise because aparrently the arguments dont get evaulated at the right moment and so the data cant be captured in the environment.
    
    # # useage examples:
    # 
    # # initial pure warmup run
    # lastRunResults=mod$sample(
    #     data=data_for_stan
    #     ,chains=parallel_chains
    #     ,parallel_chains=parallel_chains
    #     # initialisation and adaptation settings
    #     ,iter_warmup=iter_warmup # run automatic initialisation and adaptation
    #     # sampling settings
    #     ,iter_sampling=0
    #     # settings to allow chains to be initialized from this one
    #     ,save_warmup=T # for getting inits from a pure warmup run
    #     ,sig_figs=18 # in crease output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
    #     ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
    # )
    # 
    # # followup pure sampling runs
    # lastRunResults=mod$sample(
    #     data=data_for_stan
    #     ,chains=parallel_chains
    #     ,parallel_chains=parallel_chains
    #     # initialisation and adaptation settings
    #     ,iter_warmup=0 # use prespecified adaptation and inits
    #     ,adapt_engaged=FALSE # use prespecified adaptation and inits
    #     ,inv_metric=lastRunResults$inv_metric(matrix=F)
    #     ,step_size=lastRunResults$metadata()$step_size_adaptation
    #     ,init=createInitsFunction(lastRunResults)
    #     # sampling settings
    #     ,iter_sampling=iter_sample # difference
    #     # settings to allow chains to be initialized from this one
    #     ,save_warmup=T # needs to be set to TRUE even if no warmup is executed, because otherwise the function that extracts draws from the run complains since it is set to include warmup draws.
    #     ,sig_figs=18 # increase output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
    #     ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
    # )
function(
    lastRunResults # cmdstanr fit which was fitted with include_warmup=T (this is required even if the number of warmup draws was zero, because of an API limitation)
)
{
    initsFunction = function(chain_id){ # modified from https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/44
        #   note: this function is not pure on its own. It captures lastRunResults from the enclosing environment, forming a pure closure.
        lastRun_draws = lastRunResults$draws(inc_warmup=T) # note that this row errors when used on a model which wasnt run with save_warmup=TRUE.
        final_warmup_value = lastRun_draws[dim(lastRun_draws)[1],chain_id,2:(dim(lastRun_draws)[3])]
        final_warmup_value %>% 
            tibble::as_tibble(
                .name_repair = function(names){
                    dimnames(final_warmup_value)$variable
                }
            ) %>% 
            tidyr::pivot_longer(cols=dplyr::everything()) %>% 
            tidyr::separate(
                name
                , into = c('variable','index')
                , sep = "\\["
                , fill = 'right'
            ) %>% 
            dplyr::mutate(
                index = 
                    stringr::str_replace(index,']','') %>% 
                    {ifelse(!is.na(.),.,"")}
            ) %>% 
            dplyr::group_split(variable) %>% 
            purrr::map(
                .f = function(x){
                    out= 
                        x %>%
                        dplyr::mutate(
                            index=strsplit(index,",",fixed=TRUE)
                            ,maxIndexes=
                                do.call(rbind,index) %>%
                                {array(as.numeric(.),dim=dim(.))} %>%
                                plyr::aaply(2,max) %>%
                                list()
                        ) %>%
                        {
                            array(
                                .$value
                                ,dim=
                                    if(length(.$maxIndexes[[1]])>0){
                                        .$maxIndexes[[1]]
                                    } else {
                                        1
                                    }
                            )
                        } %>% 
                        list()
                    names(out) = x$variable[1]
                    out
                }
            ) %>% 
            unlist(recursive=F)
    }
    force(lastRunResults) # force evaluation of lastRunResults (i could've just written "lastRunResults=lastRunResults") to change its promise to an actual value that can get captured by the closure (since R has lazy evaluation objects dont get passed by reference or by copy, but by passing a promise to the object, if the object changes before the promise is evaluated then the promise returns the changed object instead of the "promised" one. Thus in that case the closure captures the changed object instead of the one originally passed (by promise) to its environment.)
    initsFunction # return the closure
}

Please note that this implementation throws an error when used as an elemen of a pipe (see comments under “known bugs” in the code.

Edit:
removed some unnecessary parantheses in the code.
clarified one comment in code
removed mention of Mike-Lawrence in the code because I didn’t ask for his permission to include him and the post clearly references his contribution.

jfhawkin · June 7, 2022, 3:16pm

Would this method work to generate inputs to cmdstan? I have a model that runs with cmdstanr, but it was suggested I try running it with cmdstan. I’d like to use my cmdstanr results as a starting point.

I tried using inv_metric() from cmdstanr and converting it to json using toJSON, but cmdstan isn’t recognizing it. I get

Cannot get inverse Euclidean metric from input file.
Caught exception: 
variable does not exist; processing stage=read diag inv metric; variable name=inv_metric; base type=vector_d

using

./my_model random seed=76543 \
           data file=my_data_V3.json \
           output file=output_${i}.csv refresh=100 \
           sample num_warmup=1000 \
           algorithm=hmc engine=nuts max_depth=10 \
           metric=diag_e metric_file=my_metric1.json \
           stepsize=1 \
           adapt delta=0.9
           num_chains=4
	       num_threads=32

mitzimorris · June 7, 2022, 4:56pm

here are examples of JSON files for the metric - this should work in CmdStan, CmdStanR, and CmdStanPy

github.com

stan-dev/cmdstanpy/blob/develop/test/data/metric_dense.data.json

{
    "inv_metric" : [[ 0.926739, 0.0734898, -0.12395],
                    [0.0734898, 0.876038, -0.051543],
                    [-0.12395, -0.051543, 0.8274]] 
}

jfhawkin · June 8, 2022, 1:52am

Thanks! That seems to solve that problem. I still have the underlying model problems to solve but that’s off this topic.

Topic		Replies	Views
Checkpointing with CmdStanPy General	2	550	September 24, 2020
Passing a different mass matrix for each chain in cmdstanr Interfaces	7	706	January 13, 2021
Chkptstanr v0.2.0-alpha: checkpoint brms and cmdstanr sampling Publicity techniques , cmdstanr , brms	6	361	February 29, 2024
Saved mass matrix General	22	2699	December 12, 2022
Which Stan interface supports checkpointing? General cmdstan , rstan , stan , cmdstanr	6	898	May 5, 2021

Saving & reusing adaptation in cmdstanr

Related topics