Ooof, this is very compelling. Definitely want to figure out what is happening here before the 2.26 release.
Any insights so far? Iām happy to help explore but would need a little guidance on what to try.
Ooo, thanks for the reminder. Iāve got a window while some stuff builds. Will go check it out now.
So @bbbales2 solved this by discovering that (1) I wasnāt composing the inits properly and (2) cmdstan was silently ignoring the inits because they werenāt in the right format and instead using random inits.
I think that the following (super kludgey, possibly-over-tidyversed) code should do the init composition better, handling single value parameters, vector parameters, and matrices. It should handle single and 2 dimensional arrays too, but I havenāt tried those. N-dimensional arrays are probably a fairly straightforward generalization, but Iāll leave that for another time.
get_inits = function(chain_id){
warmup_draws = warmup$draws(inc_warmup=T)
final_warmup_value = warmup_draws[iter_warmup,chain_id,2:(dim(warmup_draws)[3])]
(
final_warmup_value
%>% tibble::as_tibble(
.name_repair = function(names){
dimnames(final_warmup_value)$variable
}
)
%>% tidyr::pivot_longer(cols=dplyr::everything())
%>% tidyr::separate(
name
, into = c('variable','index')
, sep = "\\["
, fill = 'right'
)
%>% dplyr::group_split(variable)
%>% purrr::map(
.f = function(x){
(
x
%>% dplyr::mutate(
index = stringr::str_replace(index,']','')
)
%>% tidyr::separate(
index
, into = c('first','second')
, sep = ","
, fill = 'right'
, convert = T
)
%>% dplyr::arrange(first,second)
%>% (function(x){
out = list()
if(all(is.na(x$second))){
out[[1]] = x$value
}else{
out[[1]] = matrix(
x$value
, nrow = max(x$first)
, ncol = max(x$second)
)
}
names(out) = x$variable[1]
return(out)
})
)
}
)
%>% unlist(recursive=F)
%>% return()
)
}
#example usage:
warmup = mod$sample(
data = data_for_stan
, chains = parallel_chains
, parallel_chains = parallel_chains
, seed = seed
, iter_warmup = iter_warmup
, save_warmup = T #for inits
, sig_figs = 18
, iter_sampling = 0
)
samples = mod$sample(
data = data_for_stan
, chains = parallel_chains
, parallel_chains = parallel_chains
, seed = seed+1
, iter_warmup = 0
, adapt_engaged = FALSE
, inv_metric = warmup$inv_metric(matrix=F)
, step_size = warmup$metadata()$step_size_adaptation
, iter_sampling = iter_sample
, init = get_inits
)
the reason this happens is the intersection of the following:
-
underlyingly, both data inits and parameter inits are
stan::io::varcontext
objects. -
inits files can contain variables that arenāt in the model.
we need this because a part of the workflow is model expansion and model comparison, and itās convenient to let all versions of a model run off the same data and inits files. -
unlike the data init file, which must contain definitions for all variables in the data block, the params init file doesnāt have to, therefore no complaints from CmdStan about missing parameter inits, similarly, no complaints about parameter inits for variable names that donāt match variables in the model.
further thoughts: missing functionality - the interfaces donāt have an operation to instantiate the model given the data and then initialize the parameters given the constraints and any supplied inits.
such an operation would detect missing data from the inits file. it could also report on the initial parameter values. thus one could easily validate the inits. but AFAIK, in C++ the resulting instantiated object canāt be saved such that you could just start adaptation and sampling from that point, so we donāt do this.
Here is extract
function for CmdStanPy. This returns output as dictionary containing ndim numpy arrays. We could also add option to remove diagnostics.
from cmdstanpy import CmdStanMCMC
from collections import defaultdict
import numpy as np
from typing import Optional
def extract(fit: CmdStanMCMC, parameters: Optional[list]=None, inc_warmup: bool=False):
"""Extract ndim dictionary from CmdStanPy fit.
Parameters
----------
fit: CmdStanMCMC
parameters: list
inc_warmup: bool
Returns
-------
dict
"""
if parameters is None:
parameters = fit.column_names
if inc_warmup and not fit._save_warmup:
inc_warmup = False
data = fit.draws(inc_warmup=inc_warmup)
draws, chains, *_ = data.shape
column_groups = defaultdict(list)
column_locs = defaultdict(list)
# iterate flat column names
for i, col in enumerate(fit.column_names):
col_base, *col_tail = col.replace("]", "").replace("[", ",").split(",")
if len(col_tail):
# gather nD array locations
column_groups[col_base].append(tuple(map(int, col_tail)))
# gather raw data locations for each parameter
column_locs[col_base].append(i)
dims = {}
for colname, col_dims in column_groups.items():
# gather parameter dimensions (assumes dense arrays)
dims[colname] = tuple(np.array(col_dims).max(0))
sample = {}
valid_base_cols = []
# get list of parameters for extraction (basename) X.1.2 --> X
for col in parameters:
base_col = col.split("[")[0].split(".")[0]
if base_col not in valid_base_cols:
valid_base_cols.append(base_col)
for key in valid_base_cols:
ndim = dims.get(key, None)
shape_location = column_groups.get(key, None)
if ndim is not None:
sample[key] = np.full((draws, chains, *ndim), np.nan)
if shape_location is None:
(i,) = column_locs[key]
sample[key] = data[..., i]
else:
for i, shape_loc in zip(column_locs[key], shape_location):
# location to insert extracted array
shape_loc = tuple([Ellipsis] + [j - 1 for j in shape_loc])
sample[key][shape_loc] = data[..., i]
return sample
edit. inspiration from arviz/io_cmdstanpy.py at master Ā· arviz-devs/arviz Ā· GitHub
just filed this feature proposal on CmdStan: utility to check inits Ā· Issue #967 Ā· stan-dev/cmdstan Ā· GitHub
feedback welcome
Iāve been trying the provided funtion. My intention is to continue sampling one of my models every night, while doing other stuff on the same computer during the day.
I run into the following problem:
Two days ago I run the model, then saved the output as an .RData file. Today I wanted to continue sampling from it. Problem: It complains that the .csv file is missing. It seems the CmdStanFit object in R doesnāt actually contain all the model results, but some are only in that csv file. Since the fit was from a differen r-session two day earlier that file was deleted (its a temporary file). The help of the the sample method of cmdstanr states that the .csv is deleted when the R-object representing the sampling results is garbage collected (which aparrently includes when the r-session is closed):
Possible Solution: use the parameter āoutput_dirā when calling the sampler to make the .csv permanent.
Just a quick followup: this solution did work, but does of course require manually managing when to delete the file, since these can eat a lot of memory over time.
I think this: How does CmdStanR work? ⢠cmdstanr
should have everything you need.
So use
fit$save_object(file = path)
as stated in the vignette, this will ensure that everything has been read into R before saving the object using saveRDS()
. So all the $summary() and the likes will work after restoring the RDS file.
Iāve been using this function, including in a fairly complicated model.
You wrote you didnāt test whether 2d-arrays work correctly. I found they do not, the elements are not sorted correctly (adding byrow=TRUE in the call to matrix() fixes it, or alternatively removing the line starting with āarrangeā probably does too).
Iāve taken the liberty to:
- fix the bug
- wrap your function into a closure to make it pure,
- add support for arrays of any dimensionalty (including arrays of vectors or matrices)
- add support for initialising from previous sampling runs, nut just from warmup runs
- some minor refactoring.
Hereās the code:
createStanInitsPreviousRun=
# function to initialize one stan run (via cmdstanr) from the endpoint of another
# returns a function that returns inits.
# this function is discussed on stan forums here: https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/51
# known bugs:
# errors (complaining about missing argument) when used inside a dplyr pipeline (e.g. "fit %>% createStanInitsPreviousRun()") because non standard evaluation somehow breaks the creation of the promise because aparrently the arguments dont get evaulated at the right moment and so the data cant be captured in the environment.
# # useage examples:
#
# # initial pure warmup run
# lastRunResults=mod$sample(
# data=data_for_stan
# ,chains=parallel_chains
# ,parallel_chains=parallel_chains
# # initialisation and adaptation settings
# ,iter_warmup=iter_warmup # run automatic initialisation and adaptation
# # sampling settings
# ,iter_sampling=0
# # settings to allow chains to be initialized from this one
# ,save_warmup=T # for getting inits from a pure warmup run
# ,sig_figs=18 # in crease output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
# ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
# )
#
# # followup pure sampling runs
# lastRunResults=mod$sample(
# data=data_for_stan
# ,chains=parallel_chains
# ,parallel_chains=parallel_chains
# # initialisation and adaptation settings
# ,iter_warmup=0 # use prespecified adaptation and inits
# ,adapt_engaged=FALSE # use prespecified adaptation and inits
# ,inv_metric=lastRunResults$inv_metric(matrix=F)
# ,step_size=lastRunResults$metadata()$step_size_adaptation
# ,init=createInitsFunction(lastRunResults)
# # sampling settings
# ,iter_sampling=iter_sample # difference
# # settings to allow chains to be initialized from this one
# ,save_warmup=T # needs to be set to TRUE even if no warmup is executed, because otherwise the function that extracts draws from the run complains since it is set to include warmup draws.
# ,sig_figs=18 # increase output bitdepth to match the one used internally by stan. prevents rounding errors when initializing a run from this one (e.g. erroneously initializing at zero causing infinite density)
# ,output_dir="stanSamples/" # makes stan save its output to a non-temporary directory (since the r-object returned doen't contain all the results, but instead links to the file). Often needed to be able to continue sampling a model in a separate r-session or after its associated r-object has been removed (to be precise: the standard temporary output file is deleten when the r-object representing the sampling results (and thus linking to this file) has bee ngrbage collected, or the R-session holding it closed.)
# )
function(
lastRunResults # cmdstanr fit which was fitted with include_warmup=T (this is required even if the number of warmup draws was zero, because of an API limitation)
)
{
initsFunction = function(chain_id){ # modified from https://discourse.mc-stan.org/t/saving-reusing-adaptation-in-cmdstanr/19166/44
# note: this function is not pure on its own. It captures lastRunResults from the enclosing environment, forming a pure closure.
lastRun_draws = lastRunResults$draws(inc_warmup=T) # note that this row errors when used on a model which wasnt run with save_warmup=TRUE.
final_warmup_value = lastRun_draws[dim(lastRun_draws)[1],chain_id,2:(dim(lastRun_draws)[3])]
final_warmup_value %>%
tibble::as_tibble(
.name_repair = function(names){
dimnames(final_warmup_value)$variable
}
) %>%
tidyr::pivot_longer(cols=dplyr::everything()) %>%
tidyr::separate(
name
, into = c('variable','index')
, sep = "\\["
, fill = 'right'
) %>%
dplyr::mutate(
index =
stringr::str_replace(index,']','') %>%
{ifelse(!is.na(.),.,"")}
) %>%
dplyr::group_split(variable) %>%
purrr::map(
.f = function(x){
out=
x %>%
dplyr::mutate(
index=strsplit(index,",",fixed=TRUE)
,maxIndexes=
do.call(rbind,index) %>%
{array(as.numeric(.),dim=dim(.))} %>%
plyr::aaply(2,max) %>%
list()
) %>%
{
array(
.$value
,dim=
if(length(.$maxIndexes[[1]])>0){
.$maxIndexes[[1]]
} else {
1
}
)
} %>%
list()
names(out) = x$variable[1]
out
}
) %>%
unlist(recursive=F)
}
force(lastRunResults) # force evaluation of lastRunResults (i could've just written "lastRunResults=lastRunResults") to change its promise to an actual value that can get captured by the closure (since R has lazy evaluation objects dont get passed by reference or by copy, but by passing a promise to the object, if the object changes before the promise is evaluated then the promise returns the changed object instead of the "promised" one. Thus in that case the closure captures the changed object instead of the one originally passed (by promise) to its environment.)
initsFunction # return the closure
}
Please note that this implementation throws an error when used as an elemen of a pipe (see comments under āknown bugsā in the code.
Edit:
removed some unnecessary parantheses in the code.
clarified one comment in code
removed mention of Mike-Lawrence in the code because I didnāt ask for his permission to include him and the post clearly references his contribution.
Would this method work to generate inputs to cmdstan? I have a model that runs with cmdstanr, but it was suggested I try running it with cmdstan. Iād like to use my cmdstanr results as a starting point.
I tried using inv_metric()
from cmdstanr and converting it to json using toJSON
, but cmdstan isnāt recognizing it. I get
Cannot get inverse Euclidean metric from input file.
Caught exception:
variable does not exist; processing stage=read diag inv metric; variable name=inv_metric; base type=vector_d
using
./my_model random seed=76543 \
data file=my_data_V3.json \
output file=output_${i}.csv refresh=100 \
sample num_warmup=1000 \
algorithm=hmc engine=nuts max_depth=10 \
metric=diag_e metric_file=my_metric1.json \
stepsize=1 \
adapt delta=0.9
num_chains=4
num_threads=32
here are examples of JSON files for the metric - this should work in CmdStan, CmdStanR, and CmdStanPy
Thanks! That seems to solve that problem. I still have the underlying model problems to solve but thatās off this topic.