I’m asking this here instead of using the issue tracker since I’m not entirely sure whether this is a bug or a case of PEBCAK.
I am fitting a model with cmdstanpy, as follows:
sample_args = {'iter': 100, 'warmup': 10, 'chains': 1, 'seed': 42, 'adapt_delta': 0.8, 'max_treedepth': 16, 'save_warmup': True}
fit_model = sample_cmdstanpy(my_model, input_data, sample_args)
(Please disregard the low numbers of draws and chains; this is just to illustrate the problem, which also occurs for more realistic values of these parameters.)
After the fitting/sampling is completed, I attempt to extract a variable (created in the generated quantities block of the model) from the resulting stanfit object (to be precise, the object type is cmdstanpy.stanfit.CmdStanMCMC). This is done using
my_variable = fit_model.stan_variable('quantity_of_interest')
This causes the following exception:
File ".../venv/lib/python3.8/site-packages/cmdstanpy/stanfit.py", line 762, in stan_variable
self._draws[
ValueError: cannot reshape array of size 561400 into shape (110,5614)
I expect an array of size 5614 for each draw. Since I am asking for 100 post-warmup draws + 10 warmup draws (110 x 5614 = 617540 entries), and exactly 10 x 5614 = 56140 entries are missing somewhere along the way, this suggests to me that the warmup draws are somehow lost.
I should also say that the code in question worked (i.e., supplied all post-warmup draws) when save_warmup was set to False.
Digging deeper, I put a breakpoint in stan_variable, the code of which is:
def stan_variable(self, name: str) -> pd.DataFrame:
    """
    Return a new DataFrame which contains the set of post-warmup draws
    for the named Stan program variable. Flattens the chains.
    Underlyingly draws are in chain order, i.e., for a sample
    consisting of N chains of M draws each, the first M array
    elements are from chain 1, the next M are from chain 2,
    and the last M elements are from chain N.
    * If the variable is a scalar variable, the shape of the DataFrame is
      ( draws X chains, 1).
    * If the variable is a vector, the shape of the DataFrame is
      ( draws X chains, len(vector))
    * If the variable is a matrix, the shape of the DataFrame is
      ( draws X chains, size(dim 1) X size(dim 2) )
    * If the variable is an array with N dimensions, the shape of the
      DataFrame is ( draws X chains, size(dim 1) X ... X size(dim N))
    :param name: variable name
    """
    if name not in self._stan_variable_dims:
        raise ValueError('unknown name: {}'.format(name))
    self._assemble_draws()
    dim0 = self.num_draws * self.runset.chains
    dims = np.prod(self._stan_variable_dims[name])
    pattern = r'^{}(\[[\d,]+\])?$'.format(name)
    names, idxs = [], []
    for i, column_name in enumerate(self.column_names):
        if re.search(pattern, column_name):
            names.append(column_name)
            idxs.append(i)
    return pd.DataFrame(
        self._draws[
            self._draws_warmup:, :, idxs
        ].reshape((dim0, dims), order='A'),
        columns=names
    )
Checking the values of the variables here, dim0=110 and dims=5614. So it looks like in the final return statement (where the error occurs), the function takes the entries of self._draws starting at self._draws_warmup=10 (so, the 100 post-warmup draws) and tries to reshape them into shape (num_draws_including_warmup, num_values_generated_per_draw).
Even stan_variable's docstring indicates that it returns the post-warmup draws… so it seems to be performing as advertised, but not playing nicely with the save_warmup option.
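The size mismatch can be reproduced outside of cmdstanpy with a minimal NumPy sketch. It assumes the draws are stored as a (draws, chains, columns) array, which is my reading of the slice in stan_variable rather than something I have confirmed, and all names below are made up for illustration:

```python
import numpy as np

# Numbers from the run above; the (draws, chains, columns) layout is an assumption.
warmup, sampling, chains, width = 10, 100, 1, 5614

# Stand-in for self._draws when save_warmup is True: warmup + sampling rows.
draws = np.zeros((warmup + sampling, chains, width))

# The slice drops the warmup rows, leaving only the 100 post-warmup draws.
post_warmup = draws[warmup:, :, :]

expected_size = (warmup + sampling) * chains * width  # what a (110, 5614) shape needs
actual_size = post_warmup.size                        # what the slice actually holds
missing = expected_size - actual_size                 # exactly the warmup entries

try:
    # dim0 counts warmup draws, but the sliced array no longer contains them.
    post_warmup.reshape(((warmup + sampling) * chains, width), order='A')
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(actual_size, missing, error_message)
```

The reshape fails because the target shape is sized for 110 draws while the sliced array only holds 100.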
Is it possible that the line
dim0 = self.num_draws * self.runset.chains
should be
dim0 = self._draws_sampling * self.runset.chains
instead?
Or, am I missing something obvious and am just doing something wrong?
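To illustrate the change I have in mind, here is a small stand-alone sketch. It is not the cmdstanpy code: reshape_post_warmup is a hypothetical helper, the (draws, chains, columns) layout is an assumption, and I only exercise the single-chain case from my run.

```python
import numpy as np


def reshape_post_warmup(draws, num_warmup, num_sampling, chains):
    """Hypothetical stand-in for the final reshape in stan_variable.

    dim0 is computed from the number of sampling draws only, so it matches
    the rows that remain after the warmup draws are sliced off.
    """
    dim0 = num_sampling * chains
    kept = draws[num_warmup:, :, :]   # drop the warmup draws
    dims = kept.shape[2]              # number of values per draw
    return kept.reshape((dim0, dims), order='A')


# Toy data: 10 warmup + 100 sampling draws, 1 chain, 3 values per draw.
draws = np.arange(110 * 1 * 3, dtype=float).reshape(110, 1, 3)
result = reshape_post_warmup(draws, num_warmup=10, num_sampling=100, chains=1)
print(result.shape)
```

With dim0 based on the sampling draws alone, the reshape succeeds and the first returned row is the first post-warmup draw.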
Thanks a bunch!
Chai
P.S.
I could, if needed, attempt to put together a minimal example that reproduces the issue, but unless I’m misreading the Python code that raises the exception, I think the problem should be clear even without that.
- Operating System: Ubuntu 20.04.1
- CmdStan Version: 2.25
- CmdStanPy Version: 0.9.67
- Compiler/Toolkit: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0