Is there a different, less memory-intensive way to plot a posterior predictive check?

I was finally able to generate quantities in order to do a posterior predictive check. When I run the below code, I get an memory error.

az.from_cmdstanpy(
    posterior=model_ppc, 
    posterior_predictive=["y_rep"], 
    dtypes={"y_rep": int}
)

The error is

Traceback (most recent call last):

  File "C:\Users\JORDAN~1.GIT\AppData\Local\Temp/ipykernel_9312/1949767068.py", line 1, in <module>
    idata = az.from_cmdstanpy(

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\io_cmdstanpy.py", line 813, in from_cmdstanpy
    return CmdStanPyConverter(

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\io_cmdstanpy.py", line 441, in to_inference_data
    "posterior": self.posterior_to_xarray(),

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\base.py", line 65, in wrapped
    return func(cls)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\io_cmdstanpy.py", line 95, in posterior_to_xarray
    return self.posterior_to_xarray_pre_v_0_9_68()

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\base.py", line 65, in wrapped
    return func(cls)

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\io_cmdstanpy.py", line 508, in posterior_to_xarray_pre_v_0_9_68
    data, data_warmup = _unpack_frame(

  File "C:\Users\JORDAN.HOWELL.GITDIR\AppData\Local\Continuum\anaconda3\envs\cmdstan\lib\site-packages\arviz\data\io_cmdstanpy.py", line 673, in _unpack_frame
    data = fit.draws(inc_warmup=save_warmup)

  File "c:\users\jordan.howell.gitdir\src\cmdstanpy\cmdstanpy\stanfit.py", line 1491, in draws
    self._assemble_generated_quantities()

  File "c:\users\jordan.howell.gitdir\src\cmdstanpy\cmdstanpy\stanfit.py", line 1847, in _assemble_generated_quantities
    gq_sample = np.empty(

MemoryError: Unable to allocate 332. GiB for an array with shape (1000, 4, 11153250) and data type float64

Is there a way to do this plot with seaborn maybe?

re: MemoryError: Unable to allocate 332. GiB for an array with shape (1000, 4, 11153250) and data type float64, it appears that cmdstanpy's _assemble_generated_quantities is attempting to allocate a very large array.

From a quick look at the cmdstanpy source code it looks like the shape of the array is (num_draws, self.chains, len(self.column_names)). The sizes of the two former dimensions seem reasonable: num_draws = 1000, chains = 4. The third dimension seems very large len(self.column_names) = 11153250. The cmdstanpy code documents the column_names as

Names of information items returned by sampler for each draw. Includes approximation information and names of model parameters and computed quantities.

Do you expect your model to have millions of parameters or generated quantities per draw? Is there a way to rework the model to reduce this number of by a factor of 100 or more? (so it might fit in a modest 3.4 GiB of ram)

I am not familiar with ArviZ, another possibility may be that for some reason ArviZ has decided you need 11 million columns per draw on your behalf.

Another possibility, that would be more challenging to do, is that perhaps that there is no need to have all the data in memory at the same time in order to compute the plot. E.g. arguably ArviZ and cmdstanpy could stream the data in chunks off disk and summarise the information needed to output the plot, without trying to load it all into memory at once. If that is possible, it might require a lot of re-thinking and re-working the code across both cmdstanpy and ArviZ packages.

2 Likes

Here’s a few ideas for how ArviZ and cmdstanpy could make it easier to process larger datasets:

  1. on the ArviZ side in from_cmdstanpy and from_cmdstan there’s a pattern of work that appears to go: (i) figure out which subset of columns we want, (ii) load all the columns into RAM (either by calling cmdstanpy draws() method or reading the entire CSV), then (iii) pluck out the subset of columns we are interested in. This is wasteful, in extreme cases suppose ArviZ only wants 10 out of 1000 columns it may end up using 100x more memory than necessary to load irrelevant data into RAM.
  2. on the cmdstanpy side there is draws_xr method that accepts an optional vars argument that can contain the list of desired column names. So in principle, if the user has xarray installed, ArviZ could call draws_xr to only extract the subset of relevant cols. But it also appears like under the hood that cmdstanpy implements draws_xr by first reading all the columns into RAM and then post-processing that to pluck out the requested cols, so some work might need to happen on cmdstanpy side to make it more efficient. That could make cmdstanpy more complicated, as then as well as “draws not loaded” or “draws loaded and cached” there would be many states of “subset of columns of draws loaded and cached”.
  3. there was a feature request for cmdstanpy Add a more memory efficient access to chain's samples · Issue #218 · stan-dev/cmdstanpy · GitHub , the status quo solution for memory efficient samples is for the user to parse the cmdstan output CSV file without using the cmdstanpy draws() method. ArviZ is indeed parsing the cmdstan CSV files directly in its from_cmdstan interface, but it implements the very simple approach that loads everything in the CSV file into RAM.
  4. perhaps cmdstanpy could expose an iterdraws style method that instead of returning a potentially huge numpy array or xarray, would instead iterate over individual rows, streaming them off disk while reading through the CSV file? The code for this already appears to exist internally inside cmdpystan's _assemble_draws – it iterates over the rows of the CSV one by one and puts them into a huge array. If this function was refactored to tease apart the “produce rows” and “consume rows” parts to expose the row iterator as part of the user-facing cmdstanpy API, a user could consume the rows (perhaps immediately summarising them to compute means of each column) without needing any large arrays to be allocated.