Draws_pd returns diagnostic data columns by default on cmdstanpy 1.2

akcchoi · November 14, 2023, 8:22am

Hi all,

When calling [CmdStanMCMC].draws_pd("foo"), I get a data frame with the columns chain__, iter__, and draw__ in addition to "foo".

When calling [CmdStanMCMC].draws_pd(), I get a data frame that also includes the columns chain__, iter__, draw__, lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, divergent__, and energy__.

Is this the new behaviour of draws_pd? If I don’t want any of the diagnostic data columns returned, what should I call draws_pd with?

print(cmdstanpy.show_versions())

INSTALLED VERSIONS
---------------------
python: 3.9.15 (main, Nov 24 2022, 14:31:59) 
[GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 4.19.91-012.ali4000.alios7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
cmdstan_folder: /root/.cmdstan/cmdstan-2.33.1
cmdstan: (2, 33)
cmdstanpy: 1.2.0
pandas: 2.1.2
xarray: None
tqdm: 4.65.0
numpy: 1.24.2

Thanks,

Andy

WardBrian · November 16, 2023, 2:43pm

We added those columns so that the chains can be reconstructed if required (e.g., if you wanted to group by them). Whether the vars argument should be able to exclude them is something I didn’t consider during implementation.
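
As a rough illustration of the group-by use case mentioned above, here is a minimal sketch. The DataFrame is a made-up stand-in for what draws_pd() returns (the values and the parameter name foo are assumptions, not real sampler output), so it runs without CmdStan installed.

```python
import pandas as pd

# Hypothetical stand-in for fit.draws_pd(); values are invented for illustration.
draws = pd.DataFrame({
    "chain__": [1, 1, 2, 2],
    "iter__": [1, 2, 1, 2],
    "draw__": [1, 2, 3, 4],
    "foo": [0.1, 0.2, 0.3, 0.4],
})

# Per-chain summary of one parameter, made possible by the chain__ column.
chain_means = draws.groupby("chain__")["foo"].mean()
print(chain_means)
```

Without chain__, the rows from different chains would be indistinguishable once stacked into a single frame.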

I’d like the opinions of @mitzimorris and @roualdes, who requested the feature originally.

roualdes · November 16, 2023, 5:42pm

Recently, for inclusion in CmdStanPy 1.2, I requested that draws_pd() include the columns chain__, iter__, and draw__; see cmdstanpy issue #676. It’s my understanding that CmdStanPy versions prior to 1.2 also included the columns lp__, …, and energy__.

My thinking behind including all columns, diagnostic and chain-related information alike, is that more information by default is better. This is especially true, in my opinion, because it is harder for a user to recover the extra information in these columns than it is to remove them, e.g.

df = [CmdStanMCMC].draws_pd()
# "__$" matches only names that end in a double underscore
df = df.drop(columns=df.filter(regex="__$").columns)  # or pass inplace=True instead of reassigning

See the pandas.DataFrame.drop() documentation.
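
For anyone who wants to try the column removal without a fitted model, here is a self-contained sketch. The DataFrame below is a hypothetical stand-in for the draws_pd() result (the column values and the parameter names foo and bar are made up), and it selects the columns to keep rather than dropping by regex.

```python
import pandas as pd

# Hypothetical stand-in for fit.draws_pd(); values are invented for illustration.
draws = pd.DataFrame({
    "chain__": [1, 1, 2, 2],
    "iter__": [1, 2, 1, 2],
    "draw__": [1, 2, 3, 4],
    "lp__": [-3.2, -2.9, -3.5, -3.0],
    "foo": [0.1, 0.2, 0.3, 0.4],
    "bar": [1.5, 1.4, 1.6, 1.3],
})

# Keep only columns whose names do not end in a double underscore.
keep = [name for name in draws.columns if not name.endswith("__")]
params = draws[keep]
print(list(params.columns))
```

Matching on the trailing double underscore assumes no model parameter name ends in "__".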

A more specific reason I filed issue #676 is that CmdStanR by default includes the analogous columns .chain, .iteration, and .draw in a draws data frame, e.g. [CmdStanMCMC]$draws(format = "df"). So I saw this as a step toward better aligning the CmdStan* interfaces.

If desired, @akcchoi, please open an issue on the CmdStanPy GitHub repository to discuss adding a flag, or some such option, to exclude the diagnostic and/or chain information.

akcchoi · November 17, 2023, 5:52am

Thanks. Will open an issue.
