Draws_pd returns diagnostic data columns by default on cmdstanpy 1.2

Hi all,

When calling [CmdStanMCMC].draws_pd("foo"), I get a data frame with the columns chain__, iter__, and draw__ in addition to "foo".

When calling [CmdStanMCMC].draws_pd(), I get a data frame with the columns chain__, iter__, draw__, lp__, accept_stat__, stepsize__, treedepth__, n_leapfrog__, divergent__, and energy__.

Is this the new behaviour of draws_pd? If I don’t want any of the diagnostic data columns returned, what should I call draws_pd with?

print(cmdstanpy.show_versions())

INSTALLED VERSIONS
---------------------
python: 3.9.15 (main, Nov 24 2022, 14:31:59) 
[GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 4.19.91-012.ali4000.alios7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
cmdstan_folder: /root/.cmdstan/cmdstan-2.33.1
cmdstan: (2, 33)
cmdstanpy: 1.2.0
pandas: 2.1.2
xarray: None
tqdm: 4.65.0
numpy: 1.24.2

Thanks,

Andy

We added those columns so that the chains can be reconstructed if required (e.g., if you want to group by them). Whether the vars argument should also let you exclude them is something I didn’t consider during implementation.
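To illustrate what grouping by these columns could look like, here is a minimal sketch. It does not run a model: the data frame is a hand-made stand-in for draws_pd() output, and the values of "foo" are invented.

```python
import pandas as pd

# Synthetic stand-in for a draws_pd() frame: 2 chains x 3 iterations.
# The "foo" values are made up for illustration only.
draws = pd.DataFrame({
    "chain__": [1, 1, 1, 2, 2, 2],
    "iter__": [1, 2, 3, 1, 2, 3],
    "draw__": [1, 2, 3, 4, 5, 6],
    "foo": [0.1, 0.3, 0.2, 1.1, 0.9, 1.0],
})

# Per-chain mean of "foo", using chain__ as the grouping key.
per_chain_mean = draws.groupby("chain__")["foo"].mean()

# Rebuild an (iterations x chains) table of "foo".
by_chain = draws.pivot(index="iter__", columns="chain__", values="foo")

print(per_chain_mean)
print(by_chain)
```

Without chain__ and iter__ in the frame, neither the per-chain summary nor the pivot would be possible from the flat list of draws.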

I’d like the opinions of @mitzimorris and @roualdes, who requested the feature originally.

Recently, for inclusion in CmdStanPy 1.2, I requested that draws_pd() include the columns chain__, iter__, and draw__; see CmdStanPy issue #676. It’s my understanding that CmdStanPy versions prior to 1.2 also included the columns lp__, …, and energy__.

My thinking behind the inclusion of all columns, including diagnostic and chain related information, is that more information by default is better. This is especially true, in my opinion, since from a user’s perspective it’s harder to get the extra information contained in these columns than it is for a user to remove those columns, e.g.

df = [CmdStanMCMC].draws_pd()
df = df.drop(columns=df.filter(regex="__$").columns)  # "__$" matches only names ending in "__"
# alternatively, call df.drop(columns=..., inplace=True) without reassigning df

See the pandas.DataFrame.drop() documentation.
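For anyone who wants to try this without fitting a model, here is a self-contained sketch of the same idea. The helper name drop_reserved_columns is hypothetical (it is not part of CmdStanPy), and the data frame is a synthetic stand-in for draws_pd() output with invented values.

```python
import pandas as pd


def drop_reserved_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose names end in '__'.

    Hypothetical helper for illustration; not a CmdStanPy function.
    """
    keep = [c for c in df.columns if not str(c).endswith("__")]
    return df[keep].copy()


# Synthetic stand-in for a draws_pd() frame; all values are made up.
draws = pd.DataFrame({
    "chain__": [1, 1, 2, 2],
    "iter__": [1, 2, 1, 2],
    "draw__": [1, 2, 3, 4],
    "lp__": [-3.2, -3.0, -3.5, -3.1],
    "foo": [0.1, 0.3, 1.1, 0.9],
    "bar[1]": [2.0, 2.1, 1.9, 2.2],
})

params_only = drop_reserved_columns(draws)
print(list(params_only.columns))  # ['foo', 'bar[1]']
```

Matching on the "__" suffix, rather than on "__" anywhere in the name, leaves a parameter with a double underscore in the middle of its name untouched.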

A more specific reason I filed issue #676 is that CmdStanR by default includes the columns chain__, iter__, and draw__ in a draws data frame, e.g. [CmdStanMCMC]$draws(format = "df"). So I saw this as a step towards better aligning the CmdStan* interfaces.

If desired, @akcchoi, please open an issue on the CmdStanPy GitHub repository to discuss adding a flag, or some such option, to exclude the diagnostic and/or chain information.

Thanks. Will open an issue.