Cmdstanpy and pandas dataframe

jbaranowski · March 23, 2021, 12:06pm

I’ve seen that the draws_pd() method is being removed. Are there any reasons for that? Will there be a replacement? I’m currently using cmdstanpy for teaching, and I’d like to know whether I will need to have a workaround prepared.

Funko_Unko · March 23, 2021, 2:17pm

I do not know the reason, but you can recover the dataframe in just a few lines of code. Will post code later.

Something like this should give you the full dataframe:


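import pandas as pd

def draws_df(self):
    # Read each chain's Stan CSV file (skipping '#' comment lines) and stack them
    return pd.concat(
        [pd.read_csv(csv, comment='#') for csv in self.runset.csv_files],
        ignore_index=True,
    )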

where self should be, e.g., the result of the sample method.

Some of the logic is of course missing, like being able to exclude warmup samples or columns. Some other logic can easily be added, like the chain id as an extra column.
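The chain id mentioned above could be added along these lines. This is a minimal sketch, not part of cmdstanpy: read_draws_csv and its drop_first argument are hypothetical names, and dropping leading rows only removes warmup if warmup draws were actually saved and you know how many rows they occupy in each file.

```python
import pandas as pd


def read_draws_csv(csv_files, drop_first=0):
    """Concatenate Stan CSV files into one dataframe with a 1-based 'chain' column.

    drop_first is a hypothetical convenience argument: the number of leading
    rows to discard from each file (e.g. saved warmup draws).
    """
    frames = []
    for chain_id, path in enumerate(csv_files, start=1):
        # comment='#' skips the metadata lines that Stan writes into its CSV output
        chain_df = pd.read_csv(path, comment="#").iloc[drop_first:].copy()
        chain_df.insert(0, "chain", chain_id)
        frames.append(chain_df)
    return pd.concat(frames, ignore_index=True)
```

With a fit object this would presumably be called as read_draws_csv(fit.runset.csv_files).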

ahartikainen · March 24, 2021, 9:57am

There are a couple of ways to get the dataframe. I recommend going with the ArviZ solution:

import arviz as az

# 1-based indexing
with az.rc_context({"data.index_origin": 1}):
    idata = az.from_cmdstanpy(fit)
    
df = idata.to_dataframe()
# by default, group name is added to column name
df.columns = [item if isinstance(item, str) else item[1] for item in df.columns]

Then there is the possibility of using fit.draws():

import numpy as np
import pandas as pd

# Rows are stacked chain by chain, so the draw ids repeat once per chain
df = pd.DataFrame(fit.draws(concat_chains=True), columns=fit.column_names)
df["draw"] = np.tile(np.arange(1, 1 + fit.num_draws_sampling), fit.chains)
df["chain"] = np.repeat(np.arange(1, 1 + fit.chains), fit.num_draws_sampling)
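To see why tile goes with the draw ids and repeat with the chain ids, here is a small sketch on synthetic data. It assumes the concatenated draws are stacked chain after chain (all draws of chain 1, then all draws of chain 2, and so on); the array and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

num_draws, num_chains = 3, 2
column_names = ["lp__", "theta"]  # illustrative names only

# Fake draws shaped (draws, chains, columns)
draws_3d = np.arange(
    num_draws * num_chains * len(column_names), dtype=float
).reshape(num_draws, num_chains, len(column_names))

# Stack chain after chain: move the chain axis first, then flatten
flat = draws_3d.transpose(1, 0, 2).reshape(num_chains * num_draws, -1)

df = pd.DataFrame(flat, columns=column_names)
df["draw"] = np.tile(np.arange(1, 1 + num_draws), num_chains)    # 1,2,3,1,2,3
df["chain"] = np.repeat(np.arange(1, 1 + num_chains), num_draws)  # 1,1,1,2,2,2
```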

mitzimorris · March 24, 2021, 3:30pm

We can keep draws_pd if that’s what’s easy for students to work with.

A pandas dataframe is not ideal if your Stan program has a container variable, i.e., an array, vector, matrix, array of vectors, etc., in exactly the same way that the CSV output is not ideal: it flattens the structure, and this is only problematic if you’ve got a 2D (or higher-dimensional) structure.

The methods stan_variable and stan_variables retain the proper structure by keeping things as a numpy.ndarray.
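As a rough illustration of the flattening, the sketch below rebuilds a 2x2 matrix variable from wide dataframe columns. It assumes columns are named like theta[1,1] and laid out in column-major order (first index varying fastest); unflatten is a hypothetical helper, and the data are synthetic. With stan_variable this manual step should not be needed.

```python
import numpy as np
import pandas as pd

# Synthetic example: 3 draws of a 2x2 matrix theta, flattened column-major
cols = ["theta[1,1]", "theta[2,1]", "theta[1,2]", "theta[2,2]"]
wide = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=cols)


def unflatten(df, name, shape):
    """Rebuild an array of shape (draws, *shape) from columns name[i,j,...]."""
    picked = [c for c in df.columns if c.startswith(name + "[")]
    values = df[picked].to_numpy()
    # order="F" makes the first index within each draw vary fastest
    return values.reshape((len(df), *shape), order="F")


theta = unflatten(wide, "theta", (2, 2))  # theta[d, i, j] is element (i+1, j+1) of draw d
```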

Since plotting and diagnostics are all done at the individual container element value, I can totally see why draws_pd is useful.

How about if I just file an issue to get rid of the “will be removed” message?

jbaranowski · March 24, 2021, 3:33pm

Thanks, I would be grateful, as it makes my life easier :)
Pandas data frames are helpful with making multiple plots etc.
Also this is a format students are already familiar with.

mitzimorris · March 24, 2021, 4:01pm

OriolAbril · March 24, 2021, 4:52pm

I don’t know how much customization is needed for the plots, but I’d recommend checking ArviZ’s plotting capabilities. ArviZ uses an xarray-based data structure, which is what we thought was best suited for the task at hand; xarray is basically an n-dimensional version of pandas. We are working hard on making the cmdstanpy->arviz conversion straightforward, and after conversion you’ll be able to use an HTML repr if working from Jupyter-like environments (you’ll see an example in the conversion docs I linked).

jbaranowski · March 24, 2021, 5:52pm

I like ArviZ, but I really miss the ability to use histograms instead of KDEs.
Either way, I plan to use it further down the course, but as I mentioned, pandas is more familiar initially.

OriolAbril · March 24, 2021, 6:07pm

You should still be able to use histograms in all plots (or only in most of them; I’m not completely sure about that).

In plot_posterior, for example, you can use kind="hist" to get a histogram (the default is auto, meaning KDE for continuous variables and histogram for discrete ones); the same goes for plot_dist. This exact form won’t work with plot_trace, where kind chooses between trace and rank plots for the right column, but I think it uses plot_dist under the hood, so there may be some kwargs that enable this (kwargs are not very well documented for now, though, so we’d have to look at the source code).

Do you think it would be useful to have an rcParam to set the default kind to hist?

jbaranowski · March 25, 2021, 9:19am

Well, probably yes. Histograms are less reliant on things happening under the hood, as KDEs still need to compute a bandwidth somehow. With a histogram you can see everything directly in the image. It makes sharing results easier.

Funko_Unko · March 26, 2021, 11:33am

Have there been bad experiences with pandas MultiIndex?

Edit: Just realized, this doesn’t really help.

OriolAbril · March 27, 2021, 4:34pm

@jbaranowski we have added an rcParam to have density plots of continuous variables default to histograms instead of KDE.

This is already in the development version and will be available starting with the next release: you can add plot.density_kind : hist to your arvizrc file or set rcParams["plot.density_kind"] = "hist" at the beginning of your scripts/notebooks to use it. Note, however, that not all functions allow customizing the number of bins; in some cases it is hardcoded to "auto" from numpy.
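For reference, numpy’s "auto" setting picks the number of bins from the data itself. The sketch below uses only numpy on synthetic draws (a stand-in for posterior samples) to show what a histogram with that setting produces; it does not reproduce how ArviZ draws its plots.

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(size=1000)  # synthetic stand-in for posterior draws

# bins="auto" lets numpy choose the bin edges from the data
counts, edges = np.histogram(draws, bins="auto")

# Normalise the counts so that the bars integrate to one, like a density plot
density = counts / (counts.sum() * np.diff(edges))
```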

As the data is inherently multidimensional, representing it as a 2D table has to rely on conventions involving multiindexes and column names, and these are not necessarily clear or unique. In fact, draws_pd itself does not use a multiindex; it uses a wide format, where columns representing different variables are no different from columns representing different dimensions of the same variable. Using the actual shape of the data directly needs no such conventions. Also note that any of the dimensions in the n-D labeled arrays can be indexed using a pandas multiindex. xarray is still quite a new library with its own limitations, but we have found this approach to have many advantages when it comes to using and computing with Stan results. See for instance this example (the section where apply_ufunc is used) or this other example about labels and sorting.
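To make the point about conventions concrete, here is one possible way of turning wide column names into a pandas MultiIndex. The naming pattern (name[i,j]), the level names, and the split_name helper are assumptions for illustration; other conventions would be equally defensible, which is the difficulty described above.

```python
import re

import numpy as np
import pandas as pd

# Synthetic wide-format draws: one scalar and one length-2 vector
wide = pd.DataFrame(
    np.arange(6.0).reshape(2, 3), columns=["mu", "theta[1]", "theta[2]"]
)


def split_name(col):
    """Split 'theta[1]' into ('theta', '1'); scalars get an empty index."""
    match = re.fullmatch(r"(\w+)\[([\d,]+)\]", col)
    return (match.group(1), match.group(2)) if match else (col, "")


multi = wide.copy()
multi.columns = pd.MultiIndex.from_tuples(
    [split_name(c) for c in wide.columns], names=["variable", "index"]
)

theta = multi["theta"]  # all elements of theta, one column per index
```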

I have no intention of entering a debate on data structures, but I figured sharing some of our rationale could be useful. This is not a “stop using dataframes” statement, only a two-sentence explanation of why we have found it more convenient to go with xarray as the data structure to store samples. And the word “samples” is important here: az.summary generates a transposed version of this same wide dataframe format, because in that case we find this representation of the data more convenient.

Funko_Unko · March 27, 2021, 5:48pm

Thank you for the explanation. What I actually meant was nothing more than “Ah, just tried a pandas MultiIndex for samples and it kind of gets messier, not better”.

I’ll have a look at how arviz/xarray handles the problem, thanks!

jbaranowski · March 27, 2021, 7:40pm

@OriolAbril Saying that this community is amazing is an understatement. Thank you very much for addressing the matter.

Regarding pandas, I don’t think that it is better than xarray, but it:

- is familiar to my students;
- generates pretty markdown tables in Jupyter.

But I see that I must start moving more and more to Arviz for further course work.
