Cmdstanpy and pandas dataframe

I’ve seen that the draws_pd() method is being removed. Is there a reason for that? Will there be a replacement? I’m currently using cmdstanpy for teaching, and I’d like to know whether I need to prepare a workaround.

I do not know the reason, but you can recover the dataframe in just a few lines of code. Will post code later.

Something like this should give you the full dataframe:

import pandas as pd

def draws_df(self):
    return pd.concat([
        pd.read_csv(csv, comment='#') for csv in self.runset.csv_files
    ])

where self should be e.g. a result of the sample method.

Some of the logic is of course missing, like being able to exclude warmup samples or columns. Some other logic can easily be added, like the chain id as an extra column.
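As a rough sketch of the "chain id as an extra column" idea: the helper name draws_df_with_chain is made up for illustration, and it assumes the CSV files are listed in chain order, which I have not verified for every cmdstanpy version.

```python
import pandas as pd


def draws_df_with_chain(csv_files):
    """Concatenate Stan CSV files and tag each row with a 1-based chain id.

    Assumption: csv_files is ordered by chain (first file is chain 1).
    """
    frames = []
    for chain_id, csv in enumerate(csv_files, start=1):
        # Lines starting with '#' are comments in Stan CSV output.
        frame = pd.read_csv(csv, comment="#")
        frame.insert(0, "chain", chain_id)
        frames.append(frame)
    # ignore_index gives one continuous row index across all chains.
    return pd.concat(frames, ignore_index=True)
```

With a fit object this would be called as draws_df_with_chain(fit.runset.csv_files). Excluding warmup rows is still not handled here.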

There are a couple of ways to get the dataframe. I recommend going with the ArviZ solution:

import arviz as az

# 1-based indexing
with az.rc_context({"data.index_origin": 1}):
    idata = az.from_cmdstanpy(fit)
df = idata.to_dataframe()
# by default, group name is added to column name
df.columns = [item if isinstance(item, str) else item[1] for item in df.columns]

Alternatively, you can use fit.draws():

import numpy as np
import pandas as pd
df = pd.DataFrame(fit.draws(concat_chains=True), columns=fit.column_names)
# 1-based draw and chain ids; chains are stacked one after another
df["draw"] = np.tile(np.arange(1, 1+fit.num_draws_sampling), fit.chains)
df["chain"] = np.repeat(np.arange(1, 1+fit.chains), fit.num_draws_sampling)

We can keep draws_pd if that’s what’s easy for students to work with.

A pandas dataframe is not ideal if your Stan program has a container variable, i.e., an array, vector, matrix, array of vectors, etc., in exactly the same way that the CSV output is not ideal: it flattens the structure, and this is only problematic if you’ve got a 2D structure.

The methods stan_variable and stan_variables retain the proper structure by keeping things as a numpy.ndarray.
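To make the flattening concrete, here is a minimal sketch of going from the flat columns back to a structured array. The unflatten helper is hypothetical (it is not part of cmdstanpy), and it assumes the columns are named like theta[1,2] with 1-based indices; in practice stan_variable does this for you.

```python
import re

import numpy as np
import pandas as pd


def unflatten(df, name):
    """Stack columns named like 'name[i,j]' into an array of shape (draws, ...)."""
    pattern = re.compile(rf"^{re.escape(name)}\[([\d,]+)\]$")
    cols = {}
    for col in df.columns:
        match = pattern.match(col)
        if match:
            # Convert the 1-based Stan indices to a 0-based tuple.
            idx = tuple(int(i) - 1 for i in match.group(1).split(","))
            cols[idx] = col
    if not cols:
        raise KeyError(name)
    ndim = len(next(iter(cols)))
    shape = tuple(max(idx[d] for idx in cols) + 1 for d in range(ndim))
    out = np.empty((len(df),) + shape)
    for idx, col in cols.items():
        # Fill one container element across all draws.
        out[(slice(None),) + idx] = df[col].to_numpy()
    return out


flat = pd.DataFrame({
    "theta[1,1]": [1.0, 10.0],
    "theta[2,1]": [2.0, 20.0],
    "theta[1,2]": [3.0, 30.0],
    "theta[2,2]": [4.0, 40.0],
})
theta = unflatten(flat, "theta")  # shape (2 draws, 2, 2)
```

The dataframe alone carries the 2x2 structure only through its column names, which is why the helper has to parse them.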

Since plotting and diagnostics are all done at the level of individual container elements, I can totally see why draws_pd is useful.

How about if I just file an issue to get rid of the “will be removed” message?

Thanks, I would be grateful, as it makes my life easier :)
Pandas data frames are helpful with making multiple plots etc.
Also this is a format students are already familiar with.


I don’t know how much customization is needed for the plots, but I’d recommend checking ArviZ’s plotting capabilities. ArviZ uses an xarray-based data structure, which is what we thought was best suited for the task at hand; xarray is basically an n-dimensional version of pandas. We are working hard on making the cmdstanpy->arviz conversion straightforward, and after conversion you’ll be able to use an HTML repr if working from Jupyter-like environments (you’ll see an example in the conversion docs I linked).

I like ArviZ, but I really miss the ability to use histograms instead of KDEs.
Either way I plan to use it further down the course, but as I mentioned pandas is more familiar initially.

You should still be able to use histograms in all plots (or at least in most; I’m not completely sure about that).

In plot_posterior, for example, you can pass kind="hist" to get a histogram (the default is auto, meaning kde for continuous and hist for discrete variables); the same goes for plot_dist. This exact form won’t work with plot_trace, where kind instead chooses between trace and rank plots for the right column, but I think it uses plot_dist under the hood, so there may be some kwargs that enable this (kwargs are not very well documented for now, though, so we’d have to look at the source code).

Do you think it would be useful to have an rcParam to set the default kind to hist?

Well, probably yes. Histograms rely less on things happening under the hood, since KDEs still need to compute a bandwidth somehow. With a histogram you can see everything directly in the image, which makes sharing results easier.

Have there been bad experiences with pandas MultiIndex?

Edit: Just realized, this doesn’t really help.

@jbaranowski we have added an rcParam to have density plots of continuous variables default to histograms instead of KDE.

This is already in the development version, and starting with the next release you can add plot.density_kind : hist to your arvizrc file or set rcParams["plot.density_kind"] = "hist" at the beginning of your scripts/notebooks to use it. Note, however, that not all functions allow customizing the number of bins; in some cases it is hardcoded to "auto" from numpy.
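For reference, this is roughly what the "auto" bin rule from numpy does on its own. It is only a sketch with simulated draws standing in for a posterior sample, and I am not claiming this is the exact call made inside any particular ArviZ function.

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(size=1000)  # stand-in for posterior draws of one parameter

# bins="auto" lets numpy pick the bin edges from the data.
edges = np.histogram_bin_edges(draws, bins="auto")
counts, _ = np.histogram(draws, bins=edges)

print(len(counts), "bins chosen automatically")
```

Computing the edges yourself like this is one way to control the binning if a plotting function does not expose it.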

As the data is inherently multidimensional, representing it as a 2D table has to rely on conventions about multiindexes and column names, and these are neither necessarily clear nor unique. In fact, draws_pd itself does not use a multiindex; it uses a wide format, where columns representing different variables are no different from columns representing different dimensions of the same variable. Using the actual shape of the data directly avoids this ambiguity. Also note that any of the dimensions in the n-D labeled arrays can be indexed using a pandas multiindex. xarray is still quite a new library with its own limitations, but we have found this approach to have many advantages when it comes to using and computing with Stan results. See for instance this example (the section where apply_ufunc is used) or this other example about labels and sorting.

I have no intention of entering a debate on data structures, but I figured sharing some of our rationale could be useful. This is not a “stop using dataframes” statement, only a two-sentence explanation of why we have found it more convenient to go with xarray as the data structure to store samples. And the word “samples” is important here: az.summary generates a transposed version of this same wide dataframe format, because in that case we find this representation of the data more convenient.

Thank you for the explanation. What I actually meant was nothing more than “Ah, just tried pandas Multi Index for samples and it kind of gets messier, not better”.

I’ll have a look at how arviz/xarray handles the problem, thanks!

@OriolAbril Saying that this community is amazing is an understatement. Thank you very much for addressing the matter.

Regarding pandas, I don’t think that it is better than xarray, but it is

  1. Familiar to my students.
  2. Generates pretty markdown tables in jupyter

But I see that I must start moving more and more to Arviz for further course work.