Adding Pandas as a PyStan dependency?


#1

Using pandas DataFrames in PyStan seems essential if PyStan is going to match behavior in RStan (e.g., provide something equivalent to as.data.frame(fit)). And using pandas might well provide a more pleasant user (and developer) experience because it allows PyStan to use tabular data with named columns.

Pandas is a serious dependency (9MB compressed source, 14MB binary wheel). It does have a stable API and binary wheels are available for every platform. (When PyStan was started, things were changing a bit faster and binary wheels did not exist.)

A hard dependency on pandas is a serious change. If anyone has objections or strong feelings about this, please post something.


#2

No strong opinion, but I would be surprised if there were that many PyStan users who did not already have Pandas installed anyway.


#3

I do like to work with dataframes.

The raw matrices are still going to be NumPy arrays? (Another option is xarray, which is essentially NumPy arrays with pandas-style indexing.)


#4

xarray might be a great compromise! It’s far smaller than pandas and
is pure Python, I think.


#5

Minimizing dependencies is always preferable when possible, especially with respect to huge, fluid, hard-to-install packages. The problems arise when the package on which you depend changes something, as we keep running into with Boost and Eigen releases, and those are relatively stable packages.

(I have zero opinion on the particular case of Python and Pandas, not being a user of either.)


#6

I agree that minimizing dependencies is really important.

I think we may be able to do what we did with matplotlib: raise an
exception if pandas is not installed.
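
A minimal sketch of that pattern, not the actual PyStan code: `require_optional` and `draws_to_frame` are hypothetical names made up for illustration. The import is attempted only when the pandas-specific feature is called, so pandas stays an optional dependency.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency, or raise an ImportError that says why.

    Hypothetical helper: the name and the message format are assumptions.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with `pip install {module_name}`."
        ) from exc


def draws_to_frame(draws, column_names):
    """Hypothetical pandas-only feature: wrap a 2-D array in a DataFrame."""
    # pandas is imported lazily, so everything else works without it.
    pd = require_optional("pandas", "draws_to_frame()")
    return pd.DataFrame(draws, columns=column_names)
```

Users without pandas only see the error if they call `draws_to_frame`; importing the package itself never fails.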


#7

"I think we may be able to do what we did with matplotlib: raise an
exception if pandas is not installed."

I don’t know much about the intended design of PyStan 3, but would it be too much overhead to have two functions, one for NumPy structured arrays and another for pandas, like TensorFlow's tf.estimator.inputs.numpy_input_fn and tf.estimator.inputs.pandas_input_fn? (Maybe in such a way that the latter checks that pandas is installed and then calls the former.)
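
Roughly what I have in mind, as a sketch only: the function names `data_from_structured_array` and `data_from_frame` are invented here and are not part of PyStan or TensorFlow. The pandas variant checks for pandas, converts the frame with `DataFrame.to_records`, and delegates to the NumPy variant.

```python
import numpy as np


def data_from_structured_array(records):
    """Turn a NumPy structured array into a dict of plain 1-D arrays.

    Hypothetical function: the dict-of-arrays return type is an assumption.
    """
    if records.dtype.names is None:
        raise TypeError("expected a structured array with named fields")
    return {name: np.asarray(records[name]) for name in records.dtype.names}


def data_from_frame(frame):
    """Same as above for a pandas DataFrame; pandas is needed only here."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError("data_from_frame() requires pandas") from exc
    if not isinstance(frame, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")
    # Convert to a record array (dropping the index) and reuse the NumPy path.
    return data_from_structured_array(frame.to_records(index=False))
```

That way all the real logic lives in the NumPy function, and the pandas function is a thin wrapper.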


#8

That’s exactly the approach I think will be best. fit.values will
return a NumPy array with shape (num_chains, num_draws,
num_flat_params). fit.to_frame() will return a pandas DataFrame and
fit.to_panel() will return a pandas Panel.
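
To make the shapes concrete, here is a rough illustration of how a (num_chains, num_draws, num_flat_params) array can be flattened into table columns. This is a sketch under assumptions, not the PyStan 3 implementation: `flatten_draws`, the `chain`/`draw` column names, and the parameter names `mu` and `sigma` are all made up.

```python
import numpy as np


def flatten_draws(values, param_names):
    """Flatten (num_chains, num_draws, num_flat_params) into named columns.

    Hypothetical helper; returns a dict of equal-length 1-D arrays.
    """
    num_chains, num_draws, num_params = values.shape
    if len(param_names) != num_params:
        raise ValueError("need one name per flat parameter")
    # Row c * num_draws + d holds draw d of chain c (C-order reshape).
    flat = values.reshape(num_chains * num_draws, num_params)
    columns = {
        "chain": np.repeat(np.arange(num_chains), num_draws),
        "draw": np.tile(np.arange(num_draws), num_chains),
    }
    for j, name in enumerate(param_names):
        columns[name] = flat[:, j]
    return columns


# Toy input: 2 chains, 3 draws, 2 flat parameters.
values = np.arange(12, dtype=float).reshape(2, 3, 2)
table = flatten_draws(values, ["mu", "sigma"])
```

If pandas is installed, `pandas.DataFrame(table)` turns the result into a DataFrame with one row per (chain, draw) pair.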

There will be a working demo out shortly. Just finishing it up now.


#9

Now that I can post in this topic, I will mostly just give a +1 to the ideas you are already discussing. AFAICT, everything depends on pandas already, so it does not feel like a big requirement. Your suggestion that the interop be available whenever pandas is installed sounds wonderfully accommodating.

Most of the Python code I have for working with Stan involves formatting dataframes to put into my model and converting the results of a fit into dataframes to analyze the result. If PyStan had functions that helped with that, many of the bugs I made in my first week with Stan would have been avoided, especially around the differences in output from fit vs. optimize vs. ADVI.