Python integration for Stan (in the style of R's ulam)

I’ve been quite impressed by the ulam R package, which is used in the (excellent) Statistical Rethinking online lectures. However, I don’t like R very much, so I usually end up using Python and writing my models directly in Stan. The only problem with writing raw Stan is that it forgoes what a more restricted language could give me (and which I often want): having the model automatically generate the log likelihood for each datapoint (for LOO validation) and posterior predictive samples (for PPC). I’ve decided to implement a way of specifying high-level models in Python. The end goal is to add easy support for missing data (with data imputation, of course), automatic data generation for PPC, and log likelihood calculation for LOO. I came up with this:

from ulam import *

with UlamModel() as m:
    # Should go to the data block
    N = m.data('N', m.integer())
    x = m.data('x', m.vector(N), missing_data=True)
    y = m.data('y', m.vector(N), missing_data=True)

    # Should go to the parameters block
    intercept = m.parameter('intercept', m.real())
    slope = m.parameter('slope', m.real())
    error = m.parameter('error', m.real(lower=0))

    # Should go to the generated quantities block
    log_lik = m.generated('log_lik', m.vector(N))
    y_hat = m.generated('y_hat', m.vector(N))

    # Instead of relying on Stan's vectorization, I've decided to be explicit
    # regarding iteration using indices. This makes it easy to manipulate
    # the program in order to properly handle missing data.
    for i in m.range('i', 1, N):
        # The `^` (xor) operator represents a sampling statement:
        # -------------------------------------------------------
        # y[i] ~ normal(x[i] * slope + intercept, error)
        y[i] ^ m.normal(x[i] * slope + intercept, error)

    for i in m.range('i', 1, N):
        # The `<<` (left shift) operator represents the stan assignment operator (=)
        # --------------------------------------------------------------------------
        # log_lik[i] = normal_lpdf(y[i], x[i] * slope + intercept, error)
        log_lik[i] << m.normal_lpdf(y[i], x[i] * slope + intercept, error)

    for i in m.range('i', 1, N):
        # y_hat[i] = normal_rng(x[i] * slope + intercept, error)
        y_hat[i] << m.normal_rng(x[i] * slope + intercept, error)

Unfortunately, due to Python’s limited reflection capabilities (limited if you want to keep things sane; if you’re willing to go a bit insane, then Python’s reflection capabilities are very strong indeed!), I have to specify the names of the variables in the functions that create them.
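To make the snippet above a bit less magical, here is a minimal sketch of the kind of machinery behind it (the internals and attribute names are made up for illustration; this is not the actual implementation). It also shows why the string names are unavoidable: the returned node has no way of knowing which Python variable it will be bound to.

class UlamModel:
    """Sketch only: collects variable declarations and recorded statements."""
    def __init__(self):
        self.variables = []
        self.statements = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def data(self, name, type_):
        # The Stan identifier must be passed as a string: the node cannot
        # see the name of the Python variable it gets assigned to.
        var = Var(self, name, type_, block='data')
        self.variables.append(var)
        return var


class Var:
    def __init__(self, model, name, type_, block):
        self.model, self.name, self.type_, self.block = model, name, type_, block

    def __getitem__(self, index):
        # x[i] builds an indexing expression node instead of fetching a value.
        return Indexed(self, index)


class Indexed:
    def __init__(self, var, index):
        self.var, self.index = var, index

    def __xor__(self, dist):
        # y[i] ^ m.normal(...) records a sampling statement (Stan's `~`).
        self.var.model.statements.append(('sample', self, dist))

    def __lshift__(self, expr):
        # log_lik[i] << ... records an assignment statement (Stan's `=`).
        self.var.model.statements.append(('assign', self, expr))

(The arithmetic operators, parameter()/generated(), m.range() and the distribution nodes are omitted here for brevity.)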

Currently, the code above already generates a valid AST of Stan statements (reimplemented in Python), which with some effort can be split into blocks (although this isn’t fully implemented yet). Distributing the variables among the blocks is very easy, because the functions (data(), generated(), etc.) tag the variables with the correct block. The problem is distributing statements, which are not explicitly tagged with the right blocks. I’ll have to perform some kind of dependency analysis.
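To give a concrete idea of what I mean by that (a sketch of the direction only, not working code from the project), the simplest rule assigns each statement to a block based on the variable it writes to; the harder part is the dependency analysis needed for assignments to intermediate variables:

def target_block(statement):
    """Very naive block assignment (sketch). `statement` is the
    ('sample' | 'assign', lhs, rhs) triple recorded by the model."""
    kind, lhs, _rhs = statement
    if kind == 'sample':
        # Sampling statements contribute to the log density,
        # so they always belong in the model block.
        return 'model'
    if lhs.var.block == 'generated':
        # Assignments to generated quantities (log_lik, y_hat, ...) go there.
        return 'generated quantities'
    # Everything else is where the real dependency analysis is needed: an
    # assignment that only reads data can go to transformed data, one that
    # reads parameters to transformed parameters, and so on.
    return 'transformed parameters'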

Anyway, what do you think of the high-level python interface? Any suggestions? I’m pretty happy with the high-level interface and I’m thinking of focusing on dependency analysis to distribute the statements throughout the blocks.


Tagging a few Python people who might have thoughts about this: @mitzimorris @WardBrian @ahartikainen @ariddell

How does this compare to PyMC? And are you familiar with the YAPS project (no longer under development, afaik)?

also tagging @Bob_Carpenter

The main difference is that this attempts to generate Stan code instead of using the Theano backend that PyMC uses. I never could get PyMC to work with practical examples (it took ages to compile and ages to run), so I didn’t explore it very deeply. But there are certainly some similarities between PyMC and what I’m doing here. The main advantage of Stan is that it makes it easier to handle vector/array elements individually instead of being at the mercy of the sometimes mystical “vectorization gods”. In fact, the way Stan lets you handle individual array elements with for loops allows you to support things such as missing data in a very simple and understandable way.

I was not aware of it; I’ll look into it.


YAPS is nice, but it does run afoul of something @tmbb mentioned:

YAPS used the Python ast module to ‘go a bit insane’. This is not very portable across Python versions, but it is basically the only Python PPL embedding I’ve seen that avoids the double specification of variable names.
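Roughly, the trick looks something like this (just a sketch of the general idea, not how YAPS actually implements it): grab the source line of the call site and parse it with the ast module to recover the name on the left-hand side of the assignment.

import ast
import inspect

def infer_assigned_name(depth=2):
    """Best-effort guess of the assignment target at the call site, e.g. the
    'N' in `N = m.data(m.integer())`. depth=2 assumes this is called from
    inside a method like m.data(). Returns None when the source is not
    available (e.g. a REPL) or the call site is not a simple assignment."""
    frame_info = inspect.stack()[depth]
    if not frame_info.code_context:
        return None
    try:
        tree = ast.parse(frame_info.code_context[0].strip())
    except SyntaxError:
        # Multi-line statements and other oddities defeat this naive parse.
        return None
    stmt = tree.body[0] if tree.body else None
    if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
        return stmt.targets[0].id
    return None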

Yes, I’d love for this to be a legitimate approach, but Python doesn’t seem to like AST manipulations in the style of Lisp or Elixir, for example. I already have something which is pretty functional, and the only question is how to assign the expressions and variables to the right blocks. I’ll probably have to do something similar to what SlicStan does.

Currently, the following python code:

from ulam import *

with UlamModel() as m:
    # Should go to the data block
    N = m.data('N', m.integer(lower=0))
    x = m.data('x', m.vector(N))
    y = m.data('y', m.vector(N))

    # Should go to the parameters block
    intercept = m.parameter('intercept', m.real())
    slope = m.parameter('slope', m.real())
    error = m.parameter('error', m.real(lower=0))

    # Should go to the generated quantities block
    log_lik = m.generated('log_lik', m.vector(N))
    y_hat = m.generated('y_hat', m.vector(N))

    # Instead of relying on Stan's vectorization, I've decided to be explicit
    # regarding iteration using indices. This makes it easy to manipulate
    # the program in order to properly handle missing data.
    for i in m.range('i', 1, N):
        # The `^` (xor) operator represents a sampling statement:
        # -------------------------------------------------------
        # y[i] ~ normal(x[i] * slope + intercept, error)
        y[i] ^ m.normal(x[i] * slope + intercept, error)

        # The `<<` (left shift) operator represents the stan assignment operator (=)
        # --------------------------------------------------------------------------
        # log_lik[i] = normal_lpdf(y[i], x[i] * slope + intercept, error)
        log_lik[i] << m.normal_lpdf(y[i], x[i] * slope + intercept, error)
        
        # y_hat[i] = normal_rng(x[i] * slope + intercept, error)
        y_hat[i] << m.normal_rng(x[i] * slope + intercept, error)

generates the following Stan code (actually it generates a Python representation of a subset of Stan’s AST):

int<lower=0> N;
vector[N] x;
vector[N] y;
real intercept;
real slope;
real<lower=0> error;
vector[N] log_lik;
vector[N] y_hat;

for (i in 1:N) {
    y[i] ~ normal(x[i] * slope + intercept, error);
    log_lik[i] = normal_lpdf(y[i] | x[i] * slope + intercept, error);
    y_hat[i] = normal_rng(x[i] * slope + intercept, error);
}

That Stan program is of course invalid because the variables are not correctly divided among the blocks. The correct version would be:

data {
    int<lower=0> N;
    vector[N] x;
    vector[N] y;
}

parameters {
    real intercept;
    real slope;
    real<lower=0> error;
}

model {
    for (i in 1:N) {
        y[i] ~ normal(x[i] * slope + intercept, error);
    }
}

generated quantities {
    vector[N] log_lik;
    vector[N] y_hat;

    for (i in 1:N) {
        log_lik[i] = normal_lpdf(y[i] | x[i] * slope + intercept, error);
        y_hat[i] = normal_rng(x[i] * slope + intercept, error);
    }
}

I intend to add missing variable imputation later. Ideally I’d also like to automatically marginalize discrete parameters, but that seems very hard to do in an efficient way.


I have implemented missing data imputation for data which is MCAR (missing completely at random).

The next step would be to figure out how to automatically simulate observed data (e.g. y_hat and maybe x_hat) and how to calculate the log likelihood for automatic cross-validation (with something like LOO, for example). I’m willing to have the user specify which variables should be predicted and which variables should be used for the LOO testing, but I’d like Ulam to automatically generate the quantities from the sampling statements.
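One way I can imagine doing this (a sketch only; the .lpdf() and .rng() helpers below are hypothetical and don’t exist yet) is to walk over the recorded sampling statements and emit the matching _lpdf and _rng assignments for the variables the user asked about:

def derive_generated_quantities(sample_stmt, log_lik, y_hat, i):
    """Sketch: given the recorded statement for y[i] ~ normal(mu, sigma),
    emit log_lik[i] = normal_lpdf(y[i] | mu, sigma) and
    y_hat[i] = normal_rng(mu, sigma)."""
    _kind, lhs, dist = sample_stmt      # ('sample', y[i], normal(mu, sigma))
    log_lik[i] << dist.lpdf(lhs)        # reuse the distribution's parameters
    y_hat[i] << dist.rng()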

The main question for a project like this is always how far one should go:

  • Should we attempt to support something which basically amounts to transpiling Python to Stan? Probably not, as that would prevent us from doing things such as handling missing data automatically… If you want to write Stan code in Python, why not just write it in Stan from the beginning?

  • Do we want to support a “graphical language” for probabilistic models? A “graphical language” is actually not a very well-defined concept, but it’s an expression which is often used to contrast Stan with things like BUGS or PyMC. One thing I don’t like in most graphical languages is that they make it unnecessarily hard to do things which are trivial in a loop. This is why I’ve decided to treat for loops as a first-class feature instead of trying to magically vectorize everything. Having access to an index makes it easy to handle ragged data structures, as explained in the Stan manual and as sketched in the example below this list. One can think of each vector and array as a database column and each index as the id of a database row, which is a very powerful way to model data.
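As an illustration of that last point (a sketch only: the integer-array type below is hypothetical and not in the current prototype), here is what a varying-intercept model with a group column could look like, where group[i] maps each row to its group id:

from ulam import *

with UlamModel() as m:
    N = m.data('N', m.integer(lower=0))
    J = m.data('J', m.integer(lower=1))                    # number of groups
    # Hypothetical integer-array type: one group id per row, database-style.
    group = m.data('group', m.integer_array(N, lower=1))
    y = m.data('y', m.vector(N))

    mu = m.parameter('mu', m.vector(J))                    # one intercept per group
    sigma = m.parameter('sigma', m.real(lower=0))

    for i in m.range('i', 1, N):
        # Each row looks up "its" parameter through the group column.
        y[i] ^ m.normal(mu[group[i]], sigma)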

I would say that it does not make sense to recreate PyMC.

I think what we are missing in Stan are the easily accessed user defined functions that do these steps. So maybe a library containing such functionality, which could be added via #include, would benefit everyone.

I think what we are missing in Stan are the easily accessed user defined functions that do these steps.

I don’t understand what you’re suggesting. Is it possible to automatically generate samples for posterior predictive plots from inside Stan? I don’t believe so… The Stan modeling language seems too unconstrained for that. I believe the goal of these “mini-languages” that compile to Stan should be to bring some additional functionality for the common and simple cases.

You can easily do this in Stan - see the Posterior Predictive Sampling chapter (27) of the Stan User’s Guide; sections 4-7 contain several examples.