I’ve been quite impressed by ulam (part of the rethinking R package), which is used in the (excellent) Statistical Rethinking online lectures. However, I don’t like R very much, so I usually end up using Python and writing my models directly in Stan. The drawback of writing raw Stan is that I lose the benefits of a more restricted modelling language (which I’m often happy to work within): with a restricted language, the model can automatically generate the log likelihood for each data point (for LOO validation) and posterior predictive samples (for PPC). So I’ve decided to implement a way of specifying high-level models in Python. The end goal is to add easy support for missing data (with data imputation, of course), automatic data generation for PPC and log likelihood calculation for LOO. I came up with this:
from ulam import *

with UlamModel() as m:
    # Should go to the data block
    N = m.data('N', m.integer())
    x = m.data('x', m.vector(N), missing_data=True)
    y = m.data('y', m.vector(N), missing_data=True)

    # Should go to the parameters block
    intercept = m.parameter('intercept', m.real())
    slope = m.parameter('slope', m.real())
    error = m.parameter('error', m.real(lower=0))

    # Should go to the generated quantities block
    log_lik = m.generated('log_lik', m.vector(N))
    y_hat = m.generated('y_hat', m.vector(N))

    # Instead of relying on Stan's vectorization, I've decided to be explicit
    # about iteration using indices. This makes it easy to manipulate
    # the program in order to properly handle missing data.
    for i in m.range('i', 1, N):
        # The `^` (xor) operator represents a sampling statement:
        # -------------------------------------------------------
        # y[i] ~ normal(x[i] * slope + intercept, error)
        y[i] ^ m.normal(x[i] * slope + intercept, error)

    for i in m.range('i', 1, N):
        # The `<<` (left shift) operator represents the Stan assignment operator (=)
        # --------------------------------------------------------------------------
        # log_lik[i] = normal_lpdf(y[i], x[i] * slope + intercept, error)
        log_lik[i] << m.normal_lpdf(y[i], x[i] * slope + intercept, error)

    for i in m.range('i', 1, N):
        # y_hat[i] = normal_rng(x[i] * slope + intercept, error)
        y_hat[i] << m.normal_rng(x[i] * slope + intercept, error)
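The `^` and `<<` notation works through ordinary operator overloading. As a purely illustrative sketch (the class names are made up and it records statements as strings for brevity, whereas the real implementation builds AST nodes), the idea is roughly:

class Expr:
    """Hypothetical wrapper around a piece of the model being built."""
    def __init__(self, model, text):
        self.model = model
        self.text = text

    def __xor__(self, dist):
        # y[i] ^ normal(...)  ->  record "y[i] ~ normal(...)"
        self.model.statements.append(f"{self.text} ~ {dist.text};")

    def __lshift__(self, value):
        # log_lik[i] << normal_lpdf(...)  ->  record "log_lik[i] = normal_lpdf(...)"
        self.model.statements.append(f"{self.text} = {value.text};")

class Model:
    def __init__(self):
        self.statements = []

m = Model()
Expr(m, "y[i]") ^ Expr(m, "normal(mu, sigma)")
Expr(m, "log_lik[i]") << Expr(m, "normal_lpdf(y[i] | mu, sigma)")
print(m.statements)
# ['y[i] ~ normal(mu, sigma);', 'log_lik[i] = normal_lpdf(y[i] | mu, sigma);']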
Unfortunately, due to Python’s limited reflection capabilities (limited if you want to keep things sane, that is; if you’re willing to go a bit insane, Python’s reflection capabilities are very strong indeed!), I have to pass the variable’s name as a string to the functions that create it.
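For the curious, the “insane” route would be something like inspecting the caller’s stack frame and parsing its source line to recover the name being assigned to. A deliberately fragile sketch (purely illustrative, not what the library does; it breaks for multi-line calls, tuple unpacking, the REPL, and so on):

import ast
import inspect

def infer_assigned_name(depth=2):
    """Guess the variable name on the left-hand side of the assignment
    that (indirectly) called us, by parsing the caller's source line."""
    frame = inspect.currentframe()
    for _ in range(depth):  # skip this helper's frame and the API call's frame
        frame = frame.f_back
    context = inspect.getframeinfo(frame).code_context
    if not context:
        return None
    try:
        stmt = ast.parse(context[0].strip()).body[0]
    except SyntaxError:
        return None
    if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
        return stmt.targets[0].id
    return None

class Demo:
    def data(self, typ):
        return infer_assigned_name()

d = Demo()
N = d.data("int")
print(N)  # prints: N (when run from a file)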
Currently, the code above already generates a valid AST of Stan statements (reimplemented in Python), which with some effort can be split into blocks (although this isn’t fully implemented yet). Distributing the variables among the blocks is easy, because the functions (data(), generated(), etc.) tag each variable with the correct block. The problem is distributing the statements, which aren’t explicitly tagged with a block. I’ll have to perform some kind of dependency analysis.
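To make that concrete, here is a rough sketch of the simplest rule I have in mind (the node classes and the block_of map are hypothetical stand-ins, not the actual internals): sampling statements contribute to the log density and so belong in the model block, while an assignment belongs in the block where its target variable was declared.

from dataclasses import dataclass

# Hypothetical, simplified stand-ins for the Python-side Stan AST nodes.
@dataclass
class Var:
    name: str

@dataclass
class SampleStmt:   # produced by `^`, e.g. y[i] ~ normal(...)
    target: Var

@dataclass
class AssignStmt:   # produced by `<<`, e.g. log_lik[i] = normal_lpdf(...)
    target: Var

def target_block(stmt, block_of):
    """Pick a Stan block for a statement based on what it writes to.

    `block_of` maps variable names to the block where they were declared,
    which is already known because data()/parameter()/generated() tag them."""
    if isinstance(stmt, SampleStmt):
        # Sampling statements contribute to the log density: model block.
        return "model"
    if isinstance(stmt, AssignStmt):
        # Assignments live wherever their target variable was declared.
        return block_of[stmt.target.name]
    raise NotImplementedError(type(stmt))

block_of = {"y": "data", "log_lik": "generated quantities",
            "y_hat": "generated quantities"}
print(target_block(SampleStmt(Var("y")), block_of))        # model
print(target_block(AssignStmt(Var("log_lik")), block_of))  # generated quantities

In the example model each loop already contains statements destined for a single block, but in general a loop body could mix blocks and would have to be split or duplicated, which is presumably where the real dependency analysis gets more involved.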
Anyway, what do you think of the high-level Python interface? Any suggestions? I’m pretty happy with it as it stands, and I’m thinking of focusing next on the dependency analysis needed to distribute the statements among the blocks.