Bob’s response got to some of the things I was talking about, but I think he’s still more focused on Bayesian inference than I am here.

To start, I want to convey a very simple notion, which I don’t think is clear to most Stan users or potential Stan users, which is that the Stan modeling language is a way to write an R function that allows fast and convenient computation of a target and its gradient. The idea is that, in R, writing a Stan program and compiling it gives us an R function, foo(). It’s very much like old-school R where you could write a function in C or Fortran, compile it, and then access it like an R function.

I feel that if that notion is conveyed, then people will be able to use Stan more flexibly. The point is that the things that Pytorch etc can do, can also be done by Stan. But I don’t think people realize this. I think they perceive Stan as a black box that does one thing, Bayesian inference, and not as a computational tool. That’s what I mean when I say that Stan can be more Pythonic or Rthonic in this way: I’m not talking about the details of the modeling language but rather the way that a compiled Stan program can be viewed as an encapsulated function that returns target and gradient. Just like existing user-defined R functions, where you don’t need to know what’s inside the function, you can treat it as a black box.

Downloads during last month from RStudio CRAN mirror (one of 100+ mirrors)

rstan 47k

rjags 17k

R2WinBUGS 5k

nimble 1k

There are probably JAGS and WinBUGS users who don’t use RStudio or R, but I believe RStan has passed the others in users. Downloads during last month from PyPI

PyStan 511k

I guess there are less mirrors for PyPI and I have an impression that Python users install everything more often than R users which inflates downloads.

As someone whose made a lot of use of the capacity Andrew speaks of, yes it’s true the capacity exists, but there are some not insubstantial hurdles posed by the current stanfit object (within R at least). I just went through a lot of drama getting optimization to work in parallel for instance (splitting data over subjects then combining gradients / log probs at each iter), which I expect? other software which is built for the purpose would avoid. Looking forward to any improvements in this domain with stan3 :)

I’m going to assume you’re not pranking me here. I tried to express my reservations during the roadmap meeting and was told I was being difficult. I brought it up on the proposed roadmap post on discourse (since delisted) and was told I should’ve gone to the beach house if I wanted to be heard. I’ve had long discussions with @bgoodri and @ariddell in the past where I’ve been told my proposals weren’t R-like or Pythonic, which is why I wasn’t particularly keen to spend a weekend away from home to just lose the same battle once again. This doesn’t block my applied work now that I understand how it works, so it’s not that big of a deal to me, especially as I don’t do a lot with RStan or PyStan myself, especially at this level.

Anyway, to summarize, my main objection is that it’s used for too many things, most of which leave it in a partial state. Here’s some cases off the top of my head:

holding the arguments, output, and intermediate products of a call to `stan(), including config, generated random seeds, a link to the model, a link to the enviornment, the draws, features of the C++ config, time stamps, timing, adaptation stepsize and metric, the name of the file in which the model was saved if it was saved, and so on; it even contains summary statistics like means by default

gathering external output to evaluate using read_stan_csv(), in which case everything but the draws seems irrelevant

creating a log density function to use by running for zero iterations, in which case sampler config and draws are irrelevant; I’d much prefer a way to create a log density function from a compiled model and data

as a holder for a model to rerun (ther’s a fit argument in rstan, which if supplied, pulls the model out of the fit object somehow to reuse for the current run), potentially on new data with new config, so only the model link is relevant

Finally, how do we create a stanfit object when we’re able to restart sampling? Will it be mutable or do we create a copy of the data? Is it going to be used for standalone generated quantities? Which additional fields will this need? What about Laplace approximations from optimization? Or approximate draws from ADVI?

Anyway, here’s the full structure of a stanfit object in R in rstan 2.19.2.

This is perhaps somewhat orthogonal to the initial use case that Andy had in mind, but I have two cents to add from a[n economist] practitioner’s perspective:

The idea that “Stan is viewed as a Bayesian inference engine with a modeling language attached to it” has the consequence that it is not optimized for non-Bayesian (but still probabilistic) programming that it could in principle be very helpful for (e.g. comparing HMC against other estimators, like EM).

In most cases, implementing another estimator requires either coding it up natively in (e.g. R/Python) entirely, or exposing a stan function to an optimizer that interfaces with [R/Python/etc.]. My understanding is that this is intentional: part of the goal of Stan is to encourage good practice and question-raising when Bayesian estimation isn’t workable. Still, having access to industry grade optimizers as function calls within Stan, would certainly improve my life.

Apologies if this is too orthogonal to the thread’s aim.

I’d be thrilled if people would routinely use Stan when they’re doing maximum likelihood. EM could be more challenging–I think we’d want to do that using GMO, which hasn’t been cleanly implemented yet. Also you can use Stan for plain old operations research optimization problems, where you just write the target function directly as a Stan program.

I think that’s part of the issue, that users don’t realize that if you write a Stan program in R and then compile it, you’ve created an R function that computes a target function and its gradient, automatically. I think that’s been lost in all the Bayesian stuff.

Re: doing plain old OR optimization in Stan – I think part of the issue is that necessarily using the target as the objective for maximization makes some of the nice things about Stan clunky. It would work only if you didn’t invoke anything that contributes to the likelihood (“behind the scenes”) automatically.

That’s not a deal breaker, but say I wanted to run a GMM estimator with simulated moments (e.g. moment conditions that are composed with random draws) – this kind of thing is done often in economics where the moments are based on economic models of the DGP. If I were to generate draws with the model block, that would contribute to the target (unless I’m mistaken?). Not the end of the world, but it can get confusing quickly. What would be really cool/helpful (though I don’t know if valuable enough from Stan’s perspective) is if it were possible to build a standalone ‘target’ that can be outputted as Stan function or as a generated quantity.

Re: using stan functions in R – this is something that comes up a lot in economics since the goal of estimating DGPs is often to predict ‘counterfactuals’. Being able to access Stan data structures and functions (generally through expose_stan_function in R) is super helpful for this both because they’re just good and because this lets me reuse code/structures (which among other things, helps to avoid errors).

But I’m generally not using the target/gradient when doing this – I’d run maximum likelihood in Stan if I wanted to optimize that. Maybe I’m falling prey to exactly the problem you’re raising?

The title of the thread is a red herring that was triggered by a student that happened to be familiar with PyTorch. I don’t think there is much to say here about what is more Pythonic or Rthonic or whether that is better or worse but if so, it should be said by people who have written at least a little bit of code in both Python and R.

If Andrew had just said “It should be easier to evaluate the log-kernel and its gradient from Stan interfaces”, I think that would have been non-controversial. Indeed, we have been planning to do that for RStan since at least 2015, but it is not backward compatible with the 2013 RStan design where calling things like log_prob was an afterthought (and not a particularly well thought out afterthought).

Although we should make it easier to evaluate the log-kernel and its gradient from Stan interfaces, I don’t think it will make a big effect. It has always been possible to do this in RStan , I gave a presentation about it at the R conference in Brussels, Dustin tried to go down this road when implementing GMO, etc. and it hasn’t caught on for reasons that I think are deeper than it is currently clunky to do in RStan and PyStan:

For people who want to do maximum likelihood, having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language. In most cases where someone wants to do maximum likelihood, the gradient and Hessian have already been worked out.

For people who want to do what Andrew was doing in his class — finding the mode in a toy problem without implementing the gradient and Hessian — having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language because numerical gradients are good enough and allow you to write the log-likelihood function in the interface language.

For people who want to do supervised learning, having autodiffed gradients is not enough of an inducement to write the objective function in the Stan language because they want to implement specialized optimization algorithms that scale to problems with terrabytes of data

95% of people that Stan — and the percentage is even higher among people who are first learning Stan — do not recognize a Stan program as defining a function from the things declared in the parameters block to the real numbers that is defined by a log-kernel. This confusion is exacerbated by people continuing to write (and teach) the theta ~ normal(mu, sigma); notation rather than the target += normal_lpdf(theta | mu, sigma); notation, but even those of us who teach the latter notation find it difficult to communicate to students that a Stan program defines a function because the function it defines does not look like a function in R, Python, or whatever interface. When most people think of a function, they think of something that looks like (in R)

When I try to tell people that a Stan program defines a function they ask “WTF are the arguments to this function and what is returned?”, which are non-obvious unless you look at the generated C++ code.

The only case where ideas like this have caught on a little bit (in R; it hasn’t been implemented for Python or Julia yet) is with rstan::expose_stan_functions, in which case the Stan code implementing the function actually looks like the above normal_pdf function.

We should talk (also with Ben) about your specific models that you’re working with, to see if these can be done in Stan. The short answer is that it is easy to compute a standalone target function, as this is what compiling a Stan model does arleady.

I’m in the “teach target += first” camp and it has been very successful in my courses. I’ve also had no problems communicating the “a Stan program defines a function” concept, but at the same time I get to that only after four hours of background on statistics and computation, so that abstraction has already been introduced and reinforced. But I definitely agree that a tighter UX/pedagogical connection will continue to improve this, and trying to turn Stan into PyTorch will not.

Hi, just to clarify: I don’t think anyone was talking about turning Stan into Pytorch. At least, I wasn’t talking about that, as I don’t even know what Pytorch is! What I was talking about is what Ben was saying too, that it’s not clear to people that a compiled Stan program gives you an R (or Python) function that computes a target function, and that calling this R (or Python) function gives you the function and its gradient.

This is exactly where @andrewgelman was going with this.

This is something we want to add. We’re only just adding 1D integrators and algebraic solvers, so it’s slow going to get the right algorithms that can be differentiated. Is there a feature request somewhere that you detail what you want? One of the other problems we run into is imagining what people want. So concreteness helps, as in "it would be great if Stan had functin foo(a, b, c) that was called with these arguments and returned these results.

We haven’t spent a lot of time thinking about how to make Stan easier to program for programmers experimenting with algorithms. Being in C++ Is a hurdle, but just having log density and gradient on the outside isn’t enough.

It’s primarily lack of resources. If we could snap our fingers and have something easy to use for algorithm development, we’d do it! We’d probably even be the main consumers of those utilities.

We built Metropolis and ensemble methods, but couldn’t find a use case where they beat NUTS, so we never merged them. You’re right that our goal isn’t to become something like Scikit.learn or Weka with a comprehensive range of inference tools on hand for comparison projects. We’re only going to add something if it’s better than what we have for some sizable set of models that makes it worth including and maintaining it.

HMC vs. EM is easy in Stan if you can code the E step (marginalization) in the Stan program. We do that for a bunch of discrete parameter models in the user’s guide: mixtures, HMMs, change-points, Cormack-Jolly-Seber mark-recapture, etc. In these cases, running our optimize functions does EM because the model computes the E step and optimize does the M step.

Nothing contributes to the log density behind the scenes for optimization. For sampling, the log Jacobians of constrained variable transforms are added to the target log density, but not for optimization.

Yes. But if it’s decoupled, then it won’t necessarily affect other inferences. In those cases, we can usually move the implementation to generated quantities.

Do you have an example of how that’d work? The only thing that adds to target is ~, because a ~ foo(theta); is synonymous (up to constants) with target += foo_lpdf(a | theta);. But as you point out, there’s no way to induce randomness other than through the log density in Stan.

Are you using tools like TensorFlow or PyTorch to do this kind of thing now? They’re made for playing around with algorithms in Python.

Yes, but that’s what they mean when they say “Pythonic”—that it’s familiar. We’re not going to be turning Stan itself into something that looks Pythonic, though groups have built wrappers around Stan that do, the same way PyMC3 is a wrapper around Theano.

Maybe not controversial, but not actionable until we learn what’s wrong with the current way of doing it. I think @andrewgelman wants something in R that looks more like the behavior of random variables in Stan programs. That’s orthogonal to making these algorithms easier.

I don’t see why that’d be any different for maximum likelihood than MCMC. If you want to experiment with a bunch of models, you don’t want to be deriving gradients and Hessians by hand. I think the bigger drawback is that we don’t have automatic marginal optimization methods, so we can’t just suggest people replace lme4 with Stan’s optimization.

Here, the student used TensorFlow (or PyTorch or whatever they used) rather than implement finite differences.

We’re definitely not competitive in that market.

My original conception is that a Stan program is a function:

In C++ terms, the data is given to the constructor, so f(y) behaves like a functor. In lambda-calculus terms, this is curried to take the entire joint log density function f or the posterior log density function f(y).

The signature for the data and parameters arguments are given in their respective blocks and the model block computes the log density. If you start from there, it might be simpler to convey. That’s how I originally conceived and built it. The transformed data and parameters are just placeholders, and the generated quantities is just for posterior predictive generation. You can write all that down functionally, too, and if you do, it follows the way it’s implemented in C++.

Those functions can be anything, so I think going down that route to explain Stan will be confusing.

Right. It’s just a lightweight wrapper around the C++ services so we can do I/O.

How do you suggest we make it clearer? The only thing I don’t like about it is that you have to run sampling to get data associated with an model for sampling. I indicated above how I’d like to see it go with a function to compile a Stan program to a joint density, a function to take a joint density and data and return a posterior density, then something that took that posterior density and ran sampling on it. Right now, all these steps are part of a single function stan(), and there’s not a good way to get the posterior density function with only data—you need to run a sampler for 0 iterations.

This is how I teach it (after first teaching the joint model and how the joint density reduces to something proportional to the posterior density) and it has been successful. Oddly enough it helps to teach what the language was designed for and how it works behind the scenes instead of trying to bootstrap off of existing approaches to teaching Bayes!

Technically the curry doesn’t give the log posterior density but rather something equal up to a constant. That’s fine for the current algorithms, which all require just something in that equivalence class of log densities, but not for anything that requires normalization (like bridge sampling and other external algorithms). I’m not sure if there’s a form of currying that yields an equivalence class of functions (or a member of that equivalence class) instead of a particular function, but that would be the terminology to use.

Right. The only thing we need is that the model expresses the log density of the desired sampling distribution up to a constant. That’s usually a Bayesian log posterior density defined up to an additive constant through a oint density defined up to an additive constant. But if you unfold things like conjugacy, you don’t necessarily ever have to code the full joint density.

shoshievass:
What would be really cool/helpful (though I don’t know if valuable enough from Stan’s perspective) is if it were possible to build a standalone ‘target’ that can be outputted as Stan function or as a generated quantity.

I’d like to be able to do this: foo <- stan_compile(“foo.stan”) and this would create foo(), an R function that takes data and parameters as arguments and returns the target and its gradients. To me, this would make Stan much more transparent and useful for uses beyond simply running Nuts for Bayesian inference.

bgoodri:
If Andrew had just said “It should be easier to evaluate the log-kernel and its gradient from Stan interfaces”, I think that would have been non-controversial.

Maybe not controversial, but not actionable until we learn what’s wrong with the current way of doing it. I think @andrewgelman wants something in R that looks more like the behavior of random variables in Stan programs. That’s orthogonal to making these algorithms easier.

There are two different things I’m asking for. One thing is to be able to extract summaries or simulations from fitted Stan models that look like lists whose elements have the size and shape of the corresponding parameters in the Stan model. A completely different thing is to be able to consider a compiled Stan model as an R (or Python) function that returns target and gradient. I think the latter will make Stan more Pythonic or Rthonic. To me, it’s not about the syntax looking the same (although I agree that will help) but rather about having that functionality, so that it’s clear how to use Stan as a convenient way for an R or Python user to compute a target function and tis gradient.

bgoodri:
For people who want to do maximum likelihood, having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language. In most cases where someone wants to do maximum likelihood, the gradient and Hessian have already been worked out.

I don’t see why that’d be any different for maximum likelihood than MCMC. If you want to experiment with a bunch of models, you don’t want to be deriving gradients and Hessians by hand.

That’s right.

I think the bigger drawback is that we don’t have automatic marginal optimization methods, so we can’t just suggest people replace lme4 with Stan’s optimization.

GMO!

bgoodri:
those of us who teach the latter notation find it difficult to communicate to students that a Stan program defines a function because the function it defines does not look like a function in R,

But this is my point! With foo <- stan_compile(“foo.stan”), or something similar, foo() will look like a funciton in R!

My original conception is that a Stan program is a function:

In C++ terms, the data is given to the constructor, so f(y) behaves like a functor. In lambda-calculus terms, this is curried to take the entire joint log density function f or the posterior log density function f(y).

The signature for the data and parameters arguments are given in their respective blocks and the model block computes the log density. If you start from there, it might be simpler to convey. That’s how I originally conceived and built it. The transformed data and parameters are just placeholders, and the generated quantities is just for posterior predictive generation. You can write all that down functionally, too, and if you do, it follows the way it’s implemented in C++.

andrewgelman:
it’s not clear to people that a compiled Stan program gives you an R (or Python) function that computes a target function, and that calling this R (or Python) function gives you the function and its gradient.

How do you suggest we make it clearer? The only thing I don’t like about it is that you have to run sampling to get data associated with an model for sampling.

I’d like to go with your (Bob’s) plan to separate compilation, computation, and model fitting. Compilation creates this foo() function that returns target and gradient, then these can be computed at will. Also, Stan has an optimizer and an ADVi solver and a Nuts solver so it can do those too.

I indicated above how I’d like to see it go with a function to compile a Stan program to a joint density, a function to take a joint density and data and return a posterior density, then something that took that posterior density and ran sampling on it. Right now, all these steps are part of a single function stan(), and there’s not a good way to get the posterior density function with only data—you need to run a sampler for 0 iterations.

I agree 100% that it was a mistake to create this stan() function, and this was in part a result of my lack of imagination and understanding.

@andrewgelman Ok, here’s a crude version of your stan_compile() function, but it should only be used for Stan programs with no parameter constraints because it’s really interpreting params in foo(data, params) as being on the unconstrained space. It may have to wait until Stan3 for a convenient way to do this for arbitrary Stan programs.

# @param file Path to Stan program.
# @return A function that takes arguments ‘data’ (list) and ‘params’ (numeric vector),
# and returns a list containing the target (scalar) and its gradient (dims depends on number of params)
stan_compile <- function(file) {
comp <- rstan::stan_model(file)
function(data, params) {
suppressWarnings({
x <- rstan::sampling(comp, data = data, iter = 1, refresh = 0)
})
out <- list()
grad_and_target <- rstan::grad_log_prob(x, upars = params)
out$target <- attr(grad_and_target, "log_prob")
attr(grad_and_target, "log_prob") <- NULL
out$gradient <- grad_and_target
out
}
}

Demonstration

# will use a string for demo but preferably use a separate file
program <- "
data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
}
parameters {
real theta_logit;
}
model {
target += bernoulli_logit_lpmf(y | theta_logit);
}
"
my_file <- tempfile(fileext = ".stan")
writeLines(program, my_file)
# create foo()
foo <- stan_compile(my_file)
# data to pass to foo()
my_data <- list(N = 10, y =c(0,1,0,0,0,0,0,0,0,1))
# call foo with my_data to get target and gradient when theta_logit=0
target_and_gradient <- foo(data = my_data, params = 0)
print(target_and_gradient)
$target
[1] -6.931472
$gradient
[1] -3

Is that what you were envisioning demonstrating in the case study?