Re: doing plain old OR optimization in Stan – I think part of the issue is that necessarily using the target as the objective for maximization makes some of the nice things about Stan clunky. It would work only if you didn’t invoke anything that contributes to the likelihood (“behind the scenes”) automatically.

That’s not a deal breaker, but say I wanted to run a GMM estimator with simulated moments (e.g. moment conditions that are composed with random draws) – this kind of thing is done often in economics where the moments are based on economic models of the DGP. If I were to generate draws with the model block, that would contribute to the target (unless I’m mistaken?). Not the end of the world, but it can get confusing quickly. What would be really cool/helpful (though I don’t know if valuable enough from Stan’s perspective) is if it were possible to build a standalone ‘target’ that can be outputted as Stan function or as a generated quantity.

Re: using Stan functions in R – this is something that comes up a lot in economics, since the goal of estimating DGPs is often to predict ‘counterfactuals’. Being able to access Stan data structures and functions (generally through expose_stan_functions in R) is super helpful for this, both because they’re just good and because it lets me reuse code/structures (which, among other things, helps to avoid errors).

But I’m generally not using the target/gradient when doing this – I’d run maximum likelihood in Stan if I wanted to optimize that. Maybe I’m falling prey to exactly the problem you’re raising?

The title of the thread is a red herring that was triggered by a student who happened to be familiar with PyTorch. I don’t think there is much to say here about what is more Pythonic or Rthonic, or whether that is better or worse, but if so, it should be said by people who have written at least a little bit of code in both Python and R.

If Andrew had just said “It should be easier to evaluate the log-kernel and its gradient from Stan interfaces”, I think that would have been non-controversial. Indeed, we have been planning to do that for RStan since at least 2015, but it is not backward compatible with the 2013 RStan design where calling things like log_prob was an afterthought (and not a particularly well thought out afterthought).

Although we should make it easier to evaluate the log-kernel and its gradient from Stan interfaces, I don’t think it will have a big effect. It has always been possible to do this in RStan; I gave a presentation about it at the R conference in Brussels, Dustin tried to go down this road when implementing GMO, etc., and it hasn’t caught on, for reasons that I think are deeper than it being clunky to do in RStan and PyStan at the moment:

For people who want to do maximum likelihood, having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language. In most cases where someone wants to do maximum likelihood, the gradient and Hessian have already been worked out.

For people who want to do what Andrew was doing in his class — finding the mode in a toy problem without implementing the gradient and Hessian — having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language because numerical gradients are good enough and allow you to write the log-likelihood function in the interface language.

For people who want to do supervised learning, having autodiffed gradients is not enough of an inducement to write the objective function in the Stan language because they want to implement specialized optimization algorithms that scale to problems with terabytes of data.

95% of people who use Stan (and the percentage is even higher among people who are first learning Stan) do not recognize that a Stan program defines a function, from the things declared in the parameters block to the real numbers, given by a log-kernel. This confusion is exacerbated by people continuing to write (and teach) the theta ~ normal(mu, sigma); notation rather than the target += normal_lpdf(theta | mu, sigma); notation, but even those of us who teach the latter notation find it difficult to communicate to students that a Stan program defines a function, because the function it defines does not look like a function in R, Python, or whatever interface. When most people think of a function, they think of something that looks like this (in R):
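A sketch of such a function:

```r
# An ordinary R function: explicit arguments in, a number out.
normal_pdf <- function(x, mu, sigma) {
  exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
}
```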

When I try to tell people that a Stan program defines a function they ask “WTF are the arguments to this function and what is returned?”, which are non-obvious unless you look at the generated C++ code.

The only case where ideas like this have caught on a little bit (in R; it hasn’t been implemented for Python or Julia yet) is with rstan::expose_stan_functions, in which case the Stan code implementing the function actually looks like the above normal_pdf function.

We should talk (also with Ben) about the specific models you’re working with, to see if these can be done in Stan. The short answer is that it is easy to compute a standalone target function, as this is what compiling a Stan model already does.

I’m in the “teach target += first” camp and it has been very successful in my courses. I’ve also had no problems communicating the “a Stan program defines a function” concept, but at the same time I get to that only after four hours of background on statistics and computation, so that abstraction has already been introduced and reinforced. But I definitely agree that a tighter UX/pedagogical connection will continue to improve this, and trying to turn Stan into PyTorch will not.

Hi, just to clarify: I don’t think anyone was talking about turning Stan into PyTorch. At least, I wasn’t talking about that, as I don’t even know what PyTorch is! What I was talking about is what Ben was saying too: it’s not clear to people that a compiled Stan program gives you an R (or Python) function that computes a target function, and that calling this R (or Python) function gives you the function and its gradient.

This is exactly where @andrewgelman was going with this.

This is something we want to add. We’re only just adding 1D integrators and algebraic solvers, so it’s slow going to get the right algorithms that can be differentiated. Is there a feature request somewhere where you detail what you want? One of the other problems we run into is imagining what people want. So concreteness helps, as in "it would be great if Stan had function foo(a, b, c) that was called with these arguments and returned these results."

We haven’t spent a lot of time thinking about how to make Stan easier to program for programmers experimenting with algorithms. Being in C++ is a hurdle, but just having the log density and gradient on the outside isn’t enough.

It’s primarily lack of resources. If we could snap our fingers and have something easy to use for algorithm development, we’d do it! We’d probably even be the main consumers of those utilities.

We built Metropolis and ensemble methods, but couldn’t find a use case where they beat NUTS, so we never merged them. You’re right that our goal isn’t to become something like scikit-learn or Weka with a comprehensive range of inference tools on hand for comparison projects. We’re only going to add something if it’s better than what we have for some sizable set of models that makes it worth including and maintaining it.

HMC vs. EM is easy in Stan if you can code the E step (marginalization) in the Stan program. We do that for a bunch of discrete parameter models in the user’s guide: mixtures, HMMs, change-points, Cormack-Jolly-Seber mark-recapture, etc. In these cases, running our optimize functions does EM because the model computes the E step and optimize does the M step.
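To make the “model computes the E step” idea concrete, here is a minimal Python sketch (my own illustration, not Stan code; all names are made up) of a two-component normal mixture with the discrete indicator summed out via log-sum-exp, so that handing the result to a generic optimizer effectively performs EM’s E step analytically:

```python
import math

def normal_lpdf(y, mu, sigma):
    # log N(y | mu, sigma)
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_sum_exp(a, b):
    # Numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_log_lik(ys, lam, mu1, mu2, sigma):
    # Marginalizes the component indicator z for each observation:
    # log p(y) = log( lam * N(y|mu1,sigma) + (1 - lam) * N(y|mu2,sigma) )
    total = 0.0
    for y in ys:
        total += log_sum_exp(
            math.log(lam) + normal_lpdf(y, mu1, sigma),
            math.log(1.0 - lam) + normal_lpdf(y, mu2, sigma),
        )
    return total
```

Maximizing mixture_log_lik over (lam, mu1, mu2, sigma) climbs the marginal likelihood directly, which is what Stan’s optimizer is doing when the model block codes the marginalization.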

Nothing contributes to the log density behind the scenes for optimization. For sampling, the log Jacobians of constrained variable transforms are added to the target log density, but not for optimization.
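A toy Python illustration of that difference (my sketch, not Stan internals): put an Exponential(1) kernel on a positive parameter sigma, work on the unconstrained scale x with sigma = exp(x), and compare the target with and without the log Jacobian term:

```python
import math

def target(x, include_jacobian):
    # Log-kernel of Exponential(1) on the constrained scale: -sigma (up to a constant).
    sigma = math.exp(x)   # constraining transform from unconstrained x
    lp = -sigma
    if include_jacobian:
        lp += x           # log |d sigma / d x| = x when sigma = exp(x)
    return lp

# Without the Jacobian (optimization), -exp(x) keeps increasing as x -> -inf:
# the maximum corresponds to sigma -> 0, the mode of Exponential(1).
# With the Jacobian (the density sampling uses), x - exp(x) peaks at x = 0,
# i.e. sigma = 1.
```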

Yes. But if it’s decoupled, then it won’t necessarily affect other inferences. In those cases, we can usually move the implementation to generated quantities.

Do you have an example of how that’d work? The only thing that adds to target is ~, because a ~ foo(theta); is synonymous (up to constants) with target += foo_lpdf(a | theta);. But as you point out, there’s no way to induce randomness other than through the log density in Stan.

Are you using tools like TensorFlow or PyTorch to do this kind of thing now? They’re made for playing around with algorithms in Python.

Yes, but that’s what they mean when they say “Pythonic”—that it’s familiar. We’re not going to be turning Stan itself into something that looks Pythonic, though groups have built wrappers around Stan that do, the same way PyMC3 is a wrapper around Theano.

Maybe not controversial, but not actionable until we learn what’s wrong with the current way of doing it. I think @andrewgelman wants something in R that looks more like the behavior of random variables in Stan programs. That’s orthogonal to making these algorithms easier.

I don’t see why that’d be any different for maximum likelihood than MCMC. If you want to experiment with a bunch of models, you don’t want to be deriving gradients and Hessians by hand. I think the bigger drawback is that we don’t have automatic marginal optimization methods, so we can’t just suggest people replace lme4 with Stan’s optimization.

Here, the student used TensorFlow (or PyTorch or whatever they used) rather than implement finite differences.

We’re definitely not competitive in that market.

My original conception is that a Stan program is a function: it maps data y and parameters theta to the joint log density f(y, theta).

In C++ terms, the data is given to the constructor, so f(y) behaves like a functor over the parameters. In lambda-calculus terms, f is curried: you can work with the entire joint log density function f, or partially apply it to the data y to get the posterior log density function f(y).

The signature for the data and parameters arguments are given in their respective blocks and the model block computes the log density. If you start from there, it might be simpler to convey. That’s how I originally conceived and built it. The transformed data and parameters are just placeholders, and the generated quantities is just for posterior predictive generation. You can write all that down functionally, too, and if you do, it follows the way it’s implemented in C++.
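The curried view above can be sketched in a few lines of Python (a toy model of my own invention, not Stan’s actual interface):

```python
def compile_program():
    # Toy "program": y ~ normal(theta, 1); theta ~ normal(0, 1).
    # "Compiling" yields the joint log density as a function of data...
    def with_data(y):
        # ...and supplying the data (the C++ constructor) yields a functor
        # over the parameters alone: the posterior log density up to a constant.
        def log_density(theta):
            return -0.5 * (y - theta) ** 2 - 0.5 * theta ** 2
        return log_density
    return with_data

posterior_lp = compile_program()(2.0)  # condition on the observation y = 2
```

With y = 2 the posterior mode sits at theta = 1, and posterior_lp is exactly the kind of thing you can hand to any optimizer or sampler that only needs a log density up to a constant.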

Those functions can be anything, so I think going down that route to explain Stan will be confusing.

Right. It’s just a lightweight wrapper around the C++ services so we can do I/O.

How do you suggest we make it clearer? The only thing I don’t like about it is that you have to run sampling to get data associated with a model for sampling. I indicated above how I’d like to see it go: a function to compile a Stan program to a joint density, a function to take a joint density and data and return a posterior density, then something that takes that posterior density and runs sampling on it. Right now, all these steps are part of a single function stan(), and there’s not a good way to get the posterior density function with only data; you need to run a sampler for 0 iterations.

This is how I teach it (after first teaching the joint model and how the joint density reduces to something proportional to the posterior density) and it has been successful. Oddly enough it helps to teach what the language was designed for and how it works behind the scenes instead of trying to bootstrap off of existing approaches to teaching Bayes!

Technically the curry doesn’t give the log posterior density but rather something equal up to a constant. That’s fine for the current algorithms, which all require just something in that equivalence class of log densities, but not for anything that requires normalization (like bridge sampling and other external algorithms). I’m not sure if there’s a form of currying that yields an equivalence class of functions (or a member of that equivalence class) instead of a particular function, but that would be the terminology to use.
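Here’s a quick Python check of that equivalence-class point (my sketch): add a constant to the log density and the gradients, which are all that gradient-based sampling and optimization consume, are unchanged:

```python
import math

def log_kernel(theta):
    # Standard normal, dropping the -0.5 * log(2 * pi) constant.
    return -0.5 * theta ** 2

def log_density(theta):
    # The properly normalized version: the same function plus a constant.
    return log_kernel(theta) - 0.5 * math.log(2 * math.pi)

def grad(f, theta, h=1e-6):
    # Central finite difference; the additive constant cancels in the subtraction.
    return (f(theta + h) - f(theta - h)) / (2 * h)
```

Anything that needs the actual normalizing constant (bridge sampling, marginal likelihoods) can tell the two functions apart; gradient-based algorithms cannot.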

Right. The only thing we need is that the model expresses the log density of the desired sampling distribution up to a constant. That’s usually a Bayesian log posterior density defined up to an additive constant through a joint density defined up to an additive constant. But if you unfold things like conjugacy, you don’t necessarily ever have to code the full joint density.

shoshievass:
What would be really cool/helpful (though I don’t know if valuable enough from Stan’s perspective) is if it were possible to build a standalone ‘target’ that can be outputted as Stan function or as a generated quantity.

I’d like to be able to do this: foo <- stan_compile("foo.stan"), and this would create foo(), an R function that takes data and parameters as arguments and returns the target and its gradients. To me, this would make Stan much more transparent and useful for uses beyond simply running NUTS for Bayesian inference.

bgoodri:
If Andrew had just said “It should be easier to evaluate the log-kernel and its gradient from Stan interfaces”, I think that would have been non-controversial.

Maybe not controversial, but not actionable until we learn what’s wrong with the current way of doing it. I think @andrewgelman wants something in R that looks more like the behavior of random variables in Stan programs. That’s orthogonal to making these algorithms easier.

There are two different things I’m asking for. One is to be able to extract summaries or simulations from fitted Stan models that look like lists whose elements have the size and shape of the corresponding parameters in the Stan model. A completely different thing is to be able to treat a compiled Stan model as an R (or Python) function that returns the target and gradient. I think the latter will make Stan more Pythonic or Rthonic. To me, it’s not about the syntax looking the same (although I agree that will help) but rather about having that functionality, so that it’s clear how to use Stan as a convenient way for an R or Python user to compute a target function and its gradient.

bgoodri:
For people who want to do maximum likelihood, having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language. In most cases where someone wants to do maximum likelihood, the gradient and Hessian have already been worked out.

I don’t see why that’d be any different for maximum likelihood than MCMC. If you want to experiment with a bunch of models, you don’t want to be deriving gradients and Hessians by hand.

That’s right.

I think the bigger drawback is that we don’t have automatic marginal optimization methods, so we can’t just suggest people replace lme4 with Stan’s optimization.

GMO!

bgoodri:
those of us who teach the latter notation find it difficult to communicate to students that a Stan program defines a function because the function it defines does not look like a function in R,

But this is my point! With foo <- stan_compile("foo.stan"), or something similar, foo() will look like a function in R!

My original conception is that a Stan program is a function: it maps data y and parameters theta to the joint log density f(y, theta).

In C++ terms, the data is given to the constructor, so f(y) behaves like a functor over the parameters. In lambda-calculus terms, f is curried: you can work with the entire joint log density function f, or partially apply it to the data y to get the posterior log density function f(y).

The signature for the data and parameters arguments are given in their respective blocks and the model block computes the log density. If you start from there, it might be simpler to convey. That’s how I originally conceived and built it. The transformed data and parameters are just placeholders, and the generated quantities is just for posterior predictive generation. You can write all that down functionally, too, and if you do, it follows the way it’s implemented in C++.

andrewgelman:
it’s not clear to people that a compiled Stan program gives you an R (or Python) function that computes a target function, and that calling this R (or Python) function gives you the function and its gradient.

How do you suggest we make it clearer? The only thing I don’t like about it is that you have to run sampling to get data associated with a model for sampling.

I’d like to go with your (Bob’s) plan to separate compilation, computation, and model fitting. Compilation creates this foo() function that returns the target and gradient, and then these can be computed at will. Also, Stan has an optimizer, an ADVI solver, and a NUTS solver, so it can do those too.

I indicated above how I’d like to see it go with a function to compile a Stan program to a joint density, a function to take a joint density and data and return a posterior density, then something that took that posterior density and ran sampling on it. Right now, all these steps are part of a single function stan(), and there’s not a good way to get the posterior density function with only data—you need to run a sampler for 0 iterations.

I agree 100% that it was a mistake to create this stan() function, and this was in part a result of my lack of imagination and understanding.

@andrewgelman Ok, here’s a crude version of your stan_compile() function, but it should only be used for Stan programs with no parameter constraints because it’s really interpreting params in foo(data, params) as being on the unconstrained space. It may have to wait until Stan3 for a convenient way to do this for arbitrary Stan programs.

# @param file Path to Stan program.
# @return A function that takes arguments 'data' (list) and 'params' (numeric vector)
#   and returns a list containing the target (scalar) and its gradient (length
#   depends on the number of params).
stan_compile <- function(file) {
  comp <- rstan::stan_model(file)
  function(data, params) {
    suppressWarnings({
      x <- rstan::sampling(comp, data = data, iter = 1, refresh = 0)
    })
    out <- list()
    grad_and_target <- rstan::grad_log_prob(x, upars = params)
    out$target <- attr(grad_and_target, "log_prob")
    attr(grad_and_target, "log_prob") <- NULL
    out$gradient <- grad_and_target
    out
  }
}

Demonstration

# will use a string for demo but preferably use a separate file
program <- "
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real theta_logit;
}
model {
  target += bernoulli_logit_lpmf(y | theta_logit);
}
"
my_file <- tempfile(fileext = ".stan")
writeLines(program, my_file)

# create foo()
foo <- stan_compile(my_file)

# data to pass to foo()
my_data <- list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

# call foo() with my_data to get target and gradient when theta_logit = 0
target_and_gradient <- foo(data = my_data, params = 0)
print(target_and_gradient)

$target
[1] -6.931472

$gradient
[1] -3

Is that what you were envisioning demonstrating in the case study?

The function I posted above also isn’t as efficient as it should be because it has to do 1 iteration of sampling every time foo is called. If you just wanted foo as a function of params (with data always the same) then we would only need to do 1 iteration once before creating foo, not every time. That’s an artifact of the current RStan implementation that dates back a long time but should be better in v3.

OK, I guess we can set treedepth=1 so it will take minimal time, but in any case the purpose of this exercise (we can talk and elaborate it a bit) is to demonstrate the functionality that we’d like to have in the future. So it’s fine if the implementation is a bit clunky.

That still requires you to provide the constrained parameters in order as a list. I don’t know if there’s an easy way to have structured parameters the way we’d use for inits. You can also drop the unconstrain_parameters thing in the function and do everything on the unconstrained scale.

@Bob_Carpenter I see a data argument to the anonymous function but I don’t see data getting used anywhere. In order to be able to change the data I think you’d need something like this:

stan_compile <- function(file) {
  comp <- stan_model(file)
  function(data, params) {
    suppressWarnings({
      x <- sampling(comp, data = data, chains = 0)
    })
    # should only be used if no constrained parameters since grad_log_prob()
    # actually wants unconstrained params
    grad_and_target <- grad_log_prob(x, upars = params)
    out <- list()
    out$target <- attr(grad_and_target, "log_prob")
    attr(grad_and_target, "log_prob") <- NULL
    out$gradient <- grad_and_target
    out
  }
}

Thanks for catching that. The data needs to go in the call to stan().

Why are you suppressing warnings?

I got really lost in this:

out <- list()
out$target <- attr(grad_and_target, "log_prob")
attr(grad_and_target, "log_prob") <- NULL
out$gradient <- grad_and_target
out

Why not just return grad_and_target? Given that you’re going to make a two-element list, why not pull out just the log density value for $target and just the gradient vector for $gradient? As is, the gradient gets set to the whole gradient-and-target object after nulling out the target. Is it more efficient in R to code it the way you did? And will out$gradient just be a vector, even though it looks like it starts as an object (or list)?

I guess because the target is stored as an attribute of grad_and_target, but for a user it’s easier to see that it’s available if it’s returned as a separate element of a list.

out$gradient will have the same format as grad_and_target, but its log_prob attribute has been nulled out, since it’s returned separately.