Modularity for making Stan more Pythonic and Rthonic

My thoughts on this:

  1. The title of the thread is a red herring that was triggered by a student who happened to be familiar with PyTorch. I don’t think there is much to say here about what is more Pythonic or Rthonic, or whether that is better or worse, but if there is, it should be said by people who have written at least a little bit of code in both Python and R.
  2. If Andrew had just said “It should be easier to evaluate the log-kernel and its gradient from Stan interfaces”, I think that would have been non-controversial. Indeed, we have been planning to do that for RStan since at least 2015, but it is not backward compatible with the 2013 RStan design where calling things like log_prob was an afterthought (and not a particularly well thought out afterthought).
  3. Although we should make it easier to evaluate the log-kernel and its gradient from the Stan interfaces, I don’t think it will have a big effect. It has always been possible to do this in RStan (see the first sketch at the end of this post), I gave a presentation about it at the R conference in Brussels, Dustin tried to go down this road when implementing GMO, etc., and it hasn’t caught on for reasons that I think are deeper than the fact that it is currently clunky to do in RStan and PyStan:
  • For people who want to do maximum likelihood, having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language. In most cases where someone wants to do maximum likelihood, the gradient and Hessian have already been worked out.

  • For people who want to do what Andrew was doing in his class — finding the mode in a toy problem without implementing the gradient and Hessian — having autodiff gradients is not enough of an inducement to write the log-likelihood function in the Stan language because numerical gradients are good enough and allow you to write the log-likelihood function in the interface language.

  • For people who want to do supervised learning, having autodiff gradients is not enough of an inducement to write the objective function in the Stan language because they want to implement specialized optimization algorithms that scale to problems with terabytes of data.

  • 95% of people who use Stan (and the percentage is even higher among people who are first learning Stan) do not recognize a Stan program as defining a function, via a log-kernel, from the things declared in the parameters block to the real numbers. This confusion is exacerbated by people continuing to write (and teach) the theta ~ normal(mu, sigma); notation rather than the target += normal_lpdf(theta | mu, sigma); notation, but even those of us who teach the latter notation find it difficult to communicate to students that a Stan program defines a function, because the function it defines does not look like a function in R, Python, or whatever interface language. When most people think of a function, they think of something that looks like (in R)

normal_pdf <- function(x, mu, sigma) {
  return( 1 / (sigma * sqrt(2 * pi)) *
    exp( -0.5 * ((x - mu) / sigma)^2 ) )
}

When I try to tell people that a Stan program defines a function, they ask “WTF are the arguments to this function and what is returned?”, which are non-obvious unless you look at the generated C++ code.

  • The only case where ideas like this have caught on a little bit (in R; it hasn’t been implemented for Python or Julia yet) is with rstan::expose_stan_functions, in which case the Stan code implementing the function actually looks like the normal_pdf function above (see the second sketch at the end of this post).
  4. If you want to write an R package that utilizes the log-kernel, gradient, Hessian, etc., you are likely better off writing C++ code by hand than Stan code, in which case this vignette is what you need:
    https://cran.r-project.org/package=StanHeaders/vignettes/stanmath.html
  5. CmdStan (or really a shell scripting language) is not set up to do things like this.
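
To make point 3 concrete, here is a minimal sketch of evaluating the log-kernel and its gradient from R with the log_prob, grad_log_prob, and unconstrain_pars methods that RStan already provides for stanfit objects (the toy model and data are made up for illustration; any Stan program works the same way):

library(rstan)

mc <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  target += normal_lpdf(y | mu, sigma);
}
"
fit <- stan(model_code = mc, data = list(N = 3, y = c(-1, 0, 1)),
            chains = 1, iter = 10, refresh = 0)

# the arguments to the function a Stan program defines: the parameters on the
# unconstrained scale (sigma gets log-transformed under the hood)
upars <- unconstrain_pars(fit, list(mu = 0.5, sigma = 1.2))
log_prob(fit, upars)       # what is returned: the log-kernel, a single real number
grad_log_prob(fit, upars)  # and its gradient via autodiff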
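
And here is a minimal sketch of rstan::expose_stan_functions, where the Stan code really does look like the normal_pdf function above (the function shown is just an illustration):

library(rstan)

fc <- "
functions {
  real normal_pdf(real x, real mu, real sigma) {
    return 1 / (sigma * sqrt(2 * pi())) * exp(-0.5 * square((x - mu) / sigma));
  }
}
"
expose_stan_functions(stanc(model_code = fc))
normal_pdf(1, 0, 1)  # now callable from R like any other R function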