Using Stan to compute the gradient of a function

Suppose I’m working in R and I want to compute the gradient of a function, for example log(lambda*normal(theta|mu1,sigma1) + (1-lambda)*normal(theta|mu2,sigma2)). Is there a way to write this as a function in Stan and then call it from R or Python to obtain the gradients?


Because it’s a function that returns a single scalar, you could write a model that increments the target by the result; then the log_density_gradient function from BridgeStan or RStan should give you its gradient. I’m not sure of a method that would generalize to arbitrary functions.

Not ideal, as you have to write Rcpp code: Automatic differentiation in R with Stan Math | A Random Walk.

I’ve been using Julia for all my AD needs. I’m sure others can show you how to do this in Python with JAX. R doesn’t have a great solution today.

I’d like to do it using Stan, because Stan allows me to just write the model using statistics notation. When using Stan, it would be cleanest to just create a function, rather than having to write a full program with data block, parameter block, and model block. For now, though, I guess that will be necessary. I’ll try to get BridgeStan running and see if this works.

I’m currently working on extensions for cmdstanr to handle this automatically; you’re welcome to try the development branch:

remotes::install_github("stan-dev/cmdstanr@function-gradients")

When exposing Stan functions in R, you can now also request to expose the gradients/Jacobians for the functions:

testfuncode <- "
functions {
  real scalar_fun(real lambda, real theta, real mu1, real sigma1, real mu2, real sigma2) {
    return log_mix(lambda,
                  normal_lpdf(theta | mu1, sigma1),
                  normal_lpdf(theta | mu2, sigma2));
  }
  
  matrix matrix_fun(vector x, row_vector y) {
    return x * y;
  }
}
"
mod <- cmdstan_model(write_stan_file(testfuncode), force_recompile = TRUE)
mod$expose_functions(gradients = TRUE)

# Gradient functions now available with "_grad" suffix:

> mod$functions$scalar_fun_grad(lambda = 0.5, theta = 5.1,
                                mu1 = -1.5, sigma1 = 3.6,
                                mu2 = 4.9, sigma2 = 0.5)
$function_value
[1] -0.9712985

$jacobian
          [,1]       [,2]       [,3]      [,4]      [,5]      [,6]
[1,] -1.890954 -0.7920739 0.01388322 0.0178799 0.7781907 -1.634201



> mod$functions$matrix_fun_grad(c(1.4, 6.1), c(3,1))
$function_value
     [,1] [,2]
[1,]  4.2  1.4
[2,] 18.3  6.1

$jacobian
     [,1] [,2] [,3] [,4]
[1,]    3    0  1.4  0.0
[2,]    0    3  6.1  0.0
[3,]    1    0  0.0  1.4
[4,]    0    1  0.0  6.1

It’s still a bit rough, but let me know if you run into any issues.


Amazing!


Nice! Any thoughts on the difficulty of including precompiled functions and gradients in a package?

@spinkney Just to note that R does have a good solution for getting the gradient of statistical models using the RTMB package. For example, here is a simple linear regression:
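
Below is a minimal sketch of that idea; the simulated data, parameter names, and regression model are illustrative stand-ins rather than the original posted example, but the MakeADFun/obj$gr workflow is standard RTMB.

library(RTMB)

set.seed(1)
x <- 1:20
data <- list(x = x, y = 2 + 0.5 * x + rnorm(20, sd = 0.3))
parameters <- list(a = 0, b = 0, logsigma = 0)

f <- function(params) {
  getAll(data, params)
  sigma <- exp(logsigma)                  # keep sigma positive
  -sum(dnorm(y, a + b * x, sigma, TRUE))  # negative log-likelihood
}

obj <- MakeADFun(f, parameters, silent = TRUE)
obj$fn(obj$par)                           # negative log-likelihood at the initial values
obj$gr(obj$par)                           # AD gradient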

Once obj is defined (as in the sketch above), you can just call obj$gr(pars) to get the AD gradient for any parameter vector pars. This scales to much more complex models as well, including the gradient of the marginal likelihood under the Laplace approximation.

See the RTMB intro here.


Does this work to differentiate matrices and matrix decompositions like Cholesky, SVD, LU, etc.?

Yes, I think this is the kind of Cholesky decomposition and matrix multiplication example you mean:
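
The following is a minimal, illustrative sketch (the simulated data, the dmvnorm-based likelihood, and the parameter names are assumptions of mine, not the original example): it parameterizes a bivariate normal through its Cholesky factor L and differentiates the negative log-likelihood with respect to the entries of L.

library(RTMB)

set.seed(1)
data <- list(y = matrix(rnorm(40), ncol = 2))   # 20 bivariate, mean-zero observations
parameters <- list(L = diag(2))                 # 2x2 matrix parameter, used as a Cholesky factor

f <- function(params) {
  getAll(data, params)
  L <- L * lower.tri(L, diag = TRUE)            # zero out the upper triangle
  Sigma <- L %*% t(L)                           # covariance built by matrix multiplication
  nll <- 0
  for (i in 1:nrow(y))
    nll <- nll - dmvnorm(y[i, ], rep(0, 2), Sigma, log = TRUE)
  nll                                           # total negative log-likelihood
}

obj <- MakeADFun(f, parameters, silent = TRUE)
obj$gr(obj$par)                                 # AD gradient with respect to the entries of L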

As before, once the obj is defined in R, you can get the gradient of the marginal likelihood with obj$gr(par) for any vector of fixed-effect parameters par.

Yes. But don’t write sums on the log scale like that; it’s not numerically stable, due to potential underflow in the normals and loss of precision in the implicit exp/log cycle. The precise and stable way to write this is as follows.

log_sum_exp(log(lambda) + normal_lpdf(theta | mu1, sigma1),
            log1m(lambda) + normal_lpdf(theta | mu2, sigma2));
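
This computes the same quantity, because log(lambda * p1 + (1 - lambda) * p2) = log_sum_exp(log(lambda) + log(p1), log1m(lambda) + log(p2)); log_sum_exp pulls the largest term out before exponentiating, so the component densities are never evaluated on the linear scale where they can underflow.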

This pattern is so common that we have convenient shorthand:

log_mix(lambda, normal_lpdf(theta | mu1, sigma1), normal_lpdf(theta | mu2, sigma2))

So you can write this program in Stan:

parameters {
  real theta;
  real lambda;
  real mu1, mu2;
  real sigma1, sigma2;
}
model {
  target += log_mix(lambda,
                    normal_lpdf(theta | mu1, sigma1),
                    normal_lpdf(theta | mu2, sigma2));   
}

and then you can call it like this in BridgeStan in Python:

import bridgestan as bs
import numpy as np

model = bs.StanModel('mixture.stan')
theta = np.array([1.5, 0.3, 1, 5, 0.3, 1.8])
val, grad = model.log_density_gradient(theta)
print(f"{val=} {grad=}")

And if you run it, you get the following output:

val=-2.096344137466215 grad=array([-4.29065754,  2.42563243,  4.49657117, -0.20591363,  4.79634258,
        0.29448918])

I intentionally did not include constraints, so there are no implicit transforms plus Jacobians. But that means you have to provide inputs satisfying the relevant constraints (sigma1, sigma2 > 0 and lambda in (0, 1)).

As @WardBrian points out, this is only going to work for scalar-valued functions. We don’t have a way to return general Jacobians this way, though Stan’s autodiff can compute them in C++.


wow, this is great!

@Bob_Carpenter Why is this a function with respect to theta and not the mu and sigma parameters? Isn’t theta here the observations? In R, wouldn’t it be

log(lambda * dnorm(theta, mu1, sigma1, FALSE) + (1 - lambda) * dnorm(theta, mu2, sigma2, FALSE))

…ignoring the overflow issue you noted?

I’m looking for something like this: given a function f(a, b, X), what is the derivative of f with respect to X, where X is a matrix? For example, can it compute the gradient of the multivariate normal Cholesky parameterization with respect to the Cholesky factor L?

In Bob’s complete Stan program, he explicitly declares theta as a parameter. Whether theta is a parameter or data depends entirely on the modeling situation; if theta were declared as data instead, the gradient would be taken only with respect to the remaining parameters.

Yup.

@spinkney I don’t know of a way in the normal interface, but it’s possible you can with some digging. I suggest asking on the RTMB issue board to get a better answer.

@jsocolar thanks for clarifying. I realize this is the Stan board, but FYI below is quick R code that shows how to get the gradient in this particular example.

library(RTMB)
theta <- c(1.5, 0.3, 1, 5, 0.3, 1.8)
data <- list(theta=theta)
parameters <- list(mu1=0,sigma1=2, mu2=5,sigma2=1.5, lambda=.5)
f <- function(params){
  getAll(data,params)
  nll <- -log(lambda*dnorm(theta,mu1,sigma1,FALSE)+
              (1-lambda)*dnorm(theta,mu2, sigma2,FALSE))
  sum(nll)                              # total negative log-likelihood
}
f(parameters)                           # NLL evaluated in plain R
obj <- MakeADFun(f, parameters, silent = TRUE)
par <- c(0, 3, 5, 1, .5)                # arbitrary parameter vector (mu1, sigma1, mu2, sigma2, lambda)
obj$fn(par)                             # NLL from TMB
obj$gr(par)                             # AD gradient of NLL
obj$he(par)                             # AD Hessian
