Using Stan to compute the gradient of a function

Suppose I’m working in R and I want to compute the gradient of a function, for example log(lambda*normal(theta|mu1,sigma1) + (1-lambda)*normal(theta|mu2,sigma2)). Is there a way to write this as a function in Stan and then call it from R or Python to obtain the gradients?


Because it’s a function that returns a single scalar, you could write a model that increments the target by the result; then the log_density_gradient function from BridgeStan or RStan should give you its gradient. I’m not sure of a method that would generalize to arbitrary functions.

Not ideal, as you have to write Rcpp code: Automatic differentiation in R with Stan Math | A Random Walk.

I’ve been using Julia for all my AD needs. I’m sure others can show you how to do this in Python with JAX. R doesn’t have a great solution today.

I’d like to do it using Stan, because Stan allows me to just write the model using statistics notation. When using Stan, it would be cleanest to just create a function, rather than having to write a full program with data block, parameter block, and model block. For now, though, I guess that will be necessary. I’ll try to get BridgeStan running and see if this works.

I’m currently working on extensions for cmdstanr to handle this automatically; you’re welcome to try the development branch:

remotes::install_github("stan-dev/cmdstanr@function-gradients")

When exposing Stan functions in R, you can now also request to expose the gradients/Jacobians for the functions:

testfuncode <- "
functions {
  real scalar_fun(real lambda, real theta, real mu1, real sigma1, real mu2, real sigma2) {
    return log_mix(lambda,
                  normal_lpdf(theta | mu1, sigma1),
                  normal_lpdf(theta | mu2, sigma2));
  }
  
  matrix matrix_fun(vector x, row_vector y) {
    return x * y;
  }
}
"
mod <- cmdstan_model(write_stan_file(testfuncode), force_recompile = TRUE)
mod$expose_functions(gradients = TRUE)

# Gradient functions now available with "_grad" suffix:

> mod$functions$scalar_fun_grad(lambda = 0.5, theta = 5.1,
                                mu1 = -1.5, sigma1 = 3.6,
                                mu2 = 4.9, sigma2 = 0.5)
$function_value
[1] -0.9712985

$jacobian
          [,1]       [,2]       [,3]      [,4]      [,5]      [,6]
[1,] -1.890954 -0.7920739 0.01388322 0.0178799 0.7781907 -1.634201



> mod$functions$matrix_fun_grad(c(1.4, 6.1), c(3,1))
$function_value
     [,1] [,2]
[1,]  4.2  1.4
[2,] 18.3  6.1

$jacobian
     [,1] [,2] [,3] [,4]
[1,]    3    0  1.4  0.0
[2,]    0    3  6.1  0.0
[3,]    1    0  0.0  1.4
[4,]    0    1  0.0  6.1

It’s still a bit rough, but let me know if you run into any issues.


Amazing!


Nice! Any thoughts on the difficulty of including precompiled functions and gradients in a package?

@spinkney Just to note that R does have a good solution for getting the gradient of statistical models using the RTMB package. For example, here is a simple linear regression:
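
Below is a minimal sketch of that idea; the simulated data, parameter names, and regression model are illustrative stand-ins rather than the original posted example, but the MakeADFun/obj$gr workflow is standard RTMB.

library(RTMB)

set.seed(1)
x <- 1:20
data <- list(x = x, y = 2 + 0.5 * x + rnorm(20, sd = 0.3))
parameters <- list(a = 0, b = 0, logsigma = 0)

f <- function(params) {
  getAll(data, params)
  sigma <- exp(logsigma)                  # keep sigma positive
  -sum(dnorm(y, a + b * x, sigma, TRUE))  # negative log-likelihood
}

obj <- MakeADFun(f, parameters, silent = TRUE)
obj$fn(obj$par)                           # negative log-likelihood at the initial values
obj$gr(obj$par)                           # AD gradient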

Once obj is defined (as in the sketch above), you can just call obj$gr(pars) to get the AD gradient for any parameter vector pars. This scales to much more complex models as well, including the gradient of the marginal likelihood under the Laplace approximation.

See the RTMB intro here.


Does this work to differentiate matrices and matrix decompositions like Cholesky, SVD, LU, etc.?

Yes, I think this is the kind of Cholesky decomposition and matrix multiplication example you mean:
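
The following is a minimal, illustrative sketch (the simulated data, the dmvnorm-based likelihood, and the parameter names are assumptions of mine, not the original example): it parameterizes a bivariate normal through its Cholesky factor L and differentiates the negative log-likelihood with respect to the entries of L.

library(RTMB)

set.seed(1)
data <- list(y = matrix(rnorm(40), ncol = 2))   # 20 bivariate, mean-zero observations
parameters <- list(L = diag(2))                 # 2x2 matrix parameter, used as a Cholesky factor

f <- function(params) {
  getAll(data, params)
  L <- L * lower.tri(L, diag = TRUE)            # zero out the upper triangle
  Sigma <- L %*% t(L)                           # covariance built by matrix multiplication
  nll <- 0
  for (i in 1:nrow(y))
    nll <- nll - dmvnorm(y[i, ], rep(0, 2), Sigma, log = TRUE)
  nll                                           # total negative log-likelihood
}

obj <- MakeADFun(f, parameters, silent = TRUE)
obj$gr(obj$par)                                 # AD gradient with respect to the entries of L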

As before, once the obj is defined in R, you can get the gradient of the marginal likelihood with obj$gr(par) for any vector of fixed-effect parameters par.

Yes. But don’t write sums on the log scale like that; it’s not numerically stable, due to potential underflow in the normals and loss of precision in the implicit exp/log cycle. The precise and stable way to write this is as follows.

log_sum_exp(log(lambda) + normal_lpdf(theta | mu1, sigma1),
            log1m(lambda) + normal_lpdf(theta | mu2, sigma2));
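
This computes the same quantity, because log(lambda * p1 + (1 - lambda) * p2) = log_sum_exp(log(lambda) + log(p1), log1m(lambda) + log(p2)); log_sum_exp pulls the largest term out before exponentiating, so the component densities are never evaluated on the linear scale where they can underflow.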

This pattern is so common that we have convenient shorthand:

log_mix(lambda, normal_lpdf(theta | mu1, sigma1), normal_lpdf(theta | mu2, sigma2))

So you can write this program in Stan:

parameters {
  real theta;
  real lambda;
  real mu1, mu2;
  real sigma1, sigma2;
}
model {
  target += log_mix(lambda,
                    normal_lpdf(theta | mu1, sigma1),
                    normal_lpdf(theta | mu2, sigma2));   
}

and then you can call it like this in BridgeStan in Python:

import bridgestan as bs
import numpy as np

model = bs.StanModel('mixture.stan')
theta = np.array([1.5, 0.3, 1, 5, 0.3, 1.8])
val, grad = model.log_density_gradient(theta)
print(f"{val=} {grad=}")

And if you run it, you get the following output:

val=-2.096344137466215 grad=array([-4.29065754,  2.42563243,  4.49657117, -0.20591363,  4.79634258,
        0.29448918])

I intentionally did not include constraints, so there are no implicit transforms plus Jacobians. But that means you have to provide inputs satisfying the relevant constraints (sigma1, sigma2 > 0 and lambda in (0, 1)).

As @WardBrian points out, this is only going to work for scalar-valued functions. We don’t have a way to return general Jacobians this way, though Stan’s autodiff can compute them in C++.


wow, this is great!

@Bob_Carpenter Why is this a function with respect to theta and not the mu and sigma parameters? Isn’t theta here the observations? In R, wouldn’t it be

log(lambda * dnorm(theta, mu1, sigma1, FALSE) + (1 - lambda) * dnorm(theta, mu2, sigma2, FALSE))

…ignoring the overflow issue you noted?

I’m looking for something like this: given a function f(a, b, X), what is the derivative of f with respect to X, where X is a matrix? For example, can it compute the gradient of the multivariate normal Cholesky parameterization with respect to the Cholesky factor L?

In Bob’s complete Stan program, he explicitly declares theta as a parameter. Whether theta is a parameter or data depends entirely on the modeling situation; if theta were declared as data instead, the gradient would be taken only with respect to the remaining parameters.

Yup.

@spinkney I don’t know of a way in the normal interface, but it’s possible you can with some digging. I suggest asking on the RTMB issue board to get a better answer.

@jsocolar thanks for clarifying. I realize this is the Stan board, but FYI below is quick R code that shows how to get the gradient in this particular example.

library(RTMB)
theta <- c(1.5, 0.3, 1, 5, 0.3, 1.8)
data <- list(theta=theta)
parameters <- list(mu1=0,sigma1=2, mu2=5,sigma2=1.5, lambda=.5)
f <- function(params){
  getAll(data,params)
  nll <- -log(lambda*dnorm(theta,mu1,sigma1,FALSE)+
              (1-lambda)*dnorm(theta,mu2, sigma2,FALSE))
  sum(nll)                              # total negative log-likelihood
}
f(parameters)                           # NLL evaluated in plain R
obj <- MakeADFun(f, parameters, silent = TRUE)
par <- c(0, 3, 5, 1, .5)                # arbitrary parameter vector (mu1, sigma1, mu2, sigma2, lambda)
obj$fn(par)                             # NLL from TMB
obj$gr(par)                             # AD gradient of NLL
obj$he(par)                             # AD Hessian
