Loosely speaking, in this case we want to write a C++ function (that can be called from a Stan program) to evaluate a log-likelihood that
- Takes in all the necessary arguments
- Circumvents the usual autodifferentiation mechanism in Stan
- Sums up the
n
contributions to the log-likelihood in a numerically stable way
- Calculates the gradient of that log-likelihood value with respect to all of the parameters
- When returning, tells Stan what that gradient was for the calculations we just did while circumventing the usual autodiff mechanism.
But no one dives into modeling with Stan like that, which is why I was trying to start with a wikipedia-level Cox model that could be implemented just with Stan syntax. There is a lot of documentation for the Stan Math library and writing a Stan program as a text file. However, all of that is for the 99.99% of the time where the user is not the least bit concerned with derivatives because the autodifferentiation mechanism is handling them.
Maybe let’s start with writing a R function — which can be O(n) or O(n^2) or whatever — that evaluates the log-likelihood you want, and I can explain in terms of R what would be analogous to the C++. Suppose in R we had a S4 class called var
var <- setClass("var", slots = c(value = "numeric", derivatives = "externalptr"))
that is basically just a wrapper around a numeric vector but it also contains an external pointer to somewhere that tells you what the derivative of some (e.g. log-likelihood) function is with respect to this var
, evaluated at var@value
. We can then construct random instances of a var
:
alpha <- new("var", value = rnorm(1)) # but derivatives is uninitialized here
beta <- new("var", value = rnorm(5))
theta <- new("var", value = abs(rnorm(1)))
This is a lot like the var
C++ class in the Stan Math library that is used for unknown parameters, which automatically updates the external pointer with another link in the chain rule whenever some function is applied to a var
.
Except in cases like yours, we don’t want to use autodifferentiation and instead want to supply analytical gradients ourselves. So, we need a mechanism in R to coerce a var
back into a regular double-precision thing
setAs(from = "var", to = "numeric", def = function(from) return(from@value))
as(alpha, "numeric")
that we can then use without triggering any updating of the autodifferentiation mechanism until we are ready to do so.
The outline of a R function that corresponds to the outline of the C++ function I wrote before would be
log_lik <- function(alpha, # a var scalar
beta, # a var vector
theta, # a var scalar
y, # a numeric vector (or later a Surv)
X, # a dense numeric matrix
Z, # a dense numeric matrix (or later a sparse Matrix)
order) { # an integer vector that has to do with how you sum
# as() in R is basically the same as stan::math::value_of() in C++
alpha_ <- as(alpha, "numeric")
beta_ <- as(beta, "numeric")
theta_ <- as(theta, "numeric")
# initialize the scalar value for the log-likelihood and a gradient vector
log_lik <- 0
gradient <- rep(NA_real_, times = 2 + rows(beta))
# Do complicated stuff to increment log_lik over the data and fill in gradient
# using alpha_, beta_, theta_ (with underscores) rather than alpha, beta, theta
# glue all the parameters together and return a var
params <- vector("list", length = 2 + rows(beta))
params[[1]] <- alpha
for (k in 1:rows(beta)) params[[k + 1]] <- beta[k]
params[[2 +rows(beta)]] <- theta
return(precomputed_gradients(log_lik, params, gradient))
}
Here precomputed_gradients
is just a utility function implemented elsewhere that looks something like
precomputed_gradients <- function(value, # numeric scalar
params, # a list of length K containing vars
gradient) { # a numeric vector of length K
# manually update an external pointer called xPtr in light of params and gradient
return(new("var", value = value, derivatives = xPtr))
}
So, the only piece that we are missing is how to “Do complicated stuff to increment log_lik over the data and fill in gradient using alpha_, beta_, and theta_ (with underscores) rather than alpha, beta, theta”. I know it involves y
, X
, and Z
and some order
that tells you how to keep numerical stability when you loop over the data, but I don’t know the statistics here. If you could fill that part in for me in R or C, I could map that to C++. Really the C++ for that part is not going to be much different than regular C, except we can utilize the Eigen matrix library (which would be like using the Matrix package in R rather than a plain two-dimensional array
).