What to read -- interfacing the coxme package to STAN


#1

Terry Therneau here, author of the survival and coxme packages in R.

I am beginning to think about the next version of coxme = mixed effects Cox models. The function does a few things well and some others okay, but it has a limited palette and it looks to me like the sensible way to increase that is a move to MCMC methods. Four defining aspects are a. for some of my use cases the correlation matrices are large and sparse, e.g. the kinship matrix, and for these the likelihood and first derivative have fast efficient algorithms while the second derivative is a PITA, b. I have well tested code to evaluate the likelihood for Cox models (20 years of user support experience includes nearly every edge case imaginable), c. a+b suggests plugging in to a Hamiltonian MC library in some way as my back end.

STAN would appear to be a great candidate for c, but before hassling the list with detailed questions I’ll start with the most basic: what existing materials/documents could I read to begin thinking about this more concretely?

Footnote: The Cox partial likelihood (PL) is very well behaved numerically; it is almost (but not quite) as quadratic as the Gaussian within 2-4 se of the MLE. So MCMC should work well. Computation of the PL with speed and accuracy turns out to be all about bookkeeping. Subtle, dull, meticulous tracking of who is in or out of the risk set at any given time.

Kinship matrix: one row/col per subject with the expected amount of “common ancestor” genes; K[i,j] = .5 for mother/child or sib/sib; .25 for uncle/neice or grandparent/grandchild; 0 for unrelated; etc.


#2

Hi Terry. Good to see you around here. There has been a lot of interest in making it easier to do survival models in Stan but not a lot of expertise among core Stan developers, so we have been more reliant on user-developers to get things moving in the right direction. We’re in the process of merging a PR into rstantools that hopefully will make it easier to add Stan functionality to an existing R package (as opposed to starting from scratch) but you don’t need to wait for that to get started on the R and Stan code you will need.

One thing to start with might be how I imitated survival::clogit in rstanarm::stan_clogit. There is a user-facing R file that inputs a formula, a data.frame, etc. and sets up the design matrices:


Some differences are that strata is a separate argument and that I used lme4 to set up the Z matrix in the linear predictor \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \mathbf{b} but I guess you already have a lot of code to do that in coxme. It then hands off to stan_glm.fit, which is kind of a beast of a function that a lot of things go through, but it ends up calling a Stan program for Bernoulli models with a clogit flag to tell it how to evaluate the log-likelihood. I wrote a couple of Stan functions to evaluate the log-likelihood at

Such functions can be exposed to the R environment using the rstan::expose_stan_functions() function, so you can and should test that all of your weird edge cases imply the same log-likelihood whether evaluated in your old R code and your new Stan code as in

I didn’t even implement the gradient for the clogit model, but it would execute faster had I done so instead of relying on the autodiff. I would suggest to get things working with the likelihood function first and then worry about the gradient later because you would have to write some C++ to do that. Similarly with the kinship matrices, you can get things to work somewhat inefficiently in the Stan language by making them dense and then we can worry about using Eigen or maybe some custom C++ to take advantage of the sparsity.

I should also say that @sambrilleman who implemented a lot of fantastic stuff for joint (survival + severity) models in rstanarm recently put up another branch with a bunch of “simpler” survival models that has a lot of complicated code to deal with all the things that make simple survival models complicated.

He and @ermeel have been talking a lot on Discourse and GitHub about various things including less-parametric functions for the baseline hazard rate. Finally, @jackinovik has written a Python library for survival models in Stan that you may want to look at

But certainly post when more questions come up.


#3

A few small questions and comments interspersed. The overall comment is that you almost cannot underestimate how little I know about STAN details.

What is a PR?  What is rstantools? Is there a reason everything is double spaced?  Was I supposed to learn something from this code fragment?   (Don't mean to complain, but it feels like a few sentences from the middle of a movie I've never seen).   The rest of the email is completely impenetrable to me.  I have no idea what "D_g", "vector" etc are, since I have no context.  Where can I find out the overall context of how an R interface would work -- think of a 1000 foot view? None of this is going to be the least help to me without an intro. Terry

#4

Ben

Hi Terry. Good to see you around here. There has been a lot of interest in making it
easier to do survival models in Stan but not a lot of expertise among core Stan
developers, so we have been more reliant on user-developers to get things moving in the
right direction. We’re in the process of merging a PR into rstantools that hopefully
will make it easier to add Stan functionality to an existing R package (as opposed to
starting from scratch) but you don’t need to wait for that to get started on the R and
Stan code you will need.

One thing to start with might be how I imitated |survival::clogit| in

rstanarm::stan_clogit|. There is a user-facing R file that inputs a formula, a
data.frame, etc. and sets up the design matrices:

A longer set of questions based on your prior message

  1. I’ll assume that stan does not have an evaluation function for the Cox partial likelihood that will pass all my tests. (I have an entire directory of “kill the evaluation” tests, mostly derived from user examples that did just that.)
    a. Where and how does this new code get added? Part of STAN proper of something my code plugs in just for itself?
    b. What language does it have to be in (I use C)? What headers?

  2. Since I won’t need second derivatives or Newton-Raphson in the STAN case, I can boil this down to two functions, one for the case of (time, status) data and one for (time1, time2, status) data. Input arguments to the first routine will be time, status, eta, strata, order; the second has two ordering vectors. Time and status are as expected, eta is the linear predictor (X\beta + Zb + offset) for each subject, strata in integer vector of stratum identifiers, and order contains sorting information. Output would be a partial likelihood and a vector of martingale residuals m; the first deriviative wrt the parameters is X’m where X is the covariate matrix. (The fact that the first derivative has this simple form, and that I only need eta, means that it can be extra fast when X is sparse.)
    How to best plug this into the STAN framework?
    Computation of eta and the derivatives seems like a task well suited to the parent that calls it, but of course that could be bundled in.

  3. I looked at rstantools, but the example left a lot unanswered, e.g., what is passed back from STAN, how are errors handled, how are options passed (if I wanted it to be less chatty for instance)? I was missing an overview.

  4. The random effects part of a coxme formula looks just like the one from lmer (because I copied it). That sort of flexibility does not seem to fit into the requirement that “the stan function must be precompiled”.

Terry T.


#5

OK, sorry! Let me recalibrate my original reply and then I’ll answer your other questions in separate followups.

The basic steps of using Stan are:

  1. A human writes a program in the Stan language as a text file (somewhat similar to BUGS) that defines the logarithm of the posterior kernel, i.e. the log-likelihood plus the logarithm of the prior density function. If we were merely optimizing, then the logarithm of the prior density function would serve as a penalty function.
  2. There is an automatic translation from this Stan program to an equivalent C++ program that you don’t need to worry much about.
  3. R packages call these compiled C++ programs (from step 2) to draw from the posterior distribution (defined by step 1). This is similar to using the .C() or .Call() functions in R, except that it actually uses the Rcpp package.

The good news is that it is (or at least can be) simpler than the C code you have in coxme because you don’t have to worry about handling the iterations, allocating or freeing memory, SEXPs, or any of that. Just worry about how to write down the log-likelihood and the prior log-densities.

As an example of step 1, here is a simple Stan program for a Weibull GLM with known shape, a log link, and standard normal priors on the unknown coefficients:

data {
  int<lower = 0> N;      // number of observations
  int<lower = 0> K;      // number of covariates
  matrix[N, K] X;        // design matrix
  vector<lower=0>[N] y;  // outcome
  real<lower = 0> shape; // shape parameter
}
transformed data {
  real<lower = 0> gamma_fn = tgamma(1 + 1 / shape);
}
parameters {
  real alpha;         // intercept
  vector[K] beta;     // coefficients
}
model {
  vector[N] eta = alpha + X * beta;
  vector[N] mu = exp(eta);
  vector[N] scale = mu / gamma_fn;

  target += weibull_lpdf(y | shape, scale); // log-likelihood
  target += std_normal_lpdf(alpha);         // prior on alpha
  target += std_normal_lpdf(beta);          // iid priors on the elements of beta
}

You can think of the model block of a Stan program as a function of the parameters (where the data and transformed data blocks are in scope but not changing) that implicitly returns target, which is automatically initialized to zero and gets incremented by the log-likelihood and the log-density of the priors (possibly ignoring constants).

One could autotranslate that Stan program into C++, compile it, and call it from R by passing a named list with elements that correspond to the symbol names in the data block, namely N, K, X, y, and shape. But we can get to that part another day.

OK. Let’s say you want to do something like this for a Cox model with Gaussian frailties. There are different ways of going about that, but here I am going to treat the frailties as if they were missing data (with an unknown scale parameter theta) and estimate their joint posterior distribution along with the intercept (alpha) and coefficients (beta):

data {
  int<lower = 0> N;      // number of observations
  int<lower = 0> K;      // number of covariates
  matrix[N, K] X;        // design matrix
  int<lower = 0> q;      // number of families
  matrix[N, q] Z;        // we'll make this a sparse matrix later
  positive_ordered[N] y; // outcome
  /* NOTE: I am requiring y to be sorted from smallest to largest and the rows
   of X and Z are assumed to have been sorted according to y, which makes
   evaluating the partial likelihood more efficient with tail() calls
   */
   
  // known inputs to priors on alpha, beta, and theta (in the parameters block)
  real alpha_loc;
  real<lower = 0> alpha_scale;
  vector[K] beta_loc;
  vector<lower = 0>[K] beta_scale;
  real<lower = 0> theta_rate;
}
parameters {
  real alpha;            // intercept
  vector[K] beta;        // coefficients
  vector[q] omega_std;   // frailties with unit scale
  real<lower = 0> theta; // scale for frailties
}
model {
  // frailties are implicitly theta * omega_std due to scale invariance
  vector[N] eta = alpha + X * beta + Z * (theta * omega_std);
  
  // partial likelihood contributions
  for (n in 1:(N - 1)) target += eta[n] - log_sum_exp(tail(eta, N - n + 1));
  
  // prior on observation specific parameters
  target += std_normal_lpdf(omega_std);
  
  // priors on common parameters (can change to whatever)
  target += normal_lpdf(alpha | alpha_loc, alpha_scale);
  target += normal_lpdf(beta  | beta_loc, beta_scale);
  target += exponential_lpdf(theta | theta_rate);
}

Now, we could get clever with sparse Z, ties, Laplace approximations, etc. but let’s make sure I didn’t lose you again before we go onto that.


#6

Brief answers to other questions

Pull Request (on GitHub) to add survival models to the existing (on CRAN) rstanarm package.

Another package on CRAN that has a rstan_package.skeleton function that is analogous to the package.skeleton function in utils , which sets up the directories so that one could build a new R package that utilizes Stan. We’re shortly going to upload a new version to CRAN that makes it easier to do so for an existing R package.

No

Essentially, the two options are the Stan language or C++. These are actually the same because a program written in the Stan language gets translated to C++, and you can call a C++ function from the Stan language. In principle, you could wrap a call to a C function and compile it with the C++ compiler, but your C function would have to compute gradients internally (it is way easier to let Stan handle gradients through its autodifferentiation
capabilities).

Correct. Such functions can either be housed in a R package or, if they are written in C++, can be contributed upstream to the Stan Math library. But one of the things I was alluding to in my original email is that functions written in the Stan language can be exported to R’s global workspace and called so that you can make sure the return value matches the values from your test suite.

It is all doable but looking too far ahead at the moment.

Its vignette is the main overview, but that is for the narrow but tedious task of getting a package that uses Stan prepared to upload for CRAN. What I failed to articulate originally is the basics of writing a program in the Stan language (and maybe some C++) that can then be put into a R package.

Right. The lme4 developers did us all a great service by writing R functions that can produce a wide variety of (possibly sparse) Z matrices in mixed effects models. Several R packages are constructing those Z matrices at runtime and passing them to a compiled function, much like you would do with .C() or .Call() except easier.

If Z were dense, then you could just pass its number of rows (integer), number of columns (integer), and elements (numeric vector) to the compiled function, which will dynamically allocate memory to copy it. Since Z is sparse, you can save RAM — and more importantly avoid a bunch of multiplications by zero — by representing it in compressed row storage form (there is a function facilitating this in the rstan R package) which is pretty much what the Matrix R package does. We can get to that latter but basically you end up passing to the compiled C++ the number of rows (integer), number of columns (integer), non-zero elements (numeric vector), and a bunch more indices (integer vectors) that indicate where in Z the non-zero elements are.


#7

Ben,

   Thanks for the patient answers.  We are quickly iterating towards the right conversation space.   Probably a bit faster and more fun over a beer... 

  1. I've written STAN programs to solve hierarchical Gaussian problems for an accelerated failure time model.   So I have a reasonable understanding of setting up a user level call.  Plugging into it as a back end is different.   Stan didn't work so well for that prior problem, but I attribute it partly to the "new beau" effect: a data set where none of the methods in your current toolbox are working drives you to finally try that new approach you heard of, and gee, it has trouble too.

  2. A primary question is how to even fit the Cox partial likelihood into the STAN paradym, since the partial likelihood does not factor into per-subject terms.  PL(set of obs) !=  sum_{i=1}^n  f(y_i), for any function f.  Your psuedo code below simply doesn't parse mathematically.

  // partial likelihood contributions

  for (n in 1:(N - 1)) target += eta[n] + log_sum_exp(tail(eta, n - 1));

  target += eta[N];

 

      One could instead do something "Cox like" with a parameterized baseline hazard, and then you can write it as a sum.  But that's not what I'm after.  (Whether the actual Cox PL or this other is "better" is an interesting side discussion.)

  Do we a. write it some other way   b. 'fake it'; the user writes it as  target += CoxPL(y_i, status_i) but the right thing is done behind the scenes or c. tell me up front that it can't be done.

  3. Secondary will be how to plug the function in (technical) and how to deal with the setup of creating the order vectors,  which you don't wan to redo at every MCMC evaluation. 

  4. Having my function build the variance matrix for the random effects and then pass it to Stan is a nice idea.  I can see how to map some models to a single Stan structure:  1|id and  1| hospital/center say, but not all the possible random effects.  

  5. I should have broken the sparse discussion into two parts.  More importantly, we should treat it as a footnote for now with details at some later time.  A first implementation doesn't need it.

    a. A sparse X matrix is one thing.  It's easy to handle as I laid out earlier: the PL only needs eta=X beta  (both fixed and random in one term for shorthand), and the first deriviative is mX where m is the vector of martingale residuals.  I expect that to fit into some general framework that you all have worked out.   When there is a random intercept per subject I would guess that STAN doesn't keep an n by n identity matrix around for the Zb term.  

    b. The kinship matrix K is an n by n covariance matrix for the per-subject random effects.  It is block diagonal with one block per family, zeros off the block diagonal.  My motivating example has 26000 subjects, 99.7% of the entries in K are zero.  Its stored as a Matrix object (compressed row storage).   It sounds like efficient handling for that part is already present.

#8

OK, basically all I know about Cox models I got from


So, just so I’m clear, you are saying that there aren’t any models that are interesting to you where the complete data likelihood factors?

I do realize that lme4 and most other implementations of hierarchical models are treating the coefficients on Z as error structure and are integrating them out of the complete data likelihood to obtain a likelihood function of alpha, beta, and theta only, which does not factor. We use non-factorable likelihoods a lot in Stan programs with things like Gaussian process regressions that integrate out the unknown functions analytically.

It sounds like it would be way easier to create the ordering vectors in R and pass them to the data block of the Stan program and then pass them along to whatever functions are needed. If that is not true, we could create them inside a Stan program. Here is a signature of a Stan function that could evaluate a complicated log-likelihood function and return a scalar:

functions {
  real log_lik(real alpha, vector beta, real theta, vector y, matrix X, matrix Z);
}

although it might need some additional arguments for ordering and whatnot. The question becomes: Can the body of this function be written using Stan syntax only or do we need to write a C++ implementation of it? For something like a Laplace approximation to the integrated likelihood, this could be done using Stan syntax only. The pseudo-code steps would be

  1. Define a function in the Stan language that returns a vector indicating how far the frailties are from the mode where you are doing the Laplace approximation at. This signature would be
vector equations(vector frailties, vector common, real [] xz, int [] indices);

Implementing a function like this is annoying because it can only take “flattened” data. So, when calling it, you have to pass all the elements of the X and Z matrices in the one-dimensional real array xz and then use the one-dimensional integer array indices to reconstruct the X and Z matrices on the inside of the function. And you have to glue all the common parameters together for the second argument and then on the inside separate alpha, beta, and sigma again.
2. Call vector[q] frailties = algebra_solver(equations, common, guess, xz, indices); from the body of your log_lik function to get the modal values of the frailties. This can be sensitive to the initial guess, but a vector of zeros or maybe a vector of BLUPs would be ok.
3. Inside the log_lik function, evaluate your Laplace approximation over all observations to build up a real scalar, called, for example log_Laplace_approx
4. Do return log_Laplace_approx; at the end of the log_lik function

If the log_lik function cannot be written entirely in Stan syntax (or it can but after we get that working we want to speed it up a bit), we could (re)implement it in C++, which is more flexible but means the gradient has to be calculated manually rather than relying on Stan’s autodifferentiation capabilities. There is a vignette about this at
https://cran.r-project.org/package=rstan/vignettes/external.html
that unfortunately chopped of the end of the critical sentence that starts out as “The sinc.hpp file reads as” which should have referred to


Here var is the special Stan type for a scalar that is autodifferentiable but we pull out the plain double precision value underlying it, do some calculations, and then rebuild a var by telling it what the derivatives are. But that is just a toy example with a unary trig function. In your case, a C++ implementation of the log_lik function would need to return a call to the precomputed_gradients function, which basically takes a double for the value of the log-likelihood, a std::vector<var> of parameters the log-likelihood is a function of (e.g. alpha, beta, sigma, etc.), and a std::vector<double> of partial derivatives of the log-likelihood with respect to those parameters. A std::vector in C++ is basically like a variable length array in C.

OK. Writing a function in the Stan language that builds a variance-covariance matrix is certainly possible. Can you point me to an example of something that can’t be done the lme4 way?

OK


#9
A primary question is how to even fit the Cox partial likelihood into the STAN
paradym, since the partial likelihood does not factor into per-subject terms.

OK, basically all I know about Cox models I got from wikipedia

So, just so I’m clear, you are saying that there aren’t any models that are interesting
to you where the /complete data likelihood/ factors?
Not quite that. Saying that I like steak and want to be able to order it is different
than saying I won’t eat anything else. However, I really, really do want to fit Cox PH
models with random effects. 1. The Cox partial likelihood (PL) don’t factor and 2. you
can’t evaluate the PL in STAN code, or at least not rationally. I can write an O(n^2)
algorithm in R code and my test suite does that for some small data sets, but the C code
is O(n).

From a math view think of the Cox model this way: \lamba_i(t) = \exp( \beta_0(t) +
\beta_1 X_{i1} + \beta_2 X_{i2} + \ldots). The intercept \beta_0(t) is a different
constant at each death time, does not need to be computed at all when creating the PL, and
is mostly uninteresting, at least until after the fit is done. Computationally sort of
like \sigma^2 in ordinarly linear regression, that isn’t needed nor computed until the
very end. But because it is there the PL does not factor.

I do realize that lme4 and most other implementations of hierarchical models are
treating the coefficients on |Z| as error structure and are integrating them out of the
complete data likelihood to obtain a likelihood function of |alpha|, |beta|, and |theta|
only, which does not factor. We use non-factorable likelihoods a lot in Stan programs
with things like Gaussian process regressions that integrate out the unknown functions
analytically.

Glad to hear that there is a facility for non-factorable.

therneau:

Secondary will be how to plug the function in (technical) and how to deal with the
setup of creating the order vectors

It sounds like it would be way easier to create the ordering vectors in R and pass them
to the data block of the Stan program and then pass them along to whatever functions are
needed. If that is not true, we could create them inside a Stan program. Here is a
signature of a Stan function that could evaluate a complicated log-likelihood function
and return a scalar:

functions { real log_lik(real alpha, vector beta, real theta, vector y, matrix X,
matrix Z); } |

although it might need some additional arguments for ordering and whatnot. The question
becomes: Can the body of this function be written using Stan syntax only or do we need
to write a C++ implementation of it?

I’ll provide C code. Said code is surprisingly short and simple looking after removing
Hessians and iteration. The edge cases often involve overflow or underflow of sums and
differences, however, and it turns out that the defense against this is to do those
additions and subtractions is just exactly the right order. Since it is so short a C++
translation should be trivial.

The extra ordering args can be pre-calculated. If indeed it can be set up so that all is
pre-compiled and the user does not have to type, the nuisance of adding them to an
argument list is no nuisance at all.

For something like a Laplace approximation to the integrated likelihood, this could be
done using Stan syntax only. The pseudo-code steps would be

The current coxme does a Laplace approx. It works fine for simple problems but often
breaks down in more complex analyses. It took me a while to fully understand the why, and
that told me that I can’t band-aid it. Avoiding it is one of the reasons to explore Stan.

  1. Define a function in the Stan language that returns a vector indicating how far the
    frailties are from the mode where you are doing the Laplace approximation at. This
    signature would be

vector equations(vector frailties, vector common, real [] xz, int [] indices); |

Implementing a function like this is annoying because it can only take “flattened” data.
So, when calling it, you have to pass all the elements of the |X| and |Z| matrices in
the one-dimensional real array |xz| and then use the one-dimensional integer array

indices> to reconstruct the |X| and |Z| matrices on the inside of the function. And you
have to glue all the |common| parameters together for the second argument and then on
the inside separate |alpha|, |beta|, and |sigma| again.

  1. Call |vector[q] frailties = algebra_solver(equations, common, guess, xz, indices);|
    from the body of your |log_lik| function to get the modal values of the frailties. This
    can be sensitive to the initial |guess|, but a vector of zeros or maybe a vector of
    BLUPs would be ok.
  2. Inside the |log_lik| function, evaluate your Laplace approximation over all
    observations to build up a |real| scalar, called, for example |log_Laplace_approx|
  3. Do |return log_Laplace_approx;| at the end of the |log_lik| function

If the |log_lik| function cannot be written entirely in Stan syntax (or it can but after
we get that working we want to speed it up a bit), we could (re)implement it in C++,
which is more flexible but means the gradient has to be calculated manually rather than
relying on Stan’s autodifferentiation capabilities. There is a vignette about this at
https://cran.r-project.org/package=rstan/vignettes/external.html
that unfortunately chopped of the end of the critical sentence that starts out as “The
sinc.hpp file reads as” which should have referred to

I’ll read the vignette. The C code for the first derivative is also straightforward.
Again the trick is to compute the martingale residual vector m in O(n) rather than the
naive algorithm which is O(n^2). Once we have m the first derivative is simply mX. (For
ordinarly least squares the derivative is rX where r is the residuals).

github.com https://github.com/stan-dev/rstan/blob/develop/rstan/rstan/vignettes/sinc.hpp#L5

therneau:

Having my function build the variance matrix for the random effects and then pass it
to Stan is a nice idea. I can see how to map some models to a single Stan structure:
1|id and 1| hospital/center say, but not all the possible random effects.

OK. Writing a function in the Stan language that builds a variance-covariance matrix is
certainly possible. Can you point me to an example of something that can’t be done the
lme4 way?

I was thinking of two models one with a simple effect and one complex. The first one is
(1 | center) say, but the second is (1| floor/hospital) + (1 + years| nurse/unit). I can
see how to write either in the STAN language, but not how I could have a single
precompiled STAN module that both of these map onto, especially if you factor in users
that want different priors for the variances. That does not at all mean that it can’t be
done.


#10

That sounds good. The outline of a a C++ function would be something like

stan::math::var
log_lik(const stan::math::var& alpha,
        const Eigen::Matrix<stan::math::var, Eigen::Dynamic,1>& beta,
        const stan::math::var& theta,
        const Eigen::Matrix<double, Eigen::Dynamic,1>& y,
        const Eigen::Matrix<double, Eigen::Dynamic,Eigen::Dynamic>& X,
        const Eigen::Matrix<double, Eigen::Dynamic,Eigen::Dynamic>& Z, 
        const std::vector<int>& order, // assume this comes from R originally
        std::ostream* pstream__) {     // you can print debug info to pstream__

  // pull out the double precision value from inputs that are originally var
  using stan::math::value_of;
  const double alpha_ = value_of(alpha);
  const Eigen::Matrix<double, Eigen::Dynamic, 1> beta_ = value_of(beta);
  const double theta_ = value_of(theta);

  double log_lik = 0;
  std::vector<double> gradient(2 + beta.rows());

  /* Do complicated stuff to increment log_lik over the data and fill in gradient
  using alpha_, beta_,  theta_ (with underscores) rather than alpha, beta, theta
  The Eigen matrix library makes this pretty easy: http://eigen.tuxfamily.org/dox/
  */

  // glue the parameters together, tell Stan what the gradient of log_lik was
  std::vector<stan::math::var> params(2 + beta.rows());
  params[0] = alpha;
  for (int k = 0; k < beta.rows(); k++) params[k +1] = beta.coeff(k);
  params[1 + beta.rows()] = theta;
  return stan::math::precomputed_gradients(log_lik, params, gradient);
}

Here I have used the convention that a training underscore in a symbol name (i.e. beta_) indicates it is a plain double-precision vector (really Eigen::Matrix with one column) rather than the var vector (i.e. beta) it was derived from.

This is quite possible. You can pass an integer called something like cov_struct from R to the data block of the Stan program that indicates what covariance structure to use and then put a cascade of if, else statements (that acts like a switch statement) in the transformed parameters block like

if (cov_struct == 0) {
  // set up for diagonal
} else if (cov_struct == 1) {
  // set up for AR1
} else {
  // set up for something else
}

This might be easier if you had different log-likelihood functions that take different arguments depending on the details of the model that were specified in R, but again you can call those different log-likelihood functions conditional on the value of cov_struct.


#11

Let’s go back to my original question. Are there documentation resources that I can read to learn what the code below means?

  I don't want to be ungrateful, but the stuff below is essentially strings of random characters.   You keep pulling me into a sidewalk cafe when I want a map of the city.   Something more like the "Writing R Extensions" manual that starts at the top.  Perhaps such resources don't exist? 

  Terry T.

#12

Loosely speaking, in this case we want to write a C++ function (that can be called from a Stan program) to evaluate a log-likelihood that

  1. Takes in all the necessary arguments
  2. Circumvents the usual autodifferentiation mechanism in Stan
  3. Sums up the n contributions to the log-likelihood in a numerically stable way
  4. Calculates the gradient of that log-likelihood value with respect to all of the parameters
  5. When returning, tells Stan what that gradient was for the calculations we just did while circumventing the usual autodiff mechanism.

But no one dives into modeling with Stan like that, which is why I was trying to start with a wikipedia-level Cox model that could be implemented just with Stan syntax. There is a lot of documentation for the Stan Math library and writing a Stan program as a text file. However, all of that is for the 99.99% of the time where the user is not the least bit concerned with derivatives because the autodifferentiation mechanism is handling them.

Maybe let’s start with writing a R function — which can be O(n) or O(n^2) or whatever — that evaluates the log-likelihood you want, and I can explain in terms of R what would be analogous to the C++. Suppose in R we had a S4 class called var

var <- setClass("var", slots = c(value = "numeric", derivatives = "externalptr"))

that is basically just a wrapper around a numeric vector but it also contains an external pointer to somewhere that tells you what the derivative of some (e.g. log-likelihood) function is with respect to this var, evaluated at var@value. We can then construct random instances of a var:

alpha <- new("var", value = rnorm(1)) # but derivatives is uninitialized here
beta <- new("var", value = rnorm(5))
theta <- new("var", value = abs(rnorm(1)))

This is a lot like the var C++ class in the Stan Math library that is used for unknown parameters, which automatically updates the external pointer with another link in the chain rule whenever some function is applied to a var.

Except in cases like yours, we don’t want to use autodifferentiation and instead want to supply analytical gradients ourselves. So, we need a mechanism in R to coerce a var back into a regular double-precision thing

setAs(from = "var", to = "numeric", def = function(from) return(from@value))
as(alpha, "numeric")

that we can then use without triggering any updating of the autodifferentiation mechanism until we are ready to do so.

The outline of a R function that corresponds to the outline of the C++ function I wrote before would be

log_lik <- function(alpha,   # a var scalar
                    beta,    # a var vector
                    theta,   # a var scalar
                    y,       # a numeric vector (or later a Surv)
                    X,       # a dense numeric matrix
                    Z,       # a dense numeric matrix (or later a sparse Matrix)
                    order) { # an integer vector that has to do with how you sum

  # as() in R is basically the same as stan::math::value_of() in C++
  alpha_ <- as(alpha, "numeric")
  beta_ <- as(beta, "numeric")
  theta_ <- as(theta, "numeric")

  # initialize the scalar value for the log-likelihood and a gradient vector
  log_lik <- 0
  gradient <- rep(NA_real_, times = 2 + rows(beta))

  # Do complicated stuff to increment log_lik over the data and fill in gradient
  # using alpha_, beta_, theta_ (with underscores) rather than alpha, beta, theta

  # glue all the parameters together and return a var
  params <- vector("list", length = 2 + rows(beta))
  params[[1]] <- alpha
  for (k in 1:rows(beta)) params[[k + 1]] <- beta[k]
  params[[2 +rows(beta)]] <- theta

  return(precomputed_gradients(log_lik, params, gradient))
}

Here precomputed_gradients is just a utility function implemented elsewhere that looks something like

precomputed_gradients <- function(value,      # numeric scalar
                                  params,     # a list of length K containing vars
                                  gradient) { # a numeric vector of length K

  # manually update an external pointer called xPtr in light of params and gradient

  return(new("var", value = value, derivatives = xPtr))
}

So, the only piece that we are missing is how to “Do complicated stuff to increment log_lik over the data and fill in gradient using alpha_, beta_, and theta_ (with underscores) rather than alpha, beta, theta”. I know it involves y, X, and Z and some order that tells you how to keep numerical stability when you loop over the data, but I don’t know the statistics here. If you could fill that part in for me in R or C, I could map that to C++. Really the C++ for that part is not going to be much different than regular C, except we can utilize the Eigen matrix library (which would be like using the Matrix package in R rather than a plain two-dimensional array).


#13

Ben,

   1. I'll work on packaging up the C routines, with documentation.  Where should I send them? 

   2. These are interesting tutorials, but I'm still waiting for an answer to my question of what formal documetation exists, if any. 

  Terry T.

#14

You can attach a .R or .cpp file to a message on Discourse by clicking the icon above the message window that looks like an arrow pointing upward.

The somewhat formal documentation for how to call an external C++ function from a Stan program is the vignette
https://cran.r-project.org/web/packages/rstan/vignettes/external.html
but it was missing an important embedded file toward the end. So, to summarize, if

functions {
  real sinc(real x); // declared but not defined yet
}
transformed data {
  real sinc_pi = sinc(pi());
}

is a Stan program in your working directory called sinc.stan, and you have a sinc.hpp file in your working directory with these lines to implement it

double
sinc(const double& x, std::ostream* pstream__) {
  return x != 0.0 ? sin(x) / x : 1.0;
}

var
sinc(const var& x, std::ostream* pstream__) {
  double x_ = x.val();
  double f = x_ != 0.0 ? sin(x_) / x_ : 1.0;
  double dfdx_ = x_ != 0.0 ? (cos(x_) - sin(x_)) / x_ : 0.0;
  return var(new precomp_v_vari(f, x.vi_, dfdx_));
}

then calling in R

library(rstan)
sm <- stan_model(file = "sinc.stan", allow_undefined = TRUE,
                 includes = paste0('\n#include "', 
                                   file.path(getwd(), 'sinc.hpp'), '"\n'))

will compile and use that C++ implementation of the sinc function. This handles including all of the other headers automatically.

However, implementing an unary trig function in C++ is much simpler than implementing a log-likelihood function of several parameters whose number is not known until runtime. When the arguments to that C++ involve vectors and matrices, then the best way to implement it is to utilize the Eigen matrix library whose formal documentation is at
http://eigen.tuxfamily.org/dox/
In order to use Eigen, you first need to coerce the unknown parameters (whose scalar type is var) to plain double scalars or vectors using the value_of function in the Stan Math C++ library. At the end, you need to return a call to the precomputed_gradients function in the Stan Math C++ library. I don’t think you need to use any other functions in the Stan Math C++ library, but it is documented at
https://arxiv.org/abs/1509.07164


#15

Ben,

    Thanksgiving vacation and a couple of must do projects have intervened, but I haven't forgotten the thread.   I intend to create a small R package that has the C-code, calls to it, documentation, and tests.   That way you have the entire "mileau" in a well described way.  I should be able to include one of the nasty test cases. 

  The goal of all this is to use the STAN library as a way to explore the likelihood surface (partial likelihood actually), which is one of the things that can be done efficiently with Hamiltonian MCMC, without needing to appeal to posteriors, credible intervals, etc.

  Terry