Speed-up for Mixed Discrete-Continuous Gaussian Copula - Reduce Sum?

Hello Stan Aficionados!

The gist of my request is as follows: how the heck can I speed-up a mixed gaussian copula so that it does not take 2 weeks to fit on an I9 with 14 cores and 32 gb of ram.

Details and Specification:
I am incredibly indebted to @spinkney 's code here as well as discussions with various folks throughout this forum. I have designed a model to learn the underlying dependency structure of a series of continuous and ordered probit traits. As such I adapt the functions from the link above for a Gaussian Copula and also adapt my own function for an ordered probit marginal based on the ordered_probit_lpmf already in Stan math. Note, I have missing data throughout each marginal and this is treated much like the manual suggests by introducing a parameter into the model for each missing case. I paste the full code below and it is heavily annotated for ease.

In terms of my data, I have 54 total dimensions - 18 continuous variables and 36 ordinal variables. The continuous marginals are real numbers greater than 0 with some ranging from 10-20mm, while others ranging from 50-450mm (they are growth metrics, so lengths and breadths). The ordinal data comes from 3 different threshold types with some variables have 3 categories, some with 7, and some with 12. The independent variable x is a real number ranging from 0-20 - it is just an individual’s chronological age.

I fit everything in cmdstanr with the code below:

fit <-stanmodel$sample(data = stan_dat,
                 refresh = 500,
                 chains = 4,
                 parallel_chains = 4,
                 iter_warmup = 2000,
                 iter_sampling = 7000,
                 max_treedepth = 12, 

My problem:
It takes at least 10 days to fit my largest model which is ~1200 rows of data, so 1200 x 54. I have fit each marginal type separately, and done the full model on subsets of 100-300 randomized data points. But, the full thing is killing me! I’ve tried openCL, but saw no efficiency gain. I know there are a few places where speed may be gained based on memory allocation by adjusting loops in some places and possibly vectorizing in others, but I wanted thoughts here from the community. I know the tree depth and adapt_delta parameters make things slow here and maybe suggestions there could help as well - I have noticed this to be quite a finicky model and depending on if I lower or increase the LKJ prior to regularize the strength of the correlations, the model may bog down or may throw ebfmi errors.

I would also ask if there is anyway to speed things up by spreading across more than the 4 cores and more threads such as reduce sum or mat rect, but I’m unsure how to adapt the custom lpdf here to anything of the sort…

Let me know if there any additional parts I can fill in that may help find a solution!

  // There are 2 functions that define a Gaussian Copula in Stan. First, `multi_
  // normal_cholesky_copula_lpdf` and second, `centered_gaussian_copula_choelsky_`
  // These work in tandom to first aggregate all the marginal calculations and 
  // jacobian adjustment and then increment the log-probability (the goal of Stan).
  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U); // total number of observations
    int J = cols(U); // total number of responses (J+K)
    matrix[J, N] G = mdivide_left_tri_low(L, U'); // equivalent to L*inverse(tri(U))
  return -N * sum(log(diagonal(L))) -  0.5 * ( sum( G^2 ) - sum(U^2) ); // log probability
  real centered_gaussian_copula_cholesky_(matrix[,] marginals, matrix L) {
    // Extract dimensions of final outcome matrix
    int N = rows(marginals[1][1]); // total number of observations
    int J = rows(L); // number of responses (J+K)
    matrix[N, J] U; // empty matrix for marginal calculations

    // Iterate through marginal arrays, concatenating the outcome matrices by column
    // and aggregating the log-likelihoods (from continuous marginals) and jacobian
    // adjustments (from discrete marginals)
    real adj = 0;
    int pos = 1;
    for (m in 1 : size(marginals)) {
      int curr_cols = cols(marginals[m][1]);

      U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];

      adj += sum(marginals[m][2]); //jacobian adjustments
      pos += curr_cols;

    // Return the sum of the log-probability for copula outcomes and jacobian adjustments
    return multi_normal_cholesky_copula_lpdf(U[ : ,] | L) + adj;
  // There are 2 functions that define each univariate marginal distribution depending 
  // on data type: polytomous ordinal or continuous. I have adapted a second
  // The ordinal data follows the data augmentation approach highlighted in Albert
  // & Chib, 1993. The continuous data is centered for efficiency. Note, I also
  // also include a slightly augmented version of the polytomous function for
  // use with carpals and tarsals.
  array[] matrix ordered_probit_marginal(array[,] int y, matrix mu_glm, matrix u_raw, array[] vector cutpoints) {
    int N = rows(mu_glm); // number of observations
    int K = cols(mu_glm); // number of polytomous variables
    array[2] matrix[N, K] rtn; // empty 2D array to return

    for(k in 1:K){
      for(n in 1:N){
        int C = num_elements(cutpoints[k,]) + 1; // total number of ord categories
        if(y[n,k] == 99){ // missing data
          rtn[1][n,k] = inv_Phi(u_raw[n,k]); // missing RV
          rtn[2][n,k] = 0;
        } else if(y[n,k] == 1){ // lowest bound
          real bound = Phi((cutpoints[k,1] - mu_glm[n,k])); // data augmentation
          rtn[1][n,k] = inv_Phi((bound*u_raw[n,k])); // latent RV
          rtn[2][n,k] = log(bound); // jacobian
        } else if (y[n,k] == C){ // highest bound
          real bound = Phi((cutpoints[k, C - 1] - mu_glm[n,k])); // data augmentation
          rtn[1][n,k] = inv_Phi(bound + (1-bound)*u_raw[n,k]); // latent RV
          rtn[2][n,k] = log1m(bound); // jacobian
        } else { // in between 
          real ub = Phi((cutpoints[k ,y[n,k]] - mu_glm[n,k])); // data augmentation
          real lb = Phi((cutpoints[k, y[n,k] - 1] - mu_glm[n,k])); // data augmentation
          rtn[1][n,k] = inv_Phi((lb + (ub-lb)*u_raw[n,k])); // latent RV
          rtn[2][n,k] = log(ub-lb); // jacobian
  return rtn;
  array[] matrix ordered_probit_marginal_uni(int[] y, real[] mu_glm, real[] u_raw, vector cutpoints) {
    int N = num_elements(mu_glm); // number of observations
    array[2] matrix[N, 1] rtn; // empty 2D array to return
    for(n in 1:N){
      int C = num_elements(cutpoints) + 1; // total number of ord categories
      if(y[n] == 99){ // missing data
        rtn[1][n,1] = inv_Phi(u_raw[n]); // missing RV
        rtn[2][n,1] = 0;
        } else if(y[n] == 1){ // lowest bound
        real bound = Phi((cutpoints[1] - mu_glm[n])); // data augmentation
        rtn[1][n,1] = inv_Phi((bound*u_raw[n])); // latent RV
        rtn[2][n,1] = log(bound); // jacobian
      } else if (y[n] == C){ // highest bound
        real bound = Phi((cutpoints[C - 1] - mu_glm[n])); // data augmentation
        rtn[1][n,1] = inv_Phi(bound + (1-bound)*u_raw[n]); // latent RV
        rtn[2][n,1] = log1m(bound); // jacobian
      } else { // in between 
        real ub = Phi((cutpoints[y[n]] - mu_glm[n])); // data augmentation
        real lb = Phi((cutpoints[y[n] - 1] - mu_glm[n])); // data augmentation
        rtn[1][n,1] = inv_Phi((lb + (ub-lb)*u_raw[n])); // latent RV
        rtn[2][n,1] = log(ub-lb); // jacobian
    return rtn;

  matrix[] normal_marginal(matrix y, matrix mu_glm, matrix sigma) {
    int N = rows(mu_glm); // number of observations
    int J = cols(mu_glm); // number of continuous variables
    matrix[N, J] rtn[2]; // empty 2D array to return
    // Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
    rtn[2] = rep_matrix(0, N, J);

    for (j in 1 : J) {
      rtn[1][ : , j] = (y[ : , j] - mu_glm[ : , j]) ./ sigma[ :,j]; // center RV
      rtn[2][1, j] = normal_lpdf(y[ : , j] | mu_glm[ : , j], sigma[ :,j]); // "jacobian"
  return rtn;

  // size variables //
  int N; // total # of observations (rows)
  int J; // total # of continuous observations
  int K_dentition; // total # of dental variables
  int K_ef; // total # of ef variables 
  int K_pelvis; // total # of pelvic variables
  int K_all; // all ordinal variables
  //continuous prep //
  int present; // total # of complete data in continuous responses
  int missing; // total # of missing data in continuous responses
  real y_continuous[present]; // vector of reals of responses across all vars
  int index_pres[present, 2]; // Matrix (row, col) of non-missing value indices
  int index_miss[missing, 2]; // Matrix (row, col) of missing value indices
  // polytomous prep //
  int y_dentition[N,K_dentition]; // 2D array of integers n obs by k_dentition
  int y_ef[N,K_ef]; // 2D array of integers n obs by k_ef
  int y_pelvis[N,K_pelvis]; // 2D array of integers n obs by k_pelvis
  int y_carpal[N]; // 2D array of integers n obs by k_carpal
  int y_tarsal[N]; // 2D array of integers n obs by k_tarsal
  // predictor (age) //
  real X[N]; //scaled

  // continuous parameters //
  real y_missing_cont[missing]; // missing data parameter the size of all missing vars
  vector<lower=0>[J] constant; //constant in continuous mean function
  vector<lower=0>[J] exponent; // exponent in continuous mean function
  vector[J] offset; // offset in continuous mean function
  vector<lower=0>[J] noise1; // param of linear noise function
  vector<lower=0>[J] noise2; // param of linear noise function

  // Polytomous parameters //
  vector<lower=0>[K_dentition] constant_dentition;
  vector<lower=0>[K_ef] constant_ef;
  vector<lower=0>[K_pelvis] constant_pelvis;
  real<lower=0> constant_carpal;
  real<lower=0> constant_tarsal;
  matrix<lower=0, upper=1>[N,K_dentition] u_dentition; // nuisance dentition
  matrix<lower=0, upper=1>[N,K_ef] u_ef; // nuisance ef
  matrix<lower=0, upper=1>[N,K_pelvis] u_pelvis; // nuisance pelvis
  real<lower=0, upper=1> u_carpal[N]; // nuisance carpal
  real<lower=0, upper=1> u_tarsal[N]; // nuisance tarsal
  ordered[11] threshold_dentition[16]; // dental thresholds
  ordered[6] threshold_ef[16]; // ef thresholds
  ordered[2] threshold_pelvis[2]; // pelvis thresholds
  ordered[8] threshold_carpal; // carpal thresholds
  ordered[7] threshold_tarsal; // tarsal thresholds
  // Cholesky factor (copula parameter) //
  cholesky_factor_corr[J+K_all] L;
transformed parameters{
  // Parameter Declarations //
  matrix[N,J] mu_continuous; // continuous mean
  matrix[N,J] sd_continuous; // continuous sd
  matrix[N,J] y_full_continuous; // full y matrix including missing data parameters
  matrix[N,K_dentition] mu_dentition; // dental mean
  matrix[N,K_ef] mu_ef; // ef mean
  matrix[N,K_pelvis] mu_pelvis; // pelvis mean
  real mu_carpal[N]; // carpal mean
  real mu_tarsal[N]; // tarsal mean

  // Continuous Parameters //
  for(i in 1:N){
    for(j in 1:J){
      mu_continuous[i,j] = constant[j]*X[i]^exponent[j] + offset[j];
      sd_continuous[i,j] = noise1[j]*(1 + X[i]*noise2[j]);
  // Fill y_full continuous with present data
  for(n in 1:present) {
      y_full_continuous[index_pres[n,1]][index_pres[n,2]] = y_continuous[n];
  // Fill the rest of y_full continuous with missing value "parameters"
  for(n in 1:missing){
      y_full_continuous[index_miss[n,1]][index_miss[n,2]] = y_missing_cont[n];

 // Polytomous Parameters //
 for(i in 1:N){
   for(k in 1:K_dentition){
     mu_dentition[i,k] = constant_dentition[k]*X[i];
 for(i in 1:N){
   for(k in 1:K_ef){
     mu_ef[i,k] = constant_ef[k]*X[i];
 for(i in 1:N){
   for(k in 1:K_pelvis){
     mu_pelvis[i,k] = constant_pelvis[k]*X[i];
 for(i in 1:N){
     mu_carpal[i] = constant_carpal*X[i];
 for(i in 1:N){
     mu_tarsal[i] = constant_tarsal*X[i];
  // Priors // 
  constant ~ normal(0,1);
  exponent ~ normal(0,1);
  offset ~ normal(0,1);
  noise1 ~ normal(0,1);
  noise2 ~ normal(0,1);
  constant_dentition ~ normal(0,1);
  constant_ef ~ normal(0,1);
  constant_pelvis ~ normal(0,1);
  constant_carpal ~ normal(0,1);
  constant_tarsal ~ normal(0,1);

  // Threshold Priors //
  for(i in 1:K_dentition){
    for(t in 1:11){
      threshold_dentition[i,t] ~ normal(t+1,1);
  for(i in 1:K_ef){
    for(t in 1:6){
      threshold_ef[i,t] ~ normal(t+1,1);
  for(i in 1:K_pelvis){
    for(t in 1:2){
      threshold_pelvis[i,t] ~ normal(t+1,1);
    for(t in 1:8){
      threshold_carpal[t] ~ normal(t+1,1);
    for(t in 1:7){
      threshold_tarsal[t] ~ normal(t+1,1);
  for (n in 1:missing){
    y_missing_cont[n] ~ normal(mu_continuous[index_miss[n,1]][index_miss[n,2]],sd_continuous[index_miss[n,1]][index_miss[n,2]]);

  // Copula Parameter Prior //
  L ~ lkj_corr_cholesky(6);
  // Likelihood //
  target += centered_gaussian_copula_cholesky_(
    {normal_marginal(y_full_continuous, mu_continuous, sd_continuous),
    ordered_probit_marginal(y_dentition, mu_dentition, u_dentition, threshold_dentition),
    ordered_probit_marginal(y_ef, mu_ef, u_ef, threshold_ef),
    ordered_probit_marginal(y_pelvis, mu_pelvis, u_pelvis, threshold_pelvis),
    ordered_probit_marginal_uni(y_carpal, mu_carpal, u_carpal, threshold_carpal),
    ordered_probit_marginal_uni(y_tarsal, mu_tarsal, u_tarsal, threshold_tarsal)}, L);
generated quantities{
  // Here I put the correlation matrix back together for ease of interpretation
  corr_matrix[J+K_all] cor_mat = multiply_lower_tri_self_transpose(L);

Glad it’s helping! Some of the speed may be less the code and more numerical struggles. One thing to try is calculating on the log scale using ‘inv_Phi_log’ function. Replace all the ‘normal(0, 1)’ with ‘std_normal()’.

One thing I’m not sure about is how the compiler handles all those different 1:N loops. @WardBrian does the compiler merge those together? I always try to pack same index loops into one instead of many.

In the normal_marginal code you may be able to vectorize that J loop. Also in the model block I think you can avoid the J loop . See if you can avoid the ‘rep_matrix’ when you call ‘rtn[2]’.

Thanks Sean!

I’ll get to work vectorizing the j-loop - that has been something I’ve been meaning to do. Perhaps a dumb question - is it possible in Stan to do elementwise addition and subtraction? I know we can do products, quotients, and powers. Trying to figure out the best way to vectorize.

Related to inv_phi_log is it std_normal_log_qf or inv_phi_log? Regardless, will I have to put the nuisance parameter u and the bound parameters (bound, lb, ub) on the log scale too before using inv_phi_log? Trying to figure out the best way to rewrite what I have… And lastly, because Stan returns the log probability, can rtn[1] be left on the log scale, or do I need to exp(inv_phi_log(...))?

Stanc3 does not. It’s possible clang/gcc would, but it requires some sophisticated analysis to merge loops together. Whether you will get better performance out of merging loop together or not depends a lot on what operation you’re doing inside them; if the compiler can autovectorize and your problem fits in cache, combining two successive loops together might actually make things much slower by breaking this. This is the kind of thing where I’d avoid any overly general advice and just try it and benchmark both ways.

1 Like

The irony!

I ran a test combining that loop in the Transformed Parameters block to assign all of the means and sd’s with one pass of N instead of re-running for every data type (i.e., a loop for dentition, a loop for ef, etc.) and it did in fact slow things down. Not by a huge margin - ~ an hour or so - but enough to at least suggest the efficiency gains would be negligible at best at least in the loop construction.

I feel as though much of the major problems here in terms of real efficiency gain (efficiency being purely a metric of time here and not iterations) lie in the multi_normal_cholesky_copula_function and the J x N mdivide_left_tri_low and the subsequent return function…


You can try this as the inverse is just done on a J x J instead of J x N.

real gaussian_copula_cholesky_lpdf(
    matrix U, matrix L
  ) {
    int N                  = rows(U);
    int J                  = cols(U);
    matrix[N,J] Q          = inv_Phi(U);
    matrix[J,J] Gammainv   = crossprod(mdivide_left_tri_low(L, diag_matrix(rep_vector(1.0, J))));
    return -0.5 * ( N * sum(log(diagonal(L))) + sum( add_diag(Gammainv, -1.0) .* crossprod(Q) ) ) ;

What about this:

  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
    int N = rows(U); // total number of observations
    int J = cols(U); // total number of responses (J+K)
    matrix[J, J] Gammainv = chol2inv(L);
  return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));

This is the function on you github. The one you shared has initialization problems related to inv_Phi(U). Doesn’t it expect the data to be on the interval [0,1]?

Yes, just depends on the input. I have like 10 versions of the multi normal copula, lol.

Mind if I ask the difference numerically?

I believe when I started this long ago I used the matrix[J, J] Gammainv = chol2inv(L); and then a suggestion was to try matrix[J, N] G = mdivide_left_tri_low(L, U');. I believe there is a slight slowdown when using mdivide_left_tri_low as compared to chol2inv (like 30 minutes at best). I am running the same model again to test this out. But, I believe the results are identical. In both cases U and L are the same - a matrix of size N x J and and the cholseky factors of the corr matrix. That term is purely related to the middle sum() statement in the return of the function - so wasn’t sure what the difference was or if one was preferred over the other?

Just an update for you:

Running a 6-variable model (2 continuous, 4 ordinal) with the exact code you have on gihub for the multi_normal_cholesky_function works great. This is using the chol2inv function I described in my last post with a J x J matrix.

Alternatively, if I run the same model with the code in my initial post using mdivide_left_tri_low and a J x N matrix, the model takes ~1 hour longer to run (the first takes 10 hours, this one takes 11 hours) AND there are issues related to E-BFMI. The specifications are identical - same seed, inits, treedepth, and adapt_delta, 2 chains.

This is interesting…

I have had a lot of difficulty with fitting multivariate normal models in Stan with a factor structure (or really in any case where there is a large amount of correlation between some of the variables), particularly as the number of variables increases.

As far as I can tell, the problem is associated with correlation in the estimated correlation matrix parameters. That’s a little confusing, but what I mean is if you have some correlation matrix C, then when you estimate parameters C[m, n] and C[p, q] with Stan, they will have some correlation. Moreover, since it is a correlation matrix (or a Cholesky matrix, or what have you), there are restrictions that are imposed on the values to ensure that it is a valid correlation matrix (e.g. can’t be bigger than or less than zero along with other restrictions on the relations between variables). This will induce some kind of non-linear correlation between the variables, particularly when the average C[m, n] increases to 1.

In the multivariate normal case, I can instead represent it as a series of regressions, which is kind of like allowing each part of the Cholesky matrix to be fit separately and without having all the constraints that a Cholesky matrix uses. This approach doesn’t work in your case because you don’t have normal marginals. However, the same underlying principle is there since you could be running into these problems with correlated data.

In the past, I suggested a C-vine copula approach as an alternative, but I never got around to writing it up. The C-vine is easier to understand visually for me (e.g. Vine copula - Wikipedia). The first layer is all the correlations with variable 1, the second layer is all the partial correlations with variable 2 conditional on variable 1, and the Nth layer is the partial correlations with variable N conditional on variables 1, … ,N-1. If you’re assuming the dependence is Gaussian for all variables, then it really simplifies things. You can instead model the partial correlations directly and then build up to a correlation matrix. Since each partial correlation layer is conditional on the other variables, you are able to reduce correlations between the correlation parameters (if that makes sense).

4 posts were split to a new topic: Vine examples in Stan