# Speed-up for Mixed Discrete-Continuous Gaussian Copula - Reduce Sum?

The gist of my request is as follows: how the heck can I speed-up a mixed gaussian copula so that it does not take 2 weeks to fit on an I9 with 14 cores and 32 gb of ram.

Details and Specification:
I am incredibly indebted to @spinkney 's code here as well as discussions with various folks throughout this forum. I have designed a model to learn the underlying dependency structure of a series of continuous and ordered probit traits. As such I adapt the functions from the link above for a Gaussian Copula and also adapt my own function for an ordered probit marginal based on the ordered_probit_lpmf already in Stan math. Note, I have missing data throughout each marginal and this is treated much like the manual suggests by introducing a parameter into the model for each missing case. I paste the full code below and it is heavily annotated for ease.

In terms of my data, I have 54 total dimensions - 18 continuous variables and 36 ordinal variables. The continuous marginals are real numbers greater than 0 with some ranging from 10-20mm, while others ranging from 50-450mm (they are growth metrics, so lengths and breadths). The ordinal data comes from 3 different threshold types with some variables have 3 categories, some with 7, and some with 12. The independent variable x is a real number ranging from 0-20 - it is just an individual’s chronological age.

I fit everything in cmdstanr with the code below:

``````fit <-stanmodel\$sample(data = stan_dat,
refresh = 500,
chains = 4,
parallel_chains = 4,
iter_warmup = 2000,
iter_sampling = 7000,
max_treedepth = 12,
``````

My problem:
It takes at least 10 days to fit my largest model which is ~1200 rows of data, so 1200 x 54. I have fit each marginal type separately, and done the full model on subsets of 100-300 randomized data points. But, the full thing is killing me! I’ve tried openCL, but saw no efficiency gain. I know there are a few places where speed may be gained based on memory allocation by adjusting loops in some places and possibly vectorizing in others, but I wanted thoughts here from the community. I know the tree depth and adapt_delta parameters make things slow here and maybe suggestions there could help as well - I have noticed this to be quite a finicky model and depending on if I lower or increase the LKJ prior to regularize the strength of the correlations, the model may bog down or may throw ebfmi errors.

I would also ask if there is anyway to speed things up by spreading across more than the 4 cores and more threads such as reduce sum or mat rect, but I’m unsure how to adapt the custom lpdf here to anything of the sort…

Let me know if there any additional parts I can fill in that may help find a solution!

``````functions{

// There are 2 functions that define a Gaussian Copula in Stan. First, `multi_
// normal_cholesky_copula_lpdf` and second, `centered_gaussian_copula_choelsky_`
// These work in tandom to first aggregate all the marginal calculations and
// jacobian adjustment and then increment the log-probability (the goal of Stan).

real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
int N = rows(U); // total number of observations
int J = cols(U); // total number of responses (J+K)
matrix[J, N] G = mdivide_left_tri_low(L, U'); // equivalent to L*inverse(tri(U))
return -N * sum(log(diagonal(L))) -  0.5 * ( sum( G^2 ) - sum(U^2) ); // log probability
}

real centered_gaussian_copula_cholesky_(matrix[,] marginals, matrix L) {
// Extract dimensions of final outcome matrix
int N = rows(marginals); // total number of observations
int J = rows(L); // number of responses (J+K)
matrix[N, J] U; // empty matrix for marginal calculations

// Iterate through marginal arrays, concatenating the outcome matrices by column
// and aggregating the log-likelihoods (from continuous marginals) and jacobian
int pos = 1;
for (m in 1 : size(marginals)) {
int curr_cols = cols(marginals[m]);

U[ : , pos : (pos + curr_cols - 1)] = marginals[m];

pos += curr_cols;
}

// Return the sum of the log-probability for copula outcomes and jacobian adjustments
return multi_normal_cholesky_copula_lpdf(U[ : ,] | L) + adj;
}

// There are 2 functions that define each univariate marginal distribution depending
// on data type: polytomous ordinal or continuous. I have adapted a second
// The ordinal data follows the data augmentation approach highlighted in Albert
// & Chib, 1993. The continuous data is centered for efficiency. Note, I also
// also include a slightly augmented version of the polytomous function for
// use with carpals and tarsals.

array[] matrix ordered_probit_marginal(array[,] int y, matrix mu_glm, matrix u_raw, array[] vector cutpoints) {
int N = rows(mu_glm); // number of observations
int K = cols(mu_glm); // number of polytomous variables
array matrix[N, K] rtn; // empty 2D array to return

for(k in 1:K){
for(n in 1:N){
int C = num_elements(cutpoints[k,]) + 1; // total number of ord categories
if(y[n,k] == 99){ // missing data
rtn[n,k] = inv_Phi(u_raw[n,k]); // missing RV
rtn[n,k] = 0;
} else if(y[n,k] == 1){ // lowest bound
real bound = Phi((cutpoints[k,1] - mu_glm[n,k])); // data augmentation
rtn[n,k] = inv_Phi((bound*u_raw[n,k])); // latent RV
rtn[n,k] = log(bound); // jacobian
} else if (y[n,k] == C){ // highest bound
real bound = Phi((cutpoints[k, C - 1] - mu_glm[n,k])); // data augmentation
rtn[n,k] = inv_Phi(bound + (1-bound)*u_raw[n,k]); // latent RV
rtn[n,k] = log1m(bound); // jacobian
} else { // in between
real ub = Phi((cutpoints[k ,y[n,k]] - mu_glm[n,k])); // data augmentation
real lb = Phi((cutpoints[k, y[n,k] - 1] - mu_glm[n,k])); // data augmentation
rtn[n,k] = inv_Phi((lb + (ub-lb)*u_raw[n,k])); // latent RV
rtn[n,k] = log(ub-lb); // jacobian
}
}
}
return rtn;
}

array[] matrix ordered_probit_marginal_uni(int[] y, real[] mu_glm, real[] u_raw, vector cutpoints) {
int N = num_elements(mu_glm); // number of observations
array matrix[N, 1] rtn; // empty 2D array to return

for(n in 1:N){
int C = num_elements(cutpoints) + 1; // total number of ord categories
if(y[n] == 99){ // missing data
rtn[n,1] = inv_Phi(u_raw[n]); // missing RV
rtn[n,1] = 0;
} else if(y[n] == 1){ // lowest bound
real bound = Phi((cutpoints - mu_glm[n])); // data augmentation
rtn[n,1] = inv_Phi((bound*u_raw[n])); // latent RV
rtn[n,1] = log(bound); // jacobian
} else if (y[n] == C){ // highest bound
real bound = Phi((cutpoints[C - 1] - mu_glm[n])); // data augmentation
rtn[n,1] = inv_Phi(bound + (1-bound)*u_raw[n]); // latent RV
rtn[n,1] = log1m(bound); // jacobian
} else { // in between
real ub = Phi((cutpoints[y[n]] - mu_glm[n])); // data augmentation
real lb = Phi((cutpoints[y[n] - 1] - mu_glm[n])); // data augmentation
rtn[n,1] = inv_Phi((lb + (ub-lb)*u_raw[n])); // latent RV
rtn[n,1] = log(ub-lb); // jacobian
}
}
return rtn;
}

matrix[] normal_marginal(matrix y, matrix mu_glm, matrix sigma) {
int N = rows(mu_glm); // number of observations
int J = cols(mu_glm); // number of continuous variables
matrix[N, J] rtn; // empty 2D array to return
// Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
rtn = rep_matrix(0, N, J);

for (j in 1 : J) {
rtn[ : , j] = (y[ : , j] - mu_glm[ : , j]) ./ sigma[ :,j]; // center RV
rtn[1, j] = normal_lpdf(y[ : , j] | mu_glm[ : , j], sigma[ :,j]); // "jacobian"
}
return rtn;
}

}
data{

// size variables //
int N; // total # of observations (rows)
int J; // total # of continuous observations
int K_dentition; // total # of dental variables
int K_ef; // total # of ef variables
int K_pelvis; // total # of pelvic variables
int K_all; // all ordinal variables

//continuous prep //
int present; // total # of complete data in continuous responses
int missing; // total # of missing data in continuous responses
real y_continuous[present]; // vector of reals of responses across all vars
int index_pres[present, 2]; // Matrix (row, col) of non-missing value indices
int index_miss[missing, 2]; // Matrix (row, col) of missing value indices

// polytomous prep //
int y_dentition[N,K_dentition]; // 2D array of integers n obs by k_dentition
int y_ef[N,K_ef]; // 2D array of integers n obs by k_ef
int y_pelvis[N,K_pelvis]; // 2D array of integers n obs by k_pelvis
int y_carpal[N]; // 2D array of integers n obs by k_carpal
int y_tarsal[N]; // 2D array of integers n obs by k_tarsal

// predictor (age) //
real X[N]; //scaled

}
parameters{

// continuous parameters //
real y_missing_cont[missing]; // missing data parameter the size of all missing vars
vector<lower=0>[J] constant; //constant in continuous mean function
vector<lower=0>[J] exponent; // exponent in continuous mean function
vector[J] offset; // offset in continuous mean function
vector<lower=0>[J] noise1; // param of linear noise function
vector<lower=0>[J] noise2; // param of linear noise function

// Polytomous parameters //
vector<lower=0>[K_dentition] constant_dentition;
vector<lower=0>[K_ef] constant_ef;
vector<lower=0>[K_pelvis] constant_pelvis;
real<lower=0> constant_carpal;
real<lower=0> constant_tarsal;
matrix<lower=0, upper=1>[N,K_dentition] u_dentition; // nuisance dentition
matrix<lower=0, upper=1>[N,K_ef] u_ef; // nuisance ef
matrix<lower=0, upper=1>[N,K_pelvis] u_pelvis; // nuisance pelvis
real<lower=0, upper=1> u_carpal[N]; // nuisance carpal
real<lower=0, upper=1> u_tarsal[N]; // nuisance tarsal
ordered threshold_dentition; // dental thresholds
ordered threshold_ef; // ef thresholds
ordered threshold_pelvis; // pelvis thresholds
ordered threshold_carpal; // carpal thresholds
ordered threshold_tarsal; // tarsal thresholds

// Cholesky factor (copula parameter) //
cholesky_factor_corr[J+K_all] L;

}
transformed parameters{

// Parameter Declarations //
matrix[N,J] mu_continuous; // continuous mean
matrix[N,J] sd_continuous; // continuous sd
matrix[N,J] y_full_continuous; // full y matrix including missing data parameters
matrix[N,K_dentition] mu_dentition; // dental mean
matrix[N,K_ef] mu_ef; // ef mean
matrix[N,K_pelvis] mu_pelvis; // pelvis mean
real mu_carpal[N]; // carpal mean
real mu_tarsal[N]; // tarsal mean

// Continuous Parameters //
for(i in 1:N){
for(j in 1:J){
mu_continuous[i,j] = constant[j]*X[i]^exponent[j] + offset[j];
sd_continuous[i,j] = noise1[j]*(1 + X[i]*noise2[j]);
}
}
// Fill y_full continuous with present data
for(n in 1:present) {
y_full_continuous[index_pres[n,1]][index_pres[n,2]] = y_continuous[n];
}
// Fill the rest of y_full continuous with missing value "parameters"
for(n in 1:missing){
y_full_continuous[index_miss[n,1]][index_miss[n,2]] = y_missing_cont[n];
}

// Polytomous Parameters //
for(i in 1:N){
for(k in 1:K_dentition){
mu_dentition[i,k] = constant_dentition[k]*X[i];
}
}
for(i in 1:N){
for(k in 1:K_ef){
mu_ef[i,k] = constant_ef[k]*X[i];
}
}
for(i in 1:N){
for(k in 1:K_pelvis){
mu_pelvis[i,k] = constant_pelvis[k]*X[i];
}
}
for(i in 1:N){
mu_carpal[i] = constant_carpal*X[i];
}
for(i in 1:N){
mu_tarsal[i] = constant_tarsal*X[i];
}

}
model{

// Priors //
constant ~ normal(0,1);
exponent ~ normal(0,1);
offset ~ normal(0,1);
noise1 ~ normal(0,1);
noise2 ~ normal(0,1);
constant_dentition ~ normal(0,1);
constant_ef ~ normal(0,1);
constant_pelvis ~ normal(0,1);
constant_carpal ~ normal(0,1);
constant_tarsal ~ normal(0,1);

// Threshold Priors //
for(i in 1:K_dentition){
for(t in 1:11){
threshold_dentition[i,t] ~ normal(t+1,1);
}
}
for(i in 1:K_ef){
for(t in 1:6){
threshold_ef[i,t] ~ normal(t+1,1);
}
}
for(i in 1:K_pelvis){
for(t in 1:2){
threshold_pelvis[i,t] ~ normal(t+1,1);
}
}

for(t in 1:8){
threshold_carpal[t] ~ normal(t+1,1);
}

for(t in 1:7){
threshold_tarsal[t] ~ normal(t+1,1);
}

for (n in 1:missing){
y_missing_cont[n] ~ normal(mu_continuous[index_miss[n,1]][index_miss[n,2]],sd_continuous[index_miss[n,1]][index_miss[n,2]]);
}

// Copula Parameter Prior //
L ~ lkj_corr_cholesky(6);

// Likelihood //
target += centered_gaussian_copula_cholesky_(
{normal_marginal(y_full_continuous, mu_continuous, sd_continuous),
ordered_probit_marginal(y_dentition, mu_dentition, u_dentition, threshold_dentition),
ordered_probit_marginal(y_ef, mu_ef, u_ef, threshold_ef),
ordered_probit_marginal(y_pelvis, mu_pelvis, u_pelvis, threshold_pelvis),
ordered_probit_marginal_uni(y_carpal, mu_carpal, u_carpal, threshold_carpal),
ordered_probit_marginal_uni(y_tarsal, mu_tarsal, u_tarsal, threshold_tarsal)}, L);

}
generated quantities{

// Here I put the correlation matrix back together for ease of interpretation
corr_matrix[J+K_all] cor_mat = multiply_lower_tri_self_transpose(L);
}

``````

Glad it’s helping! Some of the speed may be less the code and more numerical struggles. One thing to try is calculating on the log scale using ‘inv_Phi_log’ function. Replace all the ‘normal(0, 1)’ with ‘std_normal()’.

One thing I’m not sure about is how the compiler handles all those different 1:N loops. @WardBrian does the compiler merge those together? I always try to pack same index loops into one instead of many.

In the normal_marginal code you may be able to vectorize that J loop. Also in the model block I think you can avoid the J loop . See if you can avoid the ‘rep_matrix’ when you call ‘rtn’.

Thanks Sean!

I’ll get to work vectorizing the j-loop - that has been something I’ve been meaning to do. Perhaps a dumb question - is it possible in Stan to do elementwise addition and subtraction? I know we can do products, quotients, and powers. Trying to figure out the best way to vectorize.

Related to inv_phi_log is it `std_normal_log_qf` or `inv_phi_log`? Regardless, will I have to put the nuisance parameter `u` and the bound parameters (`bound`, `lb`, `ub`) on the log scale too before using inv_phi_log? Trying to figure out the best way to rewrite what I have… And lastly, because Stan returns the log probability, can rtn be left on the log scale, or do I need to `exp(inv_phi_log(...))`?

Stanc3 does not. It’s possible clang/gcc would, but it requires some sophisticated analysis to merge loops together. Whether you will get better performance out of merging loop together or not depends a lot on what operation you’re doing inside them; if the compiler can autovectorize and your problem fits in cache, combining two successive loops together might actually make things much slower by breaking this. This is the kind of thing where I’d avoid any overly general advice and just try it and benchmark both ways.

1 Like

The irony!

I ran a test combining that loop in the Transformed Parameters block to assign all of the means and sd’s with one pass of N instead of re-running for every data type (i.e., a loop for dentition, a loop for ef, etc.) and it did in fact slow things down. Not by a huge margin - ~ an hour or so - but enough to at least suggest the efficiency gains would be negligible at best at least in the loop construction.

I feel as though much of the major problems here in terms of real efficiency gain (efficiency being purely a metric of time here and not iterations) lie in the `multi_normal_cholesky_copula_function` and the J x N `mdivide_left_tri_low` and the subsequent `return` function…

2 Likes

You can try this as the inverse is just done on a J x J instead of J x N.

``````real gaussian_copula_cholesky_lpdf(
matrix U, matrix L
) {
int N                  = rows(U);
int J                  = cols(U);
matrix[N,J] Q          = inv_Phi(U);
matrix[J,J] Gammainv   = crossprod(mdivide_left_tri_low(L, diag_matrix(rep_vector(1.0, J))));
return -0.5 * ( N * sum(log(diagonal(L))) + sum( add_diag(Gammainv, -1.0) .* crossprod(Q) ) ) ;
}
``````

``````  real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
int N = rows(U); // total number of observations
int J = cols(U); // total number of responses (J+K)
matrix[J, J] Gammainv = chol2inv(L);
return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
}
``````

This is the function on you github. The one you shared has initialization problems related to `inv_Phi(U)`. Doesn’t it expect the data to be on the interval [0,1]?

Yes, just depends on the input. I have like 10 versions of the multi normal copula, lol.

Mind if I ask the difference numerically?

I believe when I started this long ago I used the ` matrix[J, J] Gammainv = chol2inv(L);` and then a suggestion was to try `matrix[J, N] G = mdivide_left_tri_low(L, U');`. I believe there is a slight slowdown when using `mdivide_left_tri_low` as compared to `chol2inv` (like 30 minutes at best). I am running the same model again to test this out. But, I believe the results are identical. In both cases `U` and `L` are the same - a matrix of size N x J and and the cholseky factors of the corr matrix. That term is purely related to the middle `sum()` statement in the `return` of the function - so wasn’t sure what the difference was or if one was preferred over the other?

Just an update for you:

Running a 6-variable model (2 continuous, 4 ordinal) with the exact code you have on gihub for the multi_normal_cholesky_function works great. This is using the `chol2inv` function I described in my last post with a J x J matrix.

Alternatively, if I run the same model with the code in my initial post using `mdivide_left_tri_low` and a J x N matrix, the model takes ~1 hour longer to run (the first takes 10 hours, this one takes 11 hours) AND there are issues related to E-BFMI. The specifications are identical - same seed, inits, treedepth, and adapt_delta, 2 chains.

This is interesting…

I have had a lot of difficulty with fitting multivariate normal models in Stan with a factor structure (or really in any case where there is a large amount of correlation between some of the variables), particularly as the number of variables increases.

As far as I can tell, the problem is associated with correlation in the estimated correlation matrix parameters. That’s a little confusing, but what I mean is if you have some correlation matrix C, then when you estimate parameters C[m, n] and C[p, q] with Stan, they will have some correlation. Moreover, since it is a correlation matrix (or a Cholesky matrix, or what have you), there are restrictions that are imposed on the values to ensure that it is a valid correlation matrix (e.g. can’t be bigger than or less than zero along with other restrictions on the relations between variables). This will induce some kind of non-linear correlation between the variables, particularly when the average C[m, n] increases to 1.

In the multivariate normal case, I can instead represent it as a series of regressions, which is kind of like allowing each part of the Cholesky matrix to be fit separately and without having all the constraints that a Cholesky matrix uses. This approach doesn’t work in your case because you don’t have normal marginals. However, the same underlying principle is there since you could be running into these problems with correlated data.

In the past, I suggested a C-vine copula approach as an alternative, but I never got around to writing it up. The C-vine is easier to understand visually for me (e.g. Vine copula - Wikipedia). The first layer is all the correlations with variable 1, the second layer is all the partial correlations with variable 2 conditional on variable 1, and the Nth layer is the partial correlations with variable N conditional on variables 1, … ,N-1. If you’re assuming the dependence is Gaussian for all variables, then it really simplifies things. You can instead model the partial correlations directly and then build up to a correlation matrix. Since each partial correlation layer is conditional on the other variables, you are able to reduce correlations between the correlation parameters (if that makes sense).

4 posts were split to a new topic: Vine examples in Stan