Looking for suggestions to make code more efficient

I’ve been working on a project and have written some Stan code. It’s working fine, but I’ve been wondering what I could do to make it more efficient. I’ve read through the documentation and tried to make changes to my code. I just want to post it here, together with my considerations, to see whether there’s anything else I could do to optimize it.

functions {
  // competing risks cox proportional hazards model, log likelihood
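  // returns sum over subjects i and risk types j of
  //   delta[i,j] * (log lambda_cal[i,j] + Xbeta[i,j]) - exp(Xbeta[i,j]) * H[i,j]
  // where lambda_cal[i,j] is the baseline hazard in y[i]'s bin and H[i,j] is
  // the baseline cumulative hazard at y[i], both for risk type j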
  real crcox_lpdf(vector y, matrix Xbeta, matrix delta, matrix lambda_cal, matrix H) {
    return sum(delta .* (log(lambda_cal) + Xbeta) - exp(Xbeta) .* H);
  }
}

data {
  int<lower=1> n; // number of data points
  int<lower=1> p; // number of covariates
  int<lower=1> k; // number of bins for piecewise linear baseline hazard rates
  int<lower=1> m; // number of risk types
  vector[n] y; // time to event data
  vector[k+1] s; // knots for the piecewise linear baseline hazard rates
  matrix[n,m] delta; // indicator of risk type
  matrix[n,p] X; // design matrix
  real<lower=0> kappa0; // for multiplicative gamma process
  real<lower=0> kappa1; // for multiplicative gamma process
}

transformed data {
  array[n] int<lower=1,upper=k> ki; // bin for each event
  matrix[k,m] s_diff; // bin widths, matrix form for calculation
  matrix[n,m] T; // time since last knot, matrix form for calculation

  for (i in 1:n) {
    for (l in 1:k){
      if (y[i]>s[l] && y[i] <= s[l+1]){
        ki[i] = l;
        T[i] = rep_row_vector(y[i] - s[l],m);
      }
    }
  }
  for (l in 1:k){
    s_diff[l] = rep_row_vector(s[l+1]-s[l],m);
  }
}

parameters {
  matrix[p,m] beta;
  matrix<lower=0>[k,m] psi;
  vector[p] mu;
  real<lower=0> s2;
}

transformed parameters {
  matrix[k,m] lambda; // baseline hazard rates
  
  lambda[1] = psi[1];
  for (l in 2:k){
    lambda[l] = psi[l] .* lambda[l-1];
  }
}

model {
  matrix[n,m] H; // baseline cumulative hazard rates for each ti
  matrix[n,m] lambda_cal; // lambda for each ti
  matrix[k,m] cum_lambda1;
  matrix[k,m] cum_lambda2;
  matrix[p,m] mu_rep;
  matrix[n,m] Xbeta;
  vector[(k-1)*m] psi_1d;
  vector[p*m] mu_1d;
  vector[p*m] beta_1d;

  // get lambda_cal
  lambda_cal = lambda[ki];
  
  // get H
  cum_lambda1 = lambda .* s_diff;
  cum_lambda2[1] = zeros_row_vector(m);

  for (l in 2:k){
    cum_lambda2[l] = cum_lambda2[l-1] + cum_lambda1[l-1];
  }
  H = cum_lambda2[ki];

  H = H + lambda_cal .* T;
  mu_rep = rep_matrix(mu,m);
  mu_1d = to_vector(mu_rep);
  beta_1d = to_vector(beta);
  psi_1d = to_vector(psi[2:]);
  Xbeta = X * beta;

  target += crcox_lpdf(y | Xbeta, delta, lambda_cal, H);
  target += cauchy_lpdf(s2 |0,1);
  target += normal_lpdf(beta_1d | mu_1d, sqrt(s2));
  target += normal_lpdf(mu | 0, 10);
  target += gamma_lpdf(psi[1] | kappa0, kappa0);
  target += gamma_lpdf(psi_1d | kappa1, kappa1);

}

Apologies for posting such a long piece of code and asking people to read it… Here is the main question I have in mind: I’m mostly using matrices and vectors in my code, although several variables (e.g., lambda and psi) are mainly accessed and looped over by row. I understand from the documentation that in such cases it’d be more efficient to define them as arrays of row vectors. I tried using arrays of row vectors, but decided to change back to the matrix/vector setup because:

  1. ultimately, my log-likelihood calculation (see the custom function) requires element-wise multiplication, and I believe having everything as matrices makes this more efficient;
  2. I wanted to make use of the vectorized forms of the normal_lpdf and gamma_lpdf functions, so I needed to be able to convert everything into a one-dimensional array or vector. There is a built-in function to convert a matrix into a vector, but there isn’t a straightforward way to turn an array of vectors into a one-dimensional array.

Some other questions I had:

  1. Does it make a difference whether I subset a row from a matrix using the syntax A[j,] or A[j]?
  2. Any suggestions on how to benchmark the time for different versions of the code? I’m looking for functionality similar to the benchmark package in R, where I can time two code chunks to tell which one is more efficient. For now I’ve been repeatedly running different versions of my code and comparing their run times, but it’s hard to tell the difference for subtle changes.

If you spot anything else that’s inefficient, I’d also be super interested to know!!!

Thanks. I find it much easier seeing all the code.

  • You can move zeros_row_vector(m) to be defined in transformed data so it only gets allocated once.

  • You can just allocate mu_1d using rep_vector rather than generating mu_rep and then flattening.
    mu_1d = to_vector(mu_rep);

  • You can replace the target += foo_lpdf(y | ...) with y ~ foo(...) and it drops normalizing constants, which can be expensive. Alternatively, use foo_lupdf, where the u is for unnormalized.

  • target += normal_lpdf(beta_1d | mu_1d, sqrt(s2)); is a problem because beta_1d and mu_1d and s2 are parameters. This leads to what we call a centered parameterization, and unless you have a lot of data, it’s usually better to do this in a non-centered way, which would involve declaring beta as matrix<offset=mu_1d, multiplier=sqrt(s2)>[p,m] beta; (see the sketch after this list).

  • The Cauchy priors on scales can be too broad and slow down computation—unless you’re expecting really large values, like 10K or something, use something with narrower tails.
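To make the offset/multiplier suggestion concrete, here’s a rough sketch (not tested) of how it might look for your model. The declaration has to go in the parameters block, and the offset and multiplier expressions can use data and any parameters declared earlier in that block; I’ve written it for the flattened beta_1d rather than the matrix, since your location is a vector of length p * m.

parameters {
  vector[p] mu;
  real<lower=0> s2;
  // non-centered via the affine transform: Stan samples the standardized
  // variable internally and shifts/scales it back to beta_1d
  vector<offset=to_vector(rep_matrix(mu, m)),
         multiplier=sqrt(s2)>[p * m] beta_1d;
  matrix<lower=0>[k, m] psi;
}
model {
  // the distribution statement stays in centered form; the declaration
  // does the non-centering
  beta_1d ~ normal(to_vector(rep_matrix(mu, m)), sqrt(s2));
  // ... rest of the model as before, with beta = to_matrix(beta_1d, p, m)
}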

No. Either way, it’s inefficient for two reasons. One, it has to allocate a new vector and then fill it. Two, it’s pulling out a row, which is not memory local because Stan stores matrices in column-major order. An array of row vectors would be more efficient here.

If the differences are subtle, then they probably don’t matter much unless you really want to scale up production of this. If you are trying to compare, use the same inits.

Rearranging things is fast in Stan—it just gets compiled down to C++ loops. The expensive things are functions that involve gradients (i.e., functions of parameters rather than just rearrangements). So this is worth the cost of the to-vector functions. It allows Stan to use a compressed representation of the expression graph for computing derivatives.
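For example, with the names from your model, the two forms below compute the same log density, but the vectorized one adds a single node to the expression graph while the loop adds p * m of them:

  // vectorized: one node, shared terms computed once
  beta_1d ~ normal(mu_1d, sqrt(s2));

  // scalar loop: p * m separate nodes, each with its own partials
  for (i in 1:(p * m)) {
    beta_1d[i] ~ normal(mu_1d[i], sqrt(s2));
  }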

Ultimately, this is an issue with how I first defined memory access for arrays vs. vectors and matrices in Stan. Had I followed something like NumPy’s ndarray strategy, this would all be a lot easier (though that doesn’t cure the fact that you need to choose an ordering of matrix elements).


Thanks a lot!! This is very helpful! I’ll make sure to go through everything on the list.

A quick question for now. You mentioned:

  • You can just allocate mu_1d using rep_vector rather than generating mu_rep and then flattening.
    mu_1d = to_vector(mu_rep);

I wonder how to use rep_vector directly, since my mu is itself a vector. I tried but couldn’t get it to work, and that’s why I was using rep_matrix and then flattening.
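The closest I could come up with is a small helper like the one below (just a sketch, and the function name is made up), but I wasn’t sure whether that’s what you had in mind or whether there’s a built-in I’m missing:

functions {
  // sketch of a helper that stacks a vector end to end m times,
  // equivalent to to_vector(rep_matrix(v, m))
  vector rep_vec_times(vector v, int m) {
    int p = num_elements(v);
    vector[p * m] out;
    for (j in 1:m) {
      out[((j - 1) * p + 1):(j * p)] = v;
    }
    return out;
  }
}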

Thanks!

Thanks a lot again for your detailed suggestions!! I have a couple of follow-up questions:

Q1

I tried to implement the non-centered parameterization as you suggested, but couldn’t get it to work so far. Here is my code. I wonder what I have missed.

Note:

  1. the first two code blocks are exactly the same as what I posted originally, so I didn’t include them
  2. as I mentioned in my last post, I’m not sure how to directly use rep_vector on mu to get to mu_1d, since mu is a vector. Therefore I’m still using rep_matrix in the following code. I guess I could create a custom function for this purpose. Would you suggest using a custom function here, like your answer in this post: What is the equivalent to the R function rep(x, each = n) in Stan? - Stack Overflow
  3. With the following code, I’m able to compile the Stan model, but whenever I try to run it, I get the following error message:

Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: normal_lpdf: Random variable[1] is nan, but must be not nan! (in ‘anon_model’, line 78, column 2 to column 36)

transformed data {
  array[n] int<lower=1,upper=k> ki; // bin for each event
  matrix[k,m] s_diff; // bin widths, matrix form for calculation
  matrix[n,m] T; // time since last knot, matrix form for calculation
  row_vector[m] zeros; 

  zeros = zeros_row_vector(m);

  for (i in 1:n) {
    for (l in 1:k){
      if (y[i]>s[l] && y[i] <= s[l+1]){
        ki[i] = l;
        T[i] = rep_row_vector(y[i] - s[l],m);
      }
    }
  }
  for (l in 1:k){
    s_diff[l] = rep_row_vector(s[l+1]-s[l],m);
  }
}

parameters {
  vector[p] mu;
  matrix<lower=0>[k,m] psi;
  real<lower=0> s2;
}

transformed parameters {
  matrix[k,m] lambda; // baseline hazard rates
  vector[p*m] mu_1d;
  mu_1d = to_vector(rep_matrix(mu,m));
  vector<offset=mu_1d, multiplier=sqrt(s2)>[p*m] beta_1d;
  
  lambda[1] = psi[1];
  for (l in 2:k){
    lambda[l] = psi[l] .* lambda[l-1];
  }
}

model {
  matrix[n,m] H; // baseline cumulative hazard rates for each ti
  matrix[n,m] lambda_cal; // lambda for each ti
  matrix[k,m] cum_lambda1;
  matrix[k,m] cum_lambda2;

  matrix[n,m] Xbeta;
  vector[(k-1)*m] psi_1d;
  matrix[p,m] beta;

  beta = to_matrix(beta_1d,p,m);
  
  // get lambda_cal
  lambda_cal = lambda[ki];
  
  // get H
  cum_lambda1 = lambda .* s_diff;
  cum_lambda2[1] = zeros;

  for (l in 2:k){
    cum_lambda2[l] = cum_lambda2[l-1] + cum_lambda1[l-1];
  }
  H = cum_lambda2[ki];

  H = H + lambda_cal .* T;
  psi_1d = to_vector(psi[2:]);
  Xbeta = X * beta;

  y ~ crcox(Xbeta,delta,lambda_cal,H);
  s2 ~ inv_gamma(1,1);
  beta_1d ~ normal(mu_1d, sqrt(s2));
  mu ~ normal(0,10);
  psi[1] ~ gamma(kappa0,kappa0);
  psi_1d ~ gamma(kappa1,kappa1);

}

Q2:

Would it be better if I changed my matrix setup so that I use the columns A[,j] instead? Or would this still be inferior to using an array of row vectors? The reason I’ve been resisting switching to arrays of vectors is partly related to the next question.

Q3:

If I understand correctly, you confirmed here that it’d be worthwhile to always convert a matrix to a one-dimensional vector, or an array of vectors to a one-dimensional array, before calculating the likelihood? Or did you mean that it’d be better to use a one-dimensional vector rather than a one-dimensional array in the likelihood calculation?

In my original post (the two points quoted there), I was trying to explain that part of the reason I decided to set up all variables as matrices instead of arrays of vectors was that, at the end of the day, I wanted to convert them into a one-dimensional vector or array so that I could use the vectorized likelihood calculations. I chose matrices because I can easily use the to_vector function to flatten them into a one-dimensional vector, but there isn’t a built-in function that does this for arrays of vectors. Though I guess I could also write a custom function to do that…
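For instance, something along the following lines is the kind of custom function I have in mind for flattening an array of row vectors into a single vector (just a sketch, with a made-up name):

functions {
  // sketch: flatten an array of row_vectors into one long vector, row by row
  // (analogous to to_vector on a matrix, but in row-major order)
  vector flatten_rows(array[] row_vector x) {
    int r = size(x);
    int c = num_elements(x[1]);
    vector[r * c] out;
    for (i in 1:r) {
      out[((i - 1) * c + 1):(i * c)] = to_vector(x[i]);
    }
    return out;
  }
}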

I’m not sure what the best configuration is. I’ll try to set things up both ways and see how the times compare. In the meantime, I’d appreciate any thoughts or comments!

Some more updates. I tried the non-centered parameterization using the following code (not with the offset syntax, but the usual way; I’m still very curious how to make the offset syntax work). However, it takes almost twice as long as the code using the centered parameterization. Moreover, I got the following warnings after the run:

Warning messages:
1: There were 21 divergent transitions after warmup. See
Runtime warnings and convergence problems
to find out why this is a problem and how to eliminate them.
2: Examine the pairs() plot to diagnose sampling problems

Maybe my data is in a situation where the non-centered parameterization is not suitable?

functions {
  // competing risks cox proportional hazards model, log likelihood
  real crcox_lpdf(vector y, matrix Xbeta, matrix delta, matrix lambda_cal, matrix H) {
    return sum(delta .* (log(lambda_cal) + Xbeta) - exp(Xbeta) .* H);
  }

}

data {
  int<lower=1> n; // number of data points
  int<lower=1> p; // number of covariates
  int<lower=1> k; // number of bins for piecewise linear baseline hazard rates
  int<lower=1> m; // number of risk types
  vector[n] y; // time to event data
  vector[k+1] s; // knots for the piecewise linear baseline hazard rates
  matrix[n,m] delta; // indicator of risk type
  matrix[n,p] X; // design matrix
  real<lower=0> kappa0; // for multiplicative gamma process
  real<lower=0> kappa1; // for multiplicative gamma process
}

transformed data {
  array[n] int<lower=1,upper=k> ki; // bin for each event
  matrix[k-1,m] s_diff; // bin widths, matrix form for calculation
  matrix[n,m] T; // time since last knot, matrix form for calculation
  row_vector[m] zeros; 

  zeros = zeros_row_vector(m);

  for (i in 1:n) {
    for (l in 1:k){
      if (y[i]>s[l] && y[i] <= s[l+1]){
        ki[i] = l;
        T[i] = rep_row_vector(y[i] - s[l],m);
      }
    }
  }
  for (l in 1:(k-1)){
    s_diff[l] = rep_row_vector(s[l+1]-s[l],m);
  }
}

parameters {
  vector[p*m] beta_1d_raw;
  matrix<lower=0>[k,m] psi;
  vector[p] mu;
  real<lower=0> s2;
}

transformed parameters {
  matrix[k,m] lambda; // baseline hazard rates
  vector[p*m] beta_1d;

  lambda[1] = psi[1];
  for (l in 2:k){
    lambda[l] = psi[l] .* lambda[l-1];
  }
  beta_1d = to_vector(rep_matrix(mu,m)) + sqrt(s2) * beta_1d_raw;
}

model {
  matrix[n,m] H; // baseline cumulative hazard rates for each ti
  matrix[n,m] lambda_cal; // lambda for each ti
  matrix[k,m] cum_lambda;
  matrix[n,m] Xbeta;
  vector[(k-1)*m] psi_1d;
  matrix[p,m] beta;

  beta = to_matrix(beta_1d,p,m);

  // get lambda_cal
  lambda_cal = lambda[ki];
  
  // get H
  cum_lambda[1] = zeros;
  cum_lambda[2:] = block(lambda,1,1,(k-1),m) .* s_diff;

  for (j in 1:m){
    cum_lambda[,j] = cumulative_sum(cum_lambda[,j]);
  }

  H = cum_lambda[ki];

  H = H + lambda_cal .* T;
  psi_1d = to_vector(psi[2:]);
  Xbeta = X * beta;

  target += crcox_lupdf(y | Xbeta, delta, lambda_cal, H);
  target += inv_gamma_lupdf(s2 |1,1);
  target += normal_lupdf(beta_1d_raw | 0, 1);
  target += normal_lupdf(mu | 0, 10);
  target += gamma_lupdf(psi[1] | kappa0, kappa0);
  target += gamma_lupdf(psi_1d | kappa1, kappa1);

}