Vectorization and improving speed for a complex mixture type latent trait model for response times

I have the following model syntax, which I have tested using both simulated and real datasets. It works fine: the model converges, and I get nicely interpretable parameters.

The issue is speed. It takes about an hour for a few thousand observations (e.g., 200 individuals x 20 items = 4,000 observations). However, I have a dataset with about 11,000 individuals and 60 items. Based on my observations, that would take 10-15 days.

I would like to ask if anybody has any suggestions for improving the speed. For instance, how can I vectorize the for loop in the model statement? Is there anything else you would recommend changing to improve the speed?

Thank you.

// Stan model syntax with modified deterministic gated lognormal response time model

data{
    int <lower=1> I;                       // number of examinees          
    int <lower=1> J;                       // number of items
    int <lower=1> n_obs;                   // number of observations
    int <lower=1> p_loc[n_obs];            // person indicator   
    int <lower=1> i_loc[n_obs];            // item indicator
    real Y[n_obs];                         // vector of log response time
}

parameters {
  real mu_beta;                 // mean for time intensity parameters
  real<lower=0> sigma_beta;     // sd for time intensity parameters
  
  real mu_alpha;                // mean for log of time discrimination parameters
  real<lower=0> sigma_alpha;    // sd for log of time discrimination parameters
  
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> sigma_tauc;     // sd for tau_c
  
  corr_matrix[2] omega_P;       // 2 x 2 correlation matrix for person parameters
  corr_matrix[2] omega_I;       // 2 x 2 correlation matrix for item parameters
  
  vector<lower=0,upper=1>[J] pC; // vector of length J for the probability of item compromise status
  
  vector<lower=0,upper=1>[I] pH; // vector of length I for the probability of examinee item preknowledge 
  
  ordered[2] person[I];          // an array with length I for person specific latent parameters
  // Each array has two elements
  // first element is tau_t
  // second element is tau_c
  // ordered vector assures that tau_c > tau_t for every person
  // to make sure chains are exploring the same mode and 
  // multiple chains do not go east and west, leading to multi-modal posteriors
 
  
  vector[2] item[J];    // an array with length J for item specific parameters
  // each array has two elements
  // first element is alpha
  // second element is beta
}


transformed parameters{
  
  vector[2] mu_P;                        // vector for mean vector of person parameters 
  vector[2] mu_I;                        // vector for mean vector of item parameters
  
  vector[2] scale_P;                     // vector of standard deviations for person parameters
  vector[2] scale_I;                     // vector of standard deviations for item parameters
  
  cov_matrix[2] Sigma_P;                 // covariance matrix for person parameters
  cov_matrix[2] Sigma_I;                 // covariance matrix for item parameters
  
  mu_P[1] = 0;
  mu_P[2] = 0;
  
  scale_P[1] = sigma_taut;               
  scale_P[2] = sigma_tauc;
  
  Sigma_P = quad_form_diag(omega_P, scale_P); 
  
  mu_I[1] = mu_alpha;
  mu_I[2] = mu_beta;
  
  scale_I[1] = sigma_alpha;               
  scale_I[2] = sigma_beta;
  
  Sigma_I = quad_form_diag(omega_I, scale_I); 
}



model{

  sigma_taut  ~ exponential(1);
  sigma_tauc  ~ exponential(1);
  sigma_beta  ~ exponential(1);
  sigma_alpha ~ exponential(1);
  
  mu_beta      ~ normal(4,1);
  mu_alpha     ~ lognormal(0,0.5);
  
  pC ~ beta(1,1);
  pH ~ beta(1,1);
  
  omega_P   ~ lkj_corr(1);
  omega_I   ~ lkj_corr(1);
  
  person  ~ multi_normal(mu_P,Sigma_P);
  
  item    ~ multi_normal(mu_I,Sigma_I);
  
  
  for (i in 1:n_obs) {
      
      // item[i_loc[i],1] represents log of parameter alpha of the (i_loc[i])th item
          // that's why we use exp(item[i_loc[i],1]) below 
      // item[i_loc[i],2] represents parameter beta of the (i_loc[i])th item
      
      //person[p_loc[i],1] represents parameter tau_t of the (p_loc[i])th person
      //person[p_loc[i],2] represents parameter tau_c of the (p_loc[i])th person
      
      
      real p_t = item[i_loc[i],2] - person[p_loc[i],1];   //expected response time for non-cheating response
      real p_c = item[i_loc[i],2] - person[p_loc[i],2];  //expected response time for cheating response
      
      // log of probability densities for each combination of two discrete parameters
      // (C,T) = {(0,0),(0,1),(1,0),(1,1)}
      
      real lprt1 = log1m(pC[i_loc[i]]) + log1m(pH[p_loc[i]]) + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 0, C=0
      real lprt2 = log1m(pC[i_loc[i]]) + log(pH[p_loc[i]])   + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 1, C=0
      real lprt3 = log(pC[i_loc[i]])   + log1m(pH[p_loc[i]]) + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 0, C=1
      real lprt4 = log(pC[i_loc[i]])   + log(pH[p_loc[i]])   + normal_lpdf(Y[i] | p_c, 1/exp(item[i_loc[i],1]));  // T = 1, C=1 
      
      target += log_sum_exp([lprt1, lprt2, lprt3, lprt4]);
  }
  
}

So two slightly conflicting suggestions:

  1. Try to avoid declaring variables to store the result of an intermediate computation wherever possible. (Apparently more declared variables can induce substantial slowdown for auto-differentiation)

  2. Try to look for redundant computations and instead of doing them redundantly, store as a declared variable and refer to them where they’re needed.

And presumably you’re aware of the various parallelization options?

Here’s a partly vectorized (annoyingly normal_lpdf doesn’t have a non-reducing variant and log_sum_exp doesn’t have a row-wise variant for matrix inputs) and less-compute-redundant version of your model block for loop:

  vector[n_obs] p_t = item[i_loc,2] - person[p_loc,1];   
  vector[n_obs] p_c = item[i_loc,2] - person[p_loc,2]; 
      
  vector[n_obs] one_div_exp_item_loc = 1/exp(item[i_loc,1]) ;

  vector[n_obs] norm_p_t ;
  vector[n_obs] norm_p_c;
  for (i in 1:n_obs) {
    norm_p_t[i] = normal_lpdf(Y[i] | p_t[i] , one_div_exp_item_loc[i]);  
    norm_p_c[i] = normal_lpdf(Y[i] | p_c[i] , one_div_exp_item_loc[i]);  
  }

  vector[n_obs] log1m_pC_loc = log1m(pC[i_loc]) ;
  vector[n_obs] log_pC_loc = log(pC[i_loc]) ;
  vector[n_obs] log1m_pH_loc = log1m(pH[p_loc]) ;
  vector[n_obs] log_pH_loc = log(pH[p_loc]) ;

  matrix[n_obs,4] lprt;
  lprt[,1] = log1m_pC_loc + log1m_pH_loc + norm_p_t;  // T = 0, C=0
  lprt[,2] = log1m_pC_loc + log_pH_loc   + norm_p_t;  // T = 1, C=0
  lprt[,3] = log_pC_loc   + log1m_pH_loc + norm_p_t;  // T = 0, C=1
  lprt[,4] = log_pC_loc   + log_pH_loc   + norm_p_c;  // T = 1, C=1 
  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]) ;
  }
  target += lps ;

It’s possible to hack a non-reducing normal_lpdf (inspiration) yourself via:


  vector[n_obs] norm_p_t = (
    // std-normal kernel after shifting/scaling (note the elementwise ./ and square())
    -0.5 * square((to_vector(Y) - p_t) ./ one_div_exp_item_loc)
    // jacobian for the division (i.e. the -log(sigma) normalization term):
    - log(one_div_exp_item_loc)
  ) ;

  vector[n_obs] norm_p_c = (
    // std-normal kernel after shifting/scaling
    -0.5 * square((to_vector(Y) - p_c) ./ one_div_exp_item_loc)
    // jacobian for the division:
    - log(one_div_exp_item_loc)
  ) ;

But I’m not confident that would be any more performant than the loop above.


Thank you so much for the suggestions. I will work on them. I am aware of within-chain parallelization, and I tried it with this model in the past. For some reason, it didn’t make much difference. I will revisit it and see if I can post a version with within-chain parallelization and the performance differences.
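For reference, here is roughly how I understand the within-chain parallelization would slice the likelihood loop with reduce_sum (just a sketch in the newer array syntax; the function name partial_log_lik, the argument layout, and the grainsize are my own untested placeholders):

```stan
functions {
  // partial log-likelihood for a slice of observations
  real partial_log_lik(array[] real Y_slice, int start, int end,
                       array[] int i_loc, array[] int p_loc,
                       array[] vector item, array[] vector person,
                       vector pC, vector pH) {
    real lp = 0;
    for (n in start : end) {
      int k = n - start + 1;   // position of observation n within the slice
      real p_t = item[i_loc[n]][2] - person[p_loc[n]][1];
      real p_c = item[i_loc[n]][2] - person[p_loc[n]][2];
      real a   = 1 / exp(item[i_loc[n]][1]);
      lp += log_sum_exp(
        [log1m(pC[i_loc[n]]) + log1m(pH[p_loc[n]]) + normal_lpdf(Y_slice[k] | p_t, a),
         log1m(pC[i_loc[n]]) + log(pH[p_loc[n]])   + normal_lpdf(Y_slice[k] | p_t, a),
         log(pC[i_loc[n]])   + log1m(pH[p_loc[n]]) + normal_lpdf(Y_slice[k] | p_t, a),
         log(pC[i_loc[n]])   + log(pH[p_loc[n]])   + normal_lpdf(Y_slice[k] | p_c, a)]);
    }
    return lp;
  }
}
```

The observation loop in the model block would then be replaced by something like `target += reduce_sum(partial_log_lik, Y, grainsize, i_loc, p_loc, item, person, pC, pH);`, with grainsize tuned by trial and error and the model compiled with threading enabled.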


@jsocolar pointed me to the Stan User’s Guide section on log_sum_exp, inspiring replacement of this:

  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]) ;
  }

with:

  vector[n_obs] rowmax_lprt;
  for (i in 1:n_obs) {
    rowmax_lprt[i] = max(lprt[i,]) ;
  }
  matrix[n_obs,4] exp_lprt_minus_max = exp(lprt-rep_matrix(rowmax_lprt,4));
  vector[n_obs] rowsum_of_exp_lprt_minus_max ;
  for (i in 1:n_obs) {
    rowsum_of_exp_lprt_minus_max[i] = sum(exp_lprt_minus_max[i,]) ;
  }
  target += rowmax_lprt + log(rowsum_of_exp_lprt_minus_max) ;

The latter version has the potential benefit of vectorized computation of the exp_lprt_minus_max term and of the operations in the final target line, but at the cost of two loops rather than one and the declaration of more intermediate variables, so I’m not sure whether it would be any faster or possibly slower.

Since there are only 4 columns in lprt, I guess the second loop and its associated intermediate variable could be eliminated:

  target += rowmax_lprt + log(
      exp_lprt_minus_max[,1] 
    + exp_lprt_minus_max[,2] 
    + exp_lprt_minus_max[,3] 
    + exp_lprt_minus_max[,4] 
  ) ;

Just to report back on this: the modification to log_sum_exp didn’t help and was indeed slower. So you guessed right!

Reporting back on this as well: the partly vectorized version using normal_lpdf did indeed cut the running time almost in half (based on two separate experiments).

Adding the non-reducing (vectorized) normal_lpdf actually made things worse, which surprised me.

The code below gave the best performance of all the suggestions made above (about half the running time of the original code). I also had to update a few things, as the newer version of cmdstan forced me to rewrite the syntax for arrays and no longer accepted the old vector[]-style declarations. Because of that, I had to keep another loop for p_t, p_c, etc., since expressions like vector[n_obs] p_t = item[i_loc,2] - person[p_loc,1]; didn’t work as expected, I suppose because these are now defined as arrays.

data{
  int <lower=1> J;                       // number of examinees          
  int <lower=1> I;                       // number of items
  int <lower=1> n_obs;                   // number of observations (I x J minus missing responses)
  array[n_obs] int<lower=1> p_loc;       // person indicator   
  array[n_obs] int<lower=1> i_loc;       // item indicator
  array[n_obs] real Y;                   // vector of log of responses
}

parameters {
  real mu_beta;                 // mean for time intensity parameters
  real<lower=0> sigma_beta;     // sd for time intensity parameters
  
  real mu_alpha;                // mean for log of time discrimination parameters
  real<lower=0> sigma_alpha;    // sd for log of time discrimination parameters
  
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> mu_tauc;        // mean for tau_c
  real<lower=0> sigma_tauc;     // sd for tau_c
  
  corr_matrix[2] omega_P;       // 2 x 2 correlation matrix for person parameters
  corr_matrix[2] omega_I;       // 2 x 2 correlation matrix for item parameters
  
  vector<lower=0,upper=1>[I] pC; // vector of length I for the probability of item compromise status
  
  vector<lower=0,upper=1>[J] pH; // vector of length J for the probability of examinee item preknowledge 
  
  array[J] ordered[2] person;    // an array of length J for person-specific latent parameters
  // Each array has two elements
  // first element is tau_t
  // second element is tau_c
  // ordered vector assures that tau_c > tau_t for every person
  // to make sure chains are exploring the same mode and 
  // multiple chains do not go east and west, leading to multi-modal posteriors
  
  
  array[I] vector[2] item;    // an array of length I for item-specific parameters
  // each array has two elements
  // first element is alpha
  // second element is beta
}


transformed parameters{
  
  vector[2] mu_P;                        // vector for mean vector of person parameters 
  vector[2] mu_I;                        // vector for mean vector of item parameters
  
  vector[2] scale_P;                     // vector of standard deviations for person parameters
  vector[2] scale_I;                     // vector of standard deviations for item parameters
  
  cov_matrix[2] Sigma_P;                 // covariance matrix for person parameters
  cov_matrix[2] Sigma_I;                 // covariance matrix for item parameters
  
  mu_P[1] = 0;
  mu_P[2] = mu_tauc;
  
  scale_P[1] = sigma_taut;               
  scale_P[2] = sigma_tauc;
  
  Sigma_P = quad_form_diag(omega_P, scale_P); 
  
  mu_I[1] = mu_alpha;
  mu_I[2] = mu_beta;
  
  scale_I[1] = sigma_alpha;               
  scale_I[2] = sigma_beta;
  
  Sigma_I = quad_form_diag(omega_I, scale_I); 
}


model{
  
  sigma_taut  ~ exponential(1);
  sigma_tauc  ~ exponential(1);
  sigma_beta  ~ exponential(1);
  sigma_alpha ~ exponential(1);
  
  mu_tauc      ~ normal(0,2);
  
  mu_beta      ~ normal(4,1);
  mu_alpha     ~ normal(0,0.5);
  
  pC ~ beta(1,1);
  pH ~ beta(1,1);
  
  omega_P   ~ lkj_corr(1);
  omega_I   ~ lkj_corr(1);
  
  person  ~ multi_normal(mu_P,Sigma_P);
  
  item    ~ multi_normal(mu_I,Sigma_I);

  vector[n_obs] norm_p_t ;
  vector[n_obs] norm_p_c;
  for (i in 1:n_obs) {
    real p_t    = item[i_loc[i]][2] - person[p_loc[i]][1];
    real p_c    = item[i_loc[i]][2] - person[p_loc[i]][2];
    real a      = 1/exp(item[i_loc[i]][1]);
    norm_p_t[i] = normal_lpdf(Y[i] | p_t, a);  
    norm_p_c[i] = normal_lpdf(Y[i] | p_c, a);  
  }
  
  vector[n_obs] log1m_pC_loc = log1m(pC[i_loc]) ;
  vector[n_obs] log_pC_loc   = log(pC[i_loc]) ;
  vector[n_obs] log1m_pH_loc = log1m(pH[p_loc]) ;
  vector[n_obs] log_pH_loc   = log(pH[p_loc]) ;
  
  matrix[n_obs,4] lprt;
  lprt[,1] = log1m_pC_loc + log1m_pH_loc + norm_p_t;  // T = 0, C=0
  lprt[,2] = log1m_pC_loc + log_pH_loc   + norm_p_t;  // T = 1, C=0
  lprt[,3] = log_pC_loc   + log1m_pH_loc + norm_p_t;  // T = 0, C=1
  lprt[,4] = log_pC_loc   + log_pH_loc   + norm_p_c;  // T = 1, C=1 
  
  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]) ;
  }
  
  target += lps ;

}




One last follow-up question.

I followed the instructions on this page and was able to reproduce the example. For that specific example, the GPU run was about 6 times faster than the CPU run on my hardware.

I then repeated the experiment with this model (both the original model code and the partly vectorized faster version) using a CPU and a GPU. The GPU run was much slower than the CPU run, which was also surprising. Any idea or insight into why the GPU efficiency didn’t replicate for this specific model?

First, you want to replace all the dense covariance stuff with Cholesky factors. It’s a huge speedup, as it reduces cubic operations to quadratic. These are only two by two, so you could alternatively parameterize with two scales and a correlation, which gives you a little more flexibility in defining priors than either Wishart or LKJ + scale.
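For the 2 x 2 case, the scales-plus-correlation version might look something like this (a sketch; rho_P and L_P are made-up names, and you’d do the same for the item side):

```stan
parameters {
  vector<lower=0>[2] scale_P;       // sigma_taut, sigma_tauc
  real<lower=-1, upper=1> rho_P;    // correlation between tau_t and tau_c
  // ...
}
model {
  // Cholesky factor of the 2 x 2 covariance, built directly from scales and correlation
  matrix[2, 2] L_P = [[scale_P[1], 0],
                      [scale_P[2] * rho_P, scale_P[2] * sqrt(1 - square(rho_P))]];
  scale_P ~ exponential(1);
  // any prior you like can go directly on rho_P; it's flat on (-1, 1) otherwise
  person ~ multi_normal_cholesky(mu_P, L_P);
  // ...
}
```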

You can use vector assignments to simplify your transformed parameters code:

transformed parameters{
  vector[2] mu_P = [0, 0]';
  vector[2] mu_I = [mu_alpha, mu_beta]';
  ...

It’s a little less efficient, but that won’t be a big deal here, and it’s a lot more compact and less error-prone.

One thing you can do to speed things up is compute all of these log1m and log terms ahead of time to avoid redundant calculations, which you’ll get a lot of with the random effect indexing. Instead, try this:

vector[J] log1m_pC = log1m(pC);
vector[J] log_pC = log(pC);

and then replace, e.g., log1m(pC[i_loc[i]]) with log1m_pC[i_loc[i]].

You can probably do this with p_t and p_c.

vector[n_obs] item_m_person_t = item[i_loc, 2] - person[p_loc, 1];

[quote="mike-lawrence, post:3, topic:27000"]
annoyingly normal_lpdf doesn’t have a non-reducing variant and log_sum_exp doesn’t have a row-wise variant for matrix inputs
[/quote]

There's a will, but no easy way.  The C++ coding is a bottleneck.  If someone contributes a patch for these, we'd merge them. There's been an issue for the former for years.

I'd recommend laying out code in a more standard way for readability, e.g., 

```stan
vector[n_obs] norm_p_c
  = -0.5 * square((to_vector(Y) - p_c) ./ one_div_exp_item_loc)
    - log(one_div_exp_item_loc);
```
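To spell out where the last term comes from: with sigma equal to one_div_exp_item_loc, this is just the log normal density with its constant dropped,

$$
\log \mathcal{N}(y \mid \mu, \sigma)
= -\tfrac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2} - \log \sigma - \tfrac{1}{2}\log 2\pi ,
$$

where the \(-\tfrac{1}{2}\log 2\pi\) piece can be dropped because it doesn’t involve any parameters, while the \(-\log \sigma\) piece has to stay because sigma depends on alpha.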
I hadn’t thought about deriving the normalization constant (1 / sigma) as a Jacobian on the transform, but it’s another way to get to the standard result. This will be more efficient than a loop calling our built-in normal_lupdf (the u is for unnormalized). You can also turn this into a function:

// note: a function returning a vector can't use the _lpdf suffix,
// so it's called with ordinary positional syntax rather than the | notation
vector normal_vec(vector Y, vector mu, vector sigma) {
  return -0.5 * square((Y - mu) ./ sigma) - log(sigma);
}
...
vector[n_obs] norm_p_c = normal_vec(to_vector(Y), p_c, one_div_exp_item_loc);

There’s no measurable overhead for the function call.

Thank you for these additional suggestions. I didn’t know how to use Cholesky factors in Stan, and I’m not sure if this is what you meant, but I found this post helpful:

https://stla.github.io/stlapblog/posts/StanLKJprior.html

I replaced the covariance terms with Cholesky factors; the final version of the model syntax is below. This modification provided an additional 23% reduction in total run time (for n_obs = 25,000, I = 50, J = 500, a single chain with 150 warm-up + 150 sampling iterations) with very close estimates.

data{
  int <lower=1> J;                       // number of examinees          
  int <lower=1> I;                       // number of items
  int <lower=1> n_obs;                   // number of observations (I x J minus missing responses)
  array[n_obs] int<lower=1> p_loc;       // person indicator   
  array[n_obs] int<lower=1> i_loc;       // item indicator
  array[n_obs] real Y;                   // vector of log of responses
}

parameters {
  real mu_beta;                 // mean for time intensity parameters
  real mu_alpha;                // mean for log of time discrimination parameters
  vector<lower=0>[2] sigma_I; 
  
  real<lower=0> mu_tauc;        // mean for tau_c
  vector<lower=0>[2] sigma_P; 

  cholesky_factor_corr[2] Lomega_P;
  cholesky_factor_corr[2] Lomega_I;
  
  vector<lower=0,upper=1>[I] pC; // vector of length I for the probability of item compromise status
  
  vector<lower=0,upper=1>[J] pH; // vector of length J for the probability of examinee item preknowledge 
  
  array[J] ordered[2] person;    // an array of length J for person-specific latent parameters
  // Each array has two elements
  // first element is tau_t
  // second element is tau_c
  // ordered vector assures that tau_c > tau_t for every person
  // to make sure chains are exploring the same mode and 
  // multiple chains do not go east and west, leading to multi-modal posteriors
  
  array[I] vector[2] item;    // an array of length I for item-specific parameters
  // each array has two elements
  // first element is alpha
  // second element is beta
}


transformed parameters{
  vector[2] mu_P = [0,mu_tauc]';          // vector for mean vector of person parameters 
  vector[2] mu_I = [mu_alpha,mu_beta]';   // vector for mean vector of item parameters
}


model{
  
  sigma_P  ~ exponential(1);
  sigma_I  ~ exponential(1);

  mu_tauc      ~ normal(0,2);
  
  mu_beta      ~ normal(4,1);
  mu_alpha     ~ normal(0,0.5);
  
  pC ~ beta(1,1);
  pH ~ beta(1,1);
  
  Lomega_P ~ lkj_corr_cholesky(1);
  Lomega_I ~ lkj_corr_cholesky(1);
  
  person  ~ multi_normal_cholesky(mu_P,diag_pre_multiply(sigma_P, Lomega_P));
  item    ~ multi_normal_cholesky(mu_I,diag_pre_multiply(sigma_I, Lomega_I));
  
  vector[n_obs] norm_p_t ;
  vector[n_obs] norm_p_c;
  for (i in 1:n_obs) {
    real p_t = item[i_loc[i]][2] - person[p_loc[i]][1];
    real p_c = item[i_loc[i]][2] - person[p_loc[i]][2];
    real a   = 1/exp(item[i_loc[i]][1]);
    norm_p_t[i] = normal_lpdf(Y[i] | p_t, a);  
    norm_p_c[i] = normal_lpdf(Y[i] | p_c, a);  
  }
  
  vector[I] log1m_pC = log1m(pC) ;
  vector[I] log_pC   = log(pC) ;
  vector[J] log1m_pH = log1m(pH) ;
  vector[J] log_pH   = log(pH) ;
  
  matrix[n_obs,4] lprt;
  lprt[,1] = log1m_pC[i_loc] + log1m_pH[p_loc] + norm_p_t;  // T = 0, C=0
  lprt[,2] = log1m_pC[i_loc] + log_pH[p_loc]   + norm_p_t;  // T = 1, C=0
  lprt[,3] = log_pC[i_loc]   + log1m_pH[p_loc] + norm_p_t;  // T = 0, C=1
  lprt[,4] = log_pC[i_loc]   + log_pH[p_loc]   + norm_p_c;  // T = 1, C=1 
  
  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]) ;
  }
  
  target += lps ;

}

I also tried this; @mike-lawrence suggested something similar in an earlier post.

For some reason, it gives me this error message when compiling the model, and I don’t understand why.

That’s why I still keep these in the for loop.

 ------------------------------------------------- 
    65:    vector[n_obs] p_t   = item[i_loc,2] - person[p_loc,1];
                                 ^
    66:    vector[n_obs] p_t   = item[i_loc,2] - person[p_loc,2];
    67:    vector[n_obs] alpha = 1/exp(item[i_loc,1]);
   -------------------------------------------------

Ill-typed arguments supplied to infix operator -. Available signatures: 
(int, int) => int
(real, real) => real
(real, vector) => vector
(vector, real) => vector
(vector, vector) => vector
(complex, complex) => complex
(real, row_vector) => row_vector
(row_vector, real) => row_vector
(row_vector, row_vector) => row_vector
(real, matrix) => matrix
(matrix, real) => matrix
(matrix, matrix) => matrix
(complex, complex_vector) => complex_vector
(complex_vector, complex) => complex_vector
(complex_vector, complex_vector) => complex_vector
(complex, complex_row_vector) => complex_row_vector
(complex_row_vector, complex) => complex_row_vector
(complex_row_vector, complex_row_vector) => complex_row_vector
(complex, complex_matrix) => complex_matrix
(complex_matrix, complex) => complex_matrix
(complex_matrix, complex_matrix) => complex_matrix
Instead supplied arguments of incompatible type: array[] real, array[] real.

What this is saying is that Stan doesn’t implement subtraction of arrays. You need to declare item so that item[i_loc, 2] is a vector rather than an array.
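For example (a sketch; helper names like item_beta and person_taut are made up here), one way is to pull the array-of-vectors parameters apart into plain vectors in transformed parameters and index those:

```stan
transformed parameters {
  // plain vectors, so multi-indexing with i_loc / p_loc yields vectors
  vector[J] person_taut;
  vector[J] person_tauc;
  vector[I] item_alpha;
  vector[I] item_beta;
  for (j in 1:J) {
    person_taut[j] = person[j][1];
    person_tauc[j] = person[j][2];
  }
  for (i in 1:I) {
    item_alpha[i] = item[i][1];
    item_beta[i]  = item[i][2];
  }
}
model {
  // ...
  vector[n_obs] p_t = item_beta[i_loc] - person_taut[p_loc];
  vector[n_obs] p_c = item_beta[i_loc] - person_tauc[p_loc];
  vector[n_obs] a   = exp(-item_alpha[i_loc]);   // = 1 / exp(alpha)
  // ...
}
```

Declaring item as a matrix[I, 2] instead would also make item[i_loc, 2] a vector directly, though you would then have to adapt the multi_normal prior on the item parameters accordingly.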