Vectorization and improving speed for a complex mixture type latent trait model for response times

I have the following model syntax, which I have tested on both simulated and real datasets. It works fine: the model converges and I get nicely interpretable parameters.

The issue is speed. It takes about an hour for a few thousand observations (e.g., 200 individuals x 20 items = 4,000 observations). However, I have a dataset with about 11,000 individuals and 60 items (roughly 660,000 observations), so based on these timings I estimate the fit would take 10-15 days.

I would like to ask if anybody has any suggestion about improving the speed. For instance, how can I vectorize the for loop in the model statement? Is there anything else you would recommend changing to improve the speed?

Thank you.

// Stan model syntax with modified deterministic gated lognormal response time model

data {
  int <lower=1> I;                       // number of examinees
  int <lower=1> J;                       // number of items
  int <lower=1> n_obs;                   // number of observations
  int <lower=1> p_loc[n_obs];            // person indicator
  int <lower=1> i_loc[n_obs];            // item indicator
  real Y[n_obs];                         // vector of log response times
}

parameters {
  real mu_beta;                 // mean for time intensity parameters
  real<lower=0> sigma_beta;     // sd for time intensity parameters
  real mu_alpha;                // mean for log of time discrimination parameters
  real<lower=0> sigma_alpha;    // sd for log of time discrimination parameters
  real<lower=0> sigma_taut;     // sd for tau_t
  real<lower=0> sigma_tauc;     // sd for tau_c
  corr_matrix[2] omega_P;       // 2 x 2 correlation matrix for person parameters
  corr_matrix[2] omega_I;       // 2 x 2 correlation matrix for item parameters
  vector<lower=0,upper=1>[J] pC; // vector of length J for the probability of item compromise status
  vector<lower=0,upper=1>[I] pH; // vector of length I for the probability of examinee item preknowledge
  ordered[2] person[I];          // an array with length I for person specific latent parameters
  // Each array has two elements
  // first element is tau_t
  // second element is tau_c
  // the ordered vector ensures that tau_c > tau_t for every person
  // so that chains explore the same mode and
  // multiple chains do not go east and west, leading to multi-modal posteriors
  vector[2] item[J];    // an array with length J for item specific parameters
  // each array has two elements
  // first element is alpha
  // second element is beta
}

transformed parameters{
  vector[2] mu_P;                        // vector for mean vector of person parameters 
  vector[2] mu_I;                        // vector for mean vector of item parameters
  vector[2] scale_P;                     // vector of standard deviations for person parameters
  vector[2] scale_I;                     // vector of standard deviations for item parameters
  cov_matrix[2] Sigma_P;                 // covariance matrix for person parameters
  cov_matrix[2] Sigma_I;                 // covariance matrix for item parameters
  mu_P[1] = 0;
  mu_P[2] = 0;
  scale_P[1] = sigma_taut;               
  scale_P[2] = sigma_tauc;
  Sigma_P = quad_form_diag(omega_P, scale_P); 
  mu_I[1] = mu_alpha;
  mu_I[2] = mu_beta;
  scale_I[1] = sigma_alpha;               
  scale_I[2] = sigma_beta;
  Sigma_I = quad_form_diag(omega_I, scale_I);
}

model {
  sigma_taut  ~ exponential(1);
  sigma_tauc  ~ exponential(1);
  sigma_beta  ~ exponential(1);
  sigma_alpha ~ exponential(1);
  mu_beta      ~ normal(4,1);
  mu_alpha     ~ lognormal(0,0.5);
  pC ~ beta(1,1);
  pH ~ beta(1,1);
  omega_P   ~ lkj_corr(1);
  omega_I   ~ lkj_corr(1);
  person  ~ multi_normal(mu_P,Sigma_P);
  item    ~ multi_normal(mu_I,Sigma_I);
  for (i in 1:n_obs) {
      // item[i_loc[i],1] represents the log of parameter alpha of the (i_loc[i])th item,
      //   which is why we use exp(item[i_loc[i],1]) below
      // item[i_loc[i],2] represents parameter beta of the (i_loc[i])th item
      //person[p_loc[i],1] represents parameter tau_t of the (p_loc[i])th person
      //person[p_loc[i],2] represents parameter tau_c of the (p_loc[i])th person
      real p_t = item[i_loc[i],2] - person[p_loc[i],1];   //expected response time for non-cheating response
      real p_c = item[i_loc[i],2] - person[p_loc[i],2];  //expected response time for cheating response
      // log of probability densities for each combination of two discrete parameters
      // (C,T) = {(0,0),(0,1),(1,0),(1,1)}
      real lprt1 = log1m(pC[i_loc[i]]) + log1m(pH[p_loc[i]]) + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 0, C=0
      real lprt2 = log1m(pC[i_loc[i]]) + log(pH[p_loc[i]])   + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 1, C=0
      real lprt3 = log(pC[i_loc[i]])   + log1m(pH[p_loc[i]]) + normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1]));  // T = 0, C=1
      real lprt4 = log(pC[i_loc[i]])   + log(pH[p_loc[i]])   + normal_lpdf(Y[i] | p_c, 1/exp(item[i_loc[i],1]));  // T = 1, C=1 
      target += log_sum_exp([lprt1, lprt2, lprt3, lprt4]);
  }
}

So two slightly conflicting suggestions:

  1. Try to avoid declaring variables to store the result of an intermediate computation wherever possible. (Apparently more declared variables can induce substantial slowdown for auto-differentiation)

  2. Try to look for redundant computations and, instead of doing them repeatedly, store the result in a declared variable and refer to it where it’s needed (see the sketch below).
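For example, in the loop above the same normal_lpdf(Y[i] | p_t, 1/exp(item[i_loc[i],1])) term is evaluated three separate times per observation. A minimal (untested) sketch of suggestion 2 applied to that loop, at the cost of a couple of extra declared variables (hence the tension with suggestion 1), could look like:

  for (i in 1:n_obs) {
    // compute the scale and the two component densities once per observation
    real sd_i = 1 / exp(item[i_loc[i], 1]);
    real lp_t = normal_lpdf(Y[i] | item[i_loc[i], 2] - person[p_loc[i], 1], sd_i);  // non-cheating density, reused three times
    real lp_c = normal_lpdf(Y[i] | item[i_loc[i], 2] - person[p_loc[i], 2], sd_i);  // cheating density
    target += log_sum_exp([
      log1m(pC[i_loc[i]]) + log1m(pH[p_loc[i]]) + lp_t,   // T = 0, C = 0
      log1m(pC[i_loc[i]]) + log(pH[p_loc[i]])   + lp_t,   // T = 1, C = 0
      log(pC[i_loc[i]])   + log1m(pH[p_loc[i]]) + lp_t,   // T = 0, C = 1
      log(pC[i_loc[i]])   + log(pH[p_loc[i]])   + lp_c    // T = 1, C = 1
    ]);
  }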

And presumably you’re aware of the various parallelization options?

Here’s a partly vectorized and less-compute-redundant version of your model-block for loop (annoyingly, normal_lpdf doesn’t have a non-reducing variant and log_sum_exp doesn’t have a row-wise variant for matrix inputs):

  vector[n_obs] p_t = to_vector(item[i_loc, 2]) - to_vector(person[p_loc, 1]);   // expected log RT, non-cheating
  vector[n_obs] p_c = to_vector(item[i_loc, 2]) - to_vector(person[p_loc, 2]);   // expected log RT, cheating
  vector[n_obs] one_div_exp_item_loc = inv(exp(to_vector(item[i_loc, 1])));      // per-observation residual sd

  vector[n_obs] norm_p_t ;
  vector[n_obs] norm_p_c;
  for (i in 1:n_obs) {
    norm_p_t[i] = normal_lpdf(Y[i] | p_t[i], one_div_exp_item_loc[i]);
    norm_p_c[i] = normal_lpdf(Y[i] | p_c[i], one_div_exp_item_loc[i]);
  }

  vector[n_obs] log1m_pC_loc = log1m(pC[i_loc]);
  vector[n_obs] log_pC_loc = log(pC[i_loc]);
  vector[n_obs] log1m_pH_loc = log1m(pH[p_loc]);
  vector[n_obs] log_pH_loc = log(pH[p_loc]);

  matrix[n_obs,4] lprt;
  lprt[,1] = log1m_pC_loc + log1m_pH_loc + norm_p_t;  // T = 0, C=0
  lprt[,2] = log1m_pC_loc + log_pH_loc   + norm_p_t;  // T = 1, C=0
  lprt[,3] = log_pC_loc   + log1m_pH_loc + norm_p_t;  // T = 0, C=1
  lprt[,4] = log_pC_loc   + log_pH_loc   + norm_p_c;  // T = 1, C=1 
  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]);
  }
  target += lps;

It’s possible to hack a non-reducing normal_lpdf (inspiration) yourself via:

  vector[n_obs] norm_p_t = (
    // std-normal log density after shifting/scaling
    -0.5 * square((to_vector(Y) - p_t) ./ one_div_exp_item_loc)
    - 0.5 * log(2 * pi())
    // jacobian for division by the scale:
    - log(one_div_exp_item_loc)
  ) ;

  vector[n_obs] norm_p_c = (
    // std-normal log density after shifting/scaling
    -0.5 * square((to_vector(Y) - p_c) ./ one_div_exp_item_loc)
    - 0.5 * log(2 * pi())
    // jacobian for division by the scale:
    - log(one_div_exp_item_loc)
  ) ;

But I’m not confident that would be any more performant than the loop above.
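Regarding the parallelization options: in case it’s useful, here’s a rough, untested sketch of what within-chain parallelization via reduce_sum could look like for this likelihood, slicing over observations. The helper name partial_sum, the slice index m, and the grainsize of 1 are just placeholders, and it assumes cmdstan built with threading (STAN_THREADS) and threads_per_chain set:

  functions {
    // illustrative partial sum of the observation-level log likelihood for reduce_sum
    real partial_sum(real[] y_slice, int start, int end,
                     int[] p_loc, int[] i_loc,
                     vector pC, vector pH,
                     vector[] person, vector[] item) {
      real lp = 0;
      for (n in start:end) {
        int m = n - start + 1;                              // position within y_slice
        real sd_n = 1 / exp(item[i_loc[n], 1]);
        real p_t = item[i_loc[n], 2] - person[p_loc[n], 1];
        real p_c = item[i_loc[n], 2] - person[p_loc[n], 2];
        lp += log_sum_exp([
          log1m(pC[i_loc[n]]) + log1m(pH[p_loc[n]]) + normal_lpdf(y_slice[m] | p_t, sd_n),   // T = 0, C = 0
          log1m(pC[i_loc[n]]) + log(pH[p_loc[n]])   + normal_lpdf(y_slice[m] | p_t, sd_n),   // T = 1, C = 0
          log(pC[i_loc[n]])   + log1m(pH[p_loc[n]]) + normal_lpdf(y_slice[m] | p_t, sd_n),   // T = 0, C = 1
          log(pC[i_loc[n]])   + log(pH[p_loc[n]])   + normal_lpdf(y_slice[m] | p_c, sd_n)    // T = 1, C = 1
        ]);
      }
      return lp;
    }
  }

and then in the model block, in place of the observation loop:

  target += reduce_sum(partial_sum, Y, 1,   // the literal 1 is the grainsize; worth tuning or passing as data
                       p_loc, i_loc, pC, pH, person, item);

Whether this actually helps will depend on how many threads you can use and on tuning the grainsize; with fairly cheap per-observation work, the threading overhead can eat much of the gain.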


Thank you so much for the suggestions. I will work on them. I am aware of within-chain parallelization and tried it with this model in the past; for some reason it didn’t make much difference. I will revisit it and see if I can post a version with within-chain parallelization and the performance differences.


@jsocolar pointed me to the SUG (Stan User’s Guide) section on log_sum_exp, inspiring replacement of this:

  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]);
  }

with this:

  vector[n_obs] rowmax_lprt;
  for (i in 1:n_obs) {
    rowmax_lprt[i] = max(lprt[i,]);
  }
  matrix[n_obs,4] exp_lprt_minus_max = exp(lprt - rep_matrix(rowmax_lprt, 4));
  vector[n_obs] rowsum_of_exp_lprt_minus_max;
  for (i in 1:n_obs) {
    rowsum_of_exp_lprt_minus_max[i] = sum(exp_lprt_minus_max[i,]);
  }
  target += rowmax_lprt + log(rowsum_of_exp_lprt_minus_max);
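(Under the hood this is just the identity log_sum_exp(x) = max(x) + log(sum(exp(x - max(x)))), applied row by row; subtracting the row max before exponentiating is what keeps the exp terms from overflowing.)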

The latter version has the potential benefit of vectorized computation of the exp_lprt_minus_max term and of the operations in the final target line, but at the cost of two loops rather than one and more intermediate variable declarations, so I’m not sure whether it’d be faster or possibly slower.

Since there are only 4 columns in lprt, I guess the second loop and its associated intermediate variable could be eliminated:

  target += rowmax_lprt + log(
    exp_lprt_minus_max[,1]
    + exp_lprt_minus_max[,2]
    + exp_lprt_minus_max[,3]
    + exp_lprt_minus_max[,4]
  ) ;