Within chain parallelization for a latent variable model

I am fitting a latent variable model. My original code runs with no problem for both real and simulated data; however, it takes too much time. I have access to a computing cluster with as many as 196 cores, so I think I can significantly speed up things. I tried to follow the example at this link to do within-chain parallelization, but I am stuck and don’t want to waste more time if it is not something possible.

Below, I will briefly describe the model. I will provide the original code (may not be optimal, but best I could write and runs with no issues). At the very end, I will also provide my attempt to rewrite it for within-chain parallelization and errors I am getting. I appreciate any advice or guidance in the right direction.

Thank you in advance.

Model specification

Y_{ij} | \tau_{tj}, \tau_{cj}, \beta_i, \alpha_i, T_j, C_i \sim N(\mu_{ij},\sigma_i)

\mu_{ij} = (\beta_i - \tau_{tj})^{1-T_j} \times \left [(1-C_i) \times (\beta_i - \tau_{tj}) + C_i \times (\beta_i - \tau_{cj})\right ]^{T_j}
\sigma_i = \frac{1}{\alpha_i^2}

Observed Data

  • Y_{ij}, log of observed response time for individual j on item i
  • C_i, an indicator variable for the status of item i (1 = compromised, 0: not compromised)

To be estimated:

  • \beta_i, time-intensity parameter for item i
  • \alpha_i, time discrimination parameter for item i
  • \tau_{tj}, latent speed for individual j when responding to an uncompromised item
  • \tau_{cj}, latent speed for individual j when responding to a compromised item
  • T_j, an indicator variable for person status (1: a cheater individual with prior knowledge of item, 0: an honest examinee with no prior knowledge of item)

T_j is deterministically defined at every draw such that

T_j = 1 if \tau_{cj} > \tau_{tj} and T_j = 0 if \tau_{tj} > \tau_{cj}.

Original model syntax that runs successfully

data{
 int <lower=1> I;                           // number of items                   
 int <lower=1> J;                           // number of individuals                     
 int <lower=1> n_obs;                       // number of observations (I x J)
 int <lower=1> ind_person_obs[n_obs];       // index variable for individuals positions
 int <lower=1> ind_item_obs[n_obs];         // index variable for items' position
 int <lower=0,upper=1> i_status_obs[n_obs]; // index variable for item status
 real Y[n_obs];                             // log of observed response time                  
}

parameters {
 vector[J] beta;                            // time-intensity parameters
 vector <lower=0> [J]  alpha;               //time discrimination parameters
 vector[I] tau_t;                           // individual latent speed parameters for uncompromised items 
 vector[I] tau_c;                          // individual latent speed parameters for compromised items 
 real mu1;                                  // hyperparameter for the mean of beta
 real<lower=0> sigma1;            // hyperparameter for the sd of beta 
 real<lower=0> sigma_t;           // hyperparameter for the sd of tau_t
 real<lower=0> sigma_c;          // hyperparameter for the sd of tau_c 
}

transformed parameters {

 vector[I] T;
 
 for (i in 1:n_obs) {
   if(tau_t[ind_person_obs[i]]>tau_c[ind_person_obs[i]])
     
     T[ind_person_obs[i]] = 0;
   
   else 
     
     T[ind_person_obs[i]] = 1;
 }
}


model{
 
 sigma_t ~ exponential(1);
 sigma_c ~ exponential(1);
 
 tau_t    ~ normal(0,sigma_t);
 tau_c    ~ normal(0,sigma_c);
 
   mu1      ~ normal(3.98,1);
   sigma1   ~ exponential(1);
 beta     ~ normal(mu1,sigma1);

 alpha    ~ inv_gamma(800,1550);
 
 for (i in 1:n_obs) {
   
   real p_t = beta[ind_item_obs[i]]-tau_t[ind_person_obs[i]];
   real p_c = beta[ind_item_obs[i]]-tau_c[ind_person_obs[i]];
   
   real p = (p_t^(1-T[ind_person_obs[i]]))*
     (((1-i_status_obs[i])*p_t + 
         (i_status_obs[i])*p_c)^T[ind_person_obs[i]]);

   Y[i] ~ normal(p,1/(alpha[ind_item_obs[i]]^2));
 }
}

My (unsuccessful) attempt for within-chain parallelization

I tried to replace i with start:end and put everything in normal_lpdf function, but I think I don’t have a good understanding of how this works. This is the error message I received:

Compiling Stan program...
|
Semantic error in 'C:/Users/cengiz/AppData/Local/Temp/RtmpgLAk9E/model-3c2462812197.stan', line 13, column 40 to column 63:
   -------------------------------------------------
    11:                     vector T) {
    12:      
    13:      return normal_lpdf(slice_Y | ((beta[ind_item_obs[start:end]]-tau_t[ind_person_obs[start:end]])^(1-T[ind_person_obs[start:end]]))*
                                                 ^
    14:                      (((1-i_status_obs[start:end])*(beta[ind_item_obs[start:end]]-tau_t[ind_person_obs[start:end]]) + 
    15:                          (i_status_obs[start:end])*beta[ind_item_obs[start:end]]-tau_c[ind_person_obs[start:end]])^T[ind_person_obs[start:end]]),
   -------------------------------------------------

Only expressions of array, matrix, row_vector and vector type may be indexed. Instead, found type int.

mingw32-make.exe: *** [make/program:53: C:/Users/cengiz/AppData/Local/Temp/RtmpgLAk9E/model-3c2462812197.hpp] Error 1
Error: An error occured during compilation! See the message above for more information.

Within-chain parallelization code:

functions {
  real partial_sum(real[] slice_Y,
                   int start, int end,
                   int ind_person_obs,
                   int ind_item_obs,
                   int i_status_obs,
                   vector beta,
                   vector alpha,
                   vector tau_t,
                   vector tau_c,
                   vector T) {
    
    return normal_lpdf(slice_Y | ((beta[ind_item_obs[start:end]]-tau_t[ind_person_obs[start:end]])^(1-T[ind_person_obs[start:end]]))*
                    (((1-i_status_obs[start:end])*(beta[ind_item_obs[start:end]]-tau_t[ind_person_obs[start:end]]) + 
                        (i_status_obs[start:end])*beta[ind_item_obs[start:end]]-tau_c[ind_person_obs[start:end]])^T[ind_person_obs[start:end]]),
                  1/(alpha[ind_item_obs[start:end]]^2));
  }
                   
  }


data{
  int <lower=1> I;                      
  int <lower=1> J;                      
  int <lower=1> n_obs;                  
  int <lower=1> ind_person_obs[n_obs];  
  int <lower=1> ind_item_obs[n_obs];    
  int <lower=0,upper=1> i_status_obs[n_obs];
  real Y[n_obs];                       
}

parameters {
  vector[J] beta;             
  vector <lower=0> [J]  alpha;  
  vector[I] tau_t;            
  vector[I] tau_c;            
  real mu1;
  real<lower=0> sigma1;
  real<lower=0> sigma_t;
  real<lower=0> sigma_c;
}

transformed parameters {
  vector[I] T;
  
  for (i in 1:n_obs) {
    if(tau_t[ind_person_obs[i]]>tau_c[ind_person_obs[i]])
      
      T[ind_person_obs[i]] = 0;
    
    else 
      
      T[ind_person_obs[i]] = 1;
  }
}


model{
  
  sigma_t ~ exponential(1);
  sigma_c ~ exponential(1);
  
  tau_t    ~ normal(0,sigma_t);
  tau_c    ~ normal(0,sigma_c);
  
  mu1      ~ normal(3.98,1);
  sigma1   ~ exponential(1);
  beta     ~ normal(mu1,sigma1);
  
  alpha    ~ inv_gamma(800,1550);
  
  target += reduce_sum(partial_sum, Y , grainsize, ind_person_obs, ind_item_obs,
                       i_status_obs, beta, alpha, tau_t, tau_c, T);
}









One thing is that in your reduce_sum function, you need to do element-wise operations: so your ^ methods should be cast to .^ to that the powers are done element-wise.

A good practice is to do an intermediate step of substituting your reduce_sum logic into your model block. What I mean by that is to replace your i loop in your model block with a loop over your normal_lpdf vectorized code.

Then when/if that works, your can move to the reduce_sum logic.

2 Likes