Parallelization via reduce_sum for hierarchical model

Hi,

I’m attempting to parallelize a set of hierarchical models using reduce_sum. As a simple, illustrative example, the code below is an attempt to extend the multilevel 2PL IRT model from the Stan user guide (pg. 29-31), with the only difference being that the data I’m using are structured in long/melted format, analogous to the data structure in the hierarchical logistic regression example (pg. 25). So if we have J students who were all administered K items (no missingness), the response vector y contains N = J*K observations.
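
For example, with J = 2 students and K = 3 items (so N = 6), the long-format data would look like this (the y values are just made up for illustration):

jj:  1  1  1  2  2  2
kk:  1  2  3  1  2  3
y:   1  0  1  0  0  1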

In the 1st code example below, slicing is done over y (y_slice), for the arbitrary reason that this is what’s been chosen for slicing in almost every example I’ve come across. Because y varies across the N observations, my understanding is that slicing in this manner is done over each of the N = J*K individual observations. Let me know if I’m wrong on that.

Problem: I’m interested in slicing over the J students instead of the N observations to see how that affects run time. I’ve spent a good amount of time trying a variety of approaches to set the model up, but nothing has worked – either the estimates are wrong or the model fails for other reasons. As mentioned, I’ve also not seen any straightforward examples that show how slicing is done at levels above individual observations, so there’s not much to adapt from. (The Stan user guide has an example of parallelizing a hierarchical model using map_rect, but I don’t believe there’s an analogous example using reduce_sum.) Where I’ve seen this issue discussed elsewhere online, the suggestion has been to slice over the observed predictor data x. However, there are no observed predictor data in a 2PL IRT model, so I’m not sure this applies. I’ve tried various other approaches, including slicing over the alpha vector (as per the 2nd example below) and an approach similar to what brms does of generating a sequence of indices to slice over (see the sketch after the 2nd example). None has worked, so clearly I’m going wrong somewhere.

Could you provide any thoughts as to how to approach this? Thanks for any ideas.

Original model (slicing over y)

functions {
  real partial_sum(int[] y_slice,
                   int start,
                   int end,
                   vector gamma,
                   vector alpha,
                   vector beta,
                   int[] kk,
                   int[] jj) {
    int N = end - start + 1;
    vector[N] mu;
    for (n in 1:N) {
      mu[n] = gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]);
    }
    return bernoulli_logit_lpmf(y_slice | mu);
  }
}
data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student associated with observation n
  int<lower=1,upper=K> kk[N];  // question associated with observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}
parameters {
  real mu_beta;                // mean question difficulty
  vector[J] alpha;             // ability for j - mean
  vector[K] beta;              // difficulty for k
  vector<lower=0>[K] gamma;    // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}
model {
  int grainsize = 1;
  alpha ~ std_normal();
  beta ~ normal(mu_beta, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  target += reduce_sum(partial_sum, y, grainsize, gamma, alpha, beta, kk, jj);
}

Slicing using alpha

functions {
  real partial_sum(vector[] a_slice,
                   int start,
                   int end,
                   int[] y,
                   vector gamma,
                   vector beta,
                   int[] kk) {
    return bernoulli_logit_lpmf(y[start:end] | gamma[kk] .* (a_slice - beta[kk]));
  }
}
data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student associated with observation n
  int<lower=1,upper=K> kk[N];  // question associated with observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}
parameters {
  real mu_beta;                // mean question difficulty
  vector[J] alpha;             // ability for j - mean
  vector[K] beta;              // difficulty for k
  vector<lower=0>[K] gamma;    // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}
model {
  int grainsize = 1;
  alpha ~ std_normal();
  beta ~ normal(mu_beta, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  target += reduce_sum(partial_sum, alpha, grainsize, y, gamma, beta, kk);
}
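
For concreteness, the index-sequence idea mentioned above is aiming at something of this general shape: pass an integer sequence seq = 1, ..., J as data, slice over it, and let each partial sum handle whole students. This is only a sketch of the intent – it assumes the data are sorted by student so that each student's K rows are contiguous, and the names seq and partial_sum_students are just placeholders – and my actual attempts along these lines differed in detail and didn't recover the right estimates:

real partial_sum_students(int[] seq_slice,
                          int start,
                          int end,
                          int[] y,
                          vector gamma,
                          vector alpha,
                          vector beta,
                          int[] kk,
                          int K) {
  int n_students = end - start + 1;
  vector[n_students * K] mu;
  int pos = 1;
  for (j in 1:n_students) {
    for (k in 1:K) {
      // global row of the k-th response of student seq_slice[j]
      int row = (seq_slice[j] - 1) * K + k;
      mu[pos] = gamma[kk[row]] * (alpha[seq_slice[j]] - beta[kk[row]]);
      pos += 1;
    }
  }
  return bernoulli_logit_lpmf(y[((start - 1) * K + 1):(end * K)] | mu);
}

with the call being something like target += reduce_sum(partial_sum_students, seq, grainsize, y, gamma, alpha, beta, kk, K);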

Hi,
sorry for not getting back to you earlier - did you manage to resolve the problem in the meantime?

In particular, it appears to me that your “original model” is already likely wrong, because every partial sum will start from jj[1] and kk[1], i.e. the results will change depending on how the scheduler slices your data. You probably need something like:

for (n in 1:N) {
  int index = n + start - 1;
  mu[n] = gamma[kk[index]] * (alpha[jj[index]] - beta[kk[index]]);
}
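
For reference, the whole partial_sum with that offset applied would look something like this (just a sketch, I haven't run it):

real partial_sum(int[] y_slice,
                 int start,
                 int end,
                 vector gamma,
                 vector alpha,
                 vector beta,
                 int[] kk,
                 int[] jj) {
  int N = end - start + 1;
  vector[N] mu;
  for (n in 1:N) {
    // map the local position n within the slice to the global observation index
    int index = n + start - 1;
    mu[n] = gamma[kk[index]] * (alpha[jj[index]] - beta[kk[index]]);
  }
  return bernoulli_logit_lpmf(y_slice | mu);
}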

Does that make sense?

Best of luck with your model!

Hi!

I don’t know whether this topic is still relevant, but over the last few days I was working on the same model to get a grasp of the reduce_sum functionality (I want to apply this to a more complex model later on, so I wanted to start simple).

In the first variant, I also sliced over y. Compared to the serial model, the parallel model ran 3.6 times faster, which is impressive (I = 20 items, J = 3600 individuals, and 2 parallel chains with 6 threads per chain on a Ryzen 9 5900X with 32 GB RAM).

In the second variant, I sliced over the theta parameter. Or at least I think I did:

functions {
  real partial_sum_lpmf(int[] slice_jj,
                        int start, int end,
                        int[] y,
                        int[] ii,
                        vector alpha,
                        vector beta,
                        vector theta) {
    
    return bernoulli_logit_lpmf(y[start:end] |
             alpha[ii[start:end]] .* (theta[slice_jj] - beta[ii[start:end]]));
  }
}
data {
  int<lower=1> I;               // # questions
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1, upper=I> ii[N];  // question for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
}
[...]

The speedup was similar to that of the first variant of the model. I was wondering whether this might be due to my misunderstanding the slicing over theta (since I am using the index jj to slice), or due to the data structure, which stays the same, or maybe both. I would be very happy if someone could provide further insights (although, as already mentioned, the speedup achieved by slicing over observations is already really impressive)!
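
To make the distinction concrete, what I had originally pictured as slicing over theta itself would be passing theta as the sliced argument, roughly like this (untested sketch; it assumes the data are sorted by person so that each person's I rows are contiguous, and partial_sum_theta is just a placeholder name):

real partial_sum_theta(vector theta_slice,
                       int start, int end,
                       int[] y,
                       int[] ii,
                       vector alpha,
                       vector beta,
                       int I) {
  int n_persons = end - start + 1;
  vector[n_persons * I] mu;
  int pos = 1;
  for (j in 1:n_persons) {
    for (i in 1:I) {
      // global row of the i-th response of person (start + j - 1)
      int row = (start + j - 2) * I + i;
      mu[pos] = alpha[ii[row]] * (theta_slice[j] - beta[ii[row]]);
      pos += 1;
    }
  }
  return bernoulli_logit_lpmf(y[((start - 1) * I + 1):(end * I)] | mu);
}

called with target += reduce_sum(partial_sum_theta, theta, grainsize, y, ii, alpha, beta, I). In my actual runs I used the jj-based version shown above, hence my uncertainty.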

Best wishes
Chris
