Trying to find a better way to speed up a hierarchical logistic model; reduce_sum only helped a little

Hello everyone,

I have two groups of subjects who participated in a binary choice experiment. The number of observations per subject varies from ~300 to ~1700. I've tried to model each subject's choice behavior with a straightforward logistic model and to compare the effects of the decision variables (i.e., the independent variables in the regression) at the group level. So every subject gets their own set of parameters, constrained by group-level priors.
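
In other words, for subject $s$ in group $g(s)$, with design row $x_{si}$ for trial $i$, the model is roughly

$$y_{si} \sim \operatorname{Bernoulli}\!\big(\operatorname{logit}^{-1}(x_{si}^\top \beta_s)\big), \qquad \beta_s \sim \operatorname{Student\text{-}t}_3\big(\mu_{g(s)}, \sigma_{g(s)}\big),$$

with standard-normal priors on the group means and gamma priors on the group scales (see the Stan code below).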

The model seems to work fine, and the results are what I expect and not too different from the MLE estimates. I'm wondering if there is a better way to speed it up. I'd like to vectorize as much as possible, but it's a little tricky because each subject has their own beta and block of the design matrix, so I can't see a way to do a single matrix multiplication for everything rather than one per subject.

Here is my code:

data {
  int<lower=1, upper = 2> n_group; // number of groups
  int<lower=1> n_sub; // number of subjects
  int<lower=1> n_dv; // number of decision variables (i.e., independent variables)
  int<lower=1> obs_start[n_sub]; // row index of the start of each subject's observations
  int<lower=1> obs_end[n_sub]; // row index of the end of each subject's observations
  int<lower=1> n_obs_total; // total number of observations across all subjects
  int<lower=1, upper=2> group[n_sub]; // group id for each subject
  int<lower=0, upper=1> y[n_obs_total]; // binary response
  matrix[n_obs_total, n_dv] dv; // design matrix
}

parameters {
  vector[n_dv] mu_grp[n_group];
  vector<lower=0>[n_dv] sigma_grp[n_group];
  vector[n_dv] beta[n_sub];
}

model {
  vector[n_obs_total] dv_x_beta;

  for (g in 1:n_group) {
    mu_grp[g] ~ std_normal();
    sigma_grp[g] ~ gamma(2, 0.1);
  }

  for (s in 1:n_sub) {
    beta[s] ~ student_t(3, mu_grp[group[s]], sigma_grp[group[s]]);
    dv_x_beta[obs_start[s]:obs_end[s]] = dv[obs_start[s]:obs_end[s]] * beta[s];
  }

  y ~ bernoulli_logit(dv_x_beta);

}

generated quantities {
  vector[n_dv] mu_grp_diff;
  vector[n_dv] sigma_grp_diff;

  // compare group-level differences in parameter estimates
  mu_grp_diff = mu_grp[1] - mu_grp[2]; 
  sigma_grp_diff = sigma_grp[1] - sigma_grp[2];
}

With 53 subjects and 48,503 observations in total, this model took ~450 s to finish with 4 parallel chains and 500 iterations each of warmup and sampling (1,000 iterations per chain in total).

I also tried a model using reduce_sum, but it didn't improve things as much as I expected. My computer has 16 physical cores, and the run took ~420 s with 4 parallel chains and 4 threads per chain. I suspect this is because the partial_sum function computes the dot product for each observation in a for-loop, which is not very efficient.

The reduce_sum version is as follows:

functions {
  // partial log-likelihood for the slice of observations [start, end]
  real partial_sum_lpmf(int[] y_slice, int start, int end, int[] sub, matrix dv, vector[] beta) {
    int N = end - start + 1;
    vector[N] dv_x_beta;
    for (n in 1:N) {
      int nn = n + start - 1; // global observation index
      dv_x_beta[n] = dv[nn] * beta[sub[nn]];
    }
    return bernoulli_logit_lupmf(y_slice | dv_x_beta);
  }
}

data {
  int<lower=1, upper = 2> n_group;
  int<lower=1> n_sub;
  int<lower=1> n_dv;
  int<lower=1> obs_start[n_sub];
  int<lower=1> obs_end[n_sub];
  int<lower=1> n_obs_total;
  int<lower=1, upper=2> group[n_sub];
  int<lower=1> sub[n_obs_total]; // subject id for each observation
  int<lower=0, upper=1> y[n_obs_total];
  matrix[n_obs_total,n_dv] dv;
  int<lower=1> grainsize; // passed in as data (set to 1)
}

parameters {
  vector[n_dv] mu_grp[n_group];
  vector<lower=0>[n_dv] sigma_grp[n_group];
  vector[n_dv] beta[n_sub];
}

model {
  for (g in 1:n_group) {
    mu_grp[g] ~ std_normal();
    sigma_grp[g] ~ gamma(2, 0.1);
  }

  for (s in 1:n_sub) {
    beta[s] ~ student_t(3, mu_grp[group[s]], sigma_grp[group[s]]);
  }

  target += reduce_sum(partial_sum_lupmf, y, grainsize, sub, dv, beta);
}

generated quantities {
  vector[n_dv] mu_grp_diff;
  vector[n_dv] sigma_grp_diff;

  mu_grp_diff = mu_grp[1] - mu_grp[2];
  sigma_grp_diff = sigma_grp[1] - sigma_grp[2];
}
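
One variant I haven't tried would be to let reduce_sum slice over subjects instead of single observations, so that each partial sum keeps the per-subject matrix-vector product. A rough, untested sketch, where sub_index would be a new data variable holding the integers 1 to n_sub, and grainsize would then be counted in subjects rather than observations:

functions {
  // partial log-likelihood for a slice of subjects
  real partial_sum_sub_lpmf(int[] sub_slice, int start, int end,
                            int[] y, int[] obs_start, int[] obs_end,
                            matrix dv, vector[] beta) {
    real lp = 0;
    for (i in 1:size(sub_slice)) {
      int s = sub_slice[i]; // global subject id within this slice
      lp += bernoulli_logit_lupmf(y[obs_start[s]:obs_end[s]] |
                                  dv[obs_start[s]:obs_end[s]] * beta[s]);
    }
    return lp;
  }
}

and in the model block the call would become target += reduce_sum(partial_sum_sub_lupmf, sub_index, grainsize, y, obs_start, obs_end, dv, beta);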

Any suggestions or good ideas?

Thanks!

Presumably this is something you need to run many times for many datasets?

Take a look at the tricks I used here; you'll have to add the group structure, but otherwise I think the reduced-redundant-computation and sufficient-stats tricks should give you substantial speed-ups.

Oh, I just noticed that the demo doesn't include the use of rows_dot_product, which I recently learned provides an additional speed-up. Lemme edit it…
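
Roughly, the idea applied to your first model would be something like this untested sketch (data, parameters, and generated quantities blocks unchanged; beta_obs is just a scratch name I'm using here):

model {
  matrix[n_obs_total, n_dv] beta_obs; // row n = coefficients of the subject who produced observation n
  vector[n_obs_total] dv_x_beta;

  for (g in 1:n_group) {
    mu_grp[g] ~ std_normal();
    sigma_grp[g] ~ gamma(2, 0.1);
  }

  for (s in 1:n_sub) {
    beta[s] ~ student_t(3, mu_grp[group[s]], sigma_grp[group[s]]);
    beta_obs[obs_start[s]:obs_end[s]]
      = rep_matrix(beta[s]', obs_end[s] - obs_start[s] + 1);
  }

  // one row-wise dot product for all observations at once
  dv_x_beta = rows_dot_product(dv, beta_obs);
  y ~ bernoulli_logit(dv_x_beta);
}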

Oh, I should ask: is there much redundancy in the contents of the dv matrix? For example, if it encodes contrasts for a categorical predictor, there will be few unique entries, hence high redundancy, and the tricks I mention apply. If it's a numeric variable with completely unique values in every entry, the tricks won't work (and will probably slow things down).

Thanks for the reply!

This model actually runs only once, on a single dataset (containing the observations of 53 participants), though I do have other, expanded models built on this simplest one.

There is definitely redundancy in the dv matrix, just as you pointed out: there are contrasts for a categorical predictor. There are some numeric variables as well, but their values are not completely unique; they fall within a limited range. Here's a sample of the data; maybe it will help the discussion.
data.csv (19.9 KB)

The dv matrix contains the variables from kConst to online_cost, and stop is the binary dependent variable. kConst is the intercept, and cost0_1 and cost0_4 are the contrasts encoding a three-level categorical variable (0.0, 0.1, and 0.4). draw is an integer variable ranging from 0 to 20 (so only 21 levels), and online_draw is the interaction (product) of the three-level categorical variable and draw, so it also takes on a limited range of values.

So for the current data, I guess these tricks would apply? But if there were a continuous variable with essentially completely unique values (as is the case for my other, more complex models), the tricks wouldn't work?

I took a look at the sufficient-statistics trick and tried to turn the logistic regression into a binomial regression. It's about 9 times faster! This is what I did (and I hope I did it right): for each participant, I found the unique combinations of decision-variable values and counted how many times stop == 1 occurred within each combination, which reduced the total number of observations to only 2,491 rows. Here is the example data:
data_binom.csv (1.8 KB), where n_stopped is the number of times the subject chose to stop and n_total is the total number of times the subject had to choose whether or not to stop.
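
If I understand the trick correctly, this works because, for rows that share the same predictor values (and hence the same success probability $p$), the sum of their Bernoulli log-likelihoods depends on the data only through the number of successes $k$ out of $n$ trials:

$$\sum_{i=1}^{n}\log\operatorname{Bernoulli}(y_i \mid p) = k\log p + (n-k)\log(1-p) = \log\operatorname{Binomial}(k \mid n, p) - \log\binom{n}{k},$$

so the aggregated binomial model differs from the original target only by a constant and gives the same posterior.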

And this is the revised model:

data {
  int<lower=1, upper = 2> n_group;
  int<lower=1> n_sub;
  int<lower=1> n_dv;
  int<lower=1> obs_start[n_sub];
  int<lower=1> obs_end[n_sub];
  int<lower=1> n_obs_total;
  int<lower=1, upper=2> group[n_sub];
  int<lower=1> sub[n_obs_total];
  int<lower=0> n_stopped[n_obs_total]; // number of times the event (stop) occurred
  int<lower=0> n_total[n_obs_total]; // total number of trials
  matrix[n_obs_total,n_dv] dv;
}

parameters {
  vector[n_dv] mu_grp[n_group];
  vector<lower=0>[n_dv] sigma_grp[n_group];
  vector[n_dv] beta[n_sub];
}

model {
  vector[n_obs_total] dv_x_beta;

  for (g in 1:n_group) {
    mu_grp[g] ~ std_normal();
    sigma_grp[g] ~ gamma(2, 0.1);
  }

  for (s in 1:n_sub) {
    beta[s] ~ student_t(3, mu_grp[group[s]], sigma_grp[group[s]]);
    dv_x_beta[obs_start[s]:obs_end[s]] = dv[obs_start[s]:obs_end[s]] * beta[s];
  }

  n_stopped ~ binomial_logit(n_total, dv_x_beta);

}

generated quantities {
  vector[n_dv] mu_grp_diff;
  vector[n_dv] sigma_grp_diff;
 
  mu_grp_diff = mu_grp[1] - mu_grp[2];
  sigma_grp_diff = sigma_grp[1] - sigma_grp[2];
}