# Multilevel hurdle model -- no performance increase with reduce_sum()

Hi all,

I'm new to Stan and could use some help. I have 96 threads available, but I can't seem to specify reduce_sum in a way that speeds up my model. I'm too new to Stan to know how to use reduce_sum properly, so I'd appreciate as much detail as possible in any answer.

On a 10K sample the original model takes 15 minutes; on the full N = 900K it takes about 72 hours.

Original model spec:

```stan
data {
  int<lower=1> N;
  matrix[N, 5] X;
  vector[N] y;
  int<lower=1> K1[1];
  int<lower=1, upper=max(K1)> L1[1, N];
}
parameters {
  real mu_a;
  real mu_b1;
  real mu_b2;
  real log_sigma_a;
  real log_sigma_b;
  real theta_a;
  real theta_b;
  real theta_n;
  real theta_n0;
  real<lower=0> dmu_sc_tau;
  vector[K1[1]] dmu_sc;
}
model {
  vector[N] mu = mu_a + mu_b1 * X[:, 1] + mu_b2 * square(X[:, 1])
                 + dmu_sc_tau * dmu_sc[L1[1]];
  vector[N] sigma = exp((log_sigma_a - 3) + (log_sigma_b - 3) * X[:, 1]);
  vector[N] theta = theta_a + theta_b * X[:, 1] + theta_n * X[:, 2]
                    + theta_n0 * X[:, 4];
  vector[N] p = inv_logit(theta);

  mu_a  ~ normal(0, 0.3);
  mu_b1 ~ normal(0, 1);
  mu_b2 ~ normal(0, 1);
  log_sigma_a ~ normal(0, 2);
  log_sigma_b ~ normal(0, 6);
  theta_a  ~ normal(0, 1);
  theta_b  ~ normal(0, 3);
  theta_n  ~ normal(0, 6);
  theta_n0 ~ normal(6, 6);

  dmu_sc_tau ~ normal(0, 0.2);
  dmu_sc ~ std_normal();

  for (i in 1:N) {
    if (y[i] == 0) {
      target += log(p[i]);
    } else {
      target += log(1 - p[i]) + normal_lpdf(y[i] | mu[i], sigma[i]);
    }
  }
}
```

My attempt at reduce_sum, which runs about 50% longer:

```stan
functions {
  real partial_sum(int[] dummy_index, int start, int end,
                   vector y, vector p, vector mu, vector sigma) {
    real psum = 0;
    for (i in start:end) {
      if (y[i] == 0) {
        psum += log(p[i]);
      } else {
        psum += log(1 - p[i]) + normal_lpdf(y[i] | mu[i], sigma[i]);
      }
    }
    return psum;
  }
}
data {
  int<lower=1> N;
  matrix[N, 5] X;
  vector[N] y;
  int<lower=1> grainsize;
  int<lower=1> K1[1];
  int<lower=1, upper=max(K1)> L1[1, N];
}
transformed data {
  int dummy_index[N] = rep_array(1, N);
}
parameters {
  real mu_a;
  real mu_b1;
  real mu_b2;
  real log_sigma_a;
  real log_sigma_b;
  real theta_a;
  real theta_b;
  real theta_n;
  real theta_n0;
  real<lower=0> dmu_sc_tau;
  vector[K1[1]] dmu_sc;
}
model {
  vector[N] mu = mu_a + mu_b1 * X[:, 1] + mu_b2 * square(X[:, 1])
                 + dmu_sc_tau * dmu_sc[L1[1]];
  vector[N] sigma = exp((log_sigma_a - 3) + (log_sigma_b - 3) * X[:, 1]);
  vector[N] theta = theta_a + theta_b * X[:, 1] + theta_n * X[:, 2]
                    + theta_n0 * X[:, 4];
  vector[N] p = inv_logit(theta);

  mu_a  ~ normal(0, 0.3);
  mu_b1 ~ normal(0, 1);
  mu_b2 ~ normal(0, 1);
  log_sigma_a ~ normal(0, 2);
  log_sigma_b ~ normal(0, 6);
  theta_a  ~ normal(0, 1);
  theta_b  ~ normal(0, 3);
  theta_n  ~ normal(0, 6);
  theta_n0 ~ normal(6, 6);
  dmu_sc_tau ~ normal(0, 0.2);
  dmu_sc ~ std_normal();

  target += reduce_sum(partial_sum, dummy_index, grainsize, y, p, mu, sigma);
}
```

Hi,

First, you should figure out which part of your model is the bottleneck. You can find that out with profiling.

```stan
data {
  int<lower=1> N;
  matrix[N, 5] X;
  vector[N] y;
  int<lower=1> K1[1];
  int<lower=1, upper=max(K1)> L1[1, N];
}
parameters {
  real mu_a;
  real mu_b1;
  real mu_b2;
  real log_sigma_a;
  real log_sigma_b;
  real theta_a;
  real theta_b;
  real theta_n;
  real theta_n0;
  real<lower=0> dmu_sc_tau;
  vector[K1[1]] dmu_sc;
}
model {
  vector[N] mu;
  vector[N] sigma;
  vector[N] theta;
  vector[N] p;

  profile("mu") {
    mu = mu_a + mu_b1 * X[:, 1] + mu_b2 * square(X[:, 1])
         + dmu_sc_tau * dmu_sc[L1[1]];
  }
  profile("sigma") {
    sigma = exp((log_sigma_a - 3) + (log_sigma_b - 3) * X[:, 1]);
  }
  profile("theta") {
    theta = theta_a + theta_b * X[:, 1] + theta_n * X[:, 2]
            + theta_n0 * X[:, 4];
  }
  profile("p") {
    p = inv_logit(theta);
  }

  mu_a  ~ normal(0, 0.3);
  mu_b1 ~ normal(0, 1);
  mu_b2 ~ normal(0, 1);
  log_sigma_a ~ normal(0, 2);
  log_sigma_b ~ normal(0, 6);
  theta_a  ~ normal(0, 1);
  theta_b  ~ normal(0, 3);
  theta_n  ~ normal(0, 6);
  theta_n0 ~ normal(6, 6);

  dmu_sc_tau ~ normal(0, 0.2);
  dmu_sc ~ std_normal();

  profile("for_loop") {
    for (i in 1:N) {
      if (y[i] == 0) {
        target += log(p[i]);
      } else {
        target += log(1 - p[i]) + normal_lpdf(y[i] | mu[i], sigma[i]);
      }
    }
  }
}
```

That should give you a good feel for where to focus your parallelization efforts.


Took me a minute to realize I needed to switch from rstan (dev) to cmdstanr. Here's an image of the profiling output when running on a sample of N = 10K (full N = 900K).


After seeing that each part was taking its fair share of time, I moved everything from the model block into the reduce_sum call. It took me a while to realize that even though the global data passed to reduce_sum is big, it is significantly faster to access it directly (indexed by `start:end`) than to extract and save small slices of it inside the partial-sum function. Once I avoided that, I got the expected performance gains.
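For reference, a sketch of that kind of refactor for the model above (the function name and argument order are illustrative; I'm guessing at the exact details): slice `y`, pass `X` and the parameters as shared arguments, and compute `mu`, `sigma`, and `p` inside the slice by indexing `X` directly rather than copying sub-slices out of it.

```stan
functions {
  // Illustrative sketch: y is the sliced argument; X and the group index
  // are shared data indexed directly with i in start:end.
  real partial_sum2(vector y_slice, int start, int end,
                    matrix X, int[] group,
                    real mu_a, real mu_b1, real mu_b2,
                    real log_sigma_a, real log_sigma_b,
                    real theta_a, real theta_b, real theta_n, real theta_n0,
                    real dmu_sc_tau, vector dmu_sc) {
    real psum = 0;
    for (i in start:end) {
      int j = i - start + 1;  // position within the slice
      real mu = mu_a + mu_b1 * X[i, 1] + mu_b2 * square(X[i, 1])
                + dmu_sc_tau * dmu_sc[group[i]];
      real sigma = exp((log_sigma_a - 3) + (log_sigma_b - 3) * X[i, 1]);
      real p = inv_logit(theta_a + theta_b * X[i, 1]
                         + theta_n * X[i, 2] + theta_n0 * X[i, 4]);
      if (y_slice[j] == 0) {
        psum += log(p);
      } else {
        psum += log(1 - p) + normal_lpdf(y_slice[j] | mu, sigma);
      }
    }
    return psum;
  }
}
```

with the model block reduced to the priors plus

```stan
  target += reduce_sum(partial_sum2, y, grainsize, X, L1[1],
                       mu_a, mu_b1, mu_b2, log_sigma_a, log_sigma_b,
                       theta_a, theta_b, theta_n, theta_n0,
                       dmu_sc_tau, dmu_sc);
```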

Thanks for the profiling tip! Looking forward to it being in rstan someday.


Yeah, slicing data is really not worth it. One should always slice parameters if possible.
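As a generic illustration (not from the model above), when there is a per-observation parameter vector, making it the sliced first argument means each worker only carries its own chunk of the autodiff graph, while shared data like `y` is indexed with `start:end`:

```stan
functions {
  // Illustrative only: slice the parameter vector beta; y stays shared.
  real beta_partial_sum(vector beta_slice, int start, int end, vector y) {
    return normal_lpdf(y[start:end] | beta_slice, 1);
  }
}
// in the model block:
//   target += reduce_sum(beta_partial_sum, beta, grainsize, y);
```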

I think it is in rstan, just not the CRAN version; you can get the latest rstan via the stan-dev r-packages repository. OK, my bad, it looks like profiling is not supported in `rstan` after all; not sure where I got that idea from, sorry for the confusion.