Reduce-sum parallelization

I’ve noticed some peculiar behavior for a relatively simple model whose likelihood is computed via reduce_sum: the larger the dataset, the lower the computational resource usage, both on the GPU via OpenCL and on the CPU via TBB. In fact, for a very large dataset, performance using the GPU is even worse than without any parallelization. This is counter to what I expected, since larger chunks of data should keep individual cores busy for longer. Also, the copy load to and from the GPU that the Windows task manager displays never exceeds 1%, so it hardly seems like that would be the bottleneck. The problem does not seem to be tied to a specific hardware configuration, as my personal Windows computer and two large (separate) Linux machines all exhibit the same behavior.

Can this really all be attributed to the overhead caused by copying data back and forth between the master and its worker threads?
Is this a common behavior, and if so, is there any way to improve it?
Is there any more Stan-specific information for improving parallelization efficiency beyond the official documentation?

For me, this poses a major obstacle, because for large datasets (100k observations, 10 variables), the estimator is running effectively on a single CPU core per chain.

For reference, I’ve simply rewritten the multivariate probit model from the documentation using reduce_sum for the likelihood:

functions {
  int sum2d(array[,] int a) {
    int s = 0;
    for (i in 1 : size(a)) {
      s += sum(a[i]);
    }
    return s;
  }
  real partial_sum_lpdf(array[] matrix comb, int start, int end, vector beta,
                        matrix lchol, int D) {
    array[end - start + 1] vector[D] xb;
    for (i in 1 : (end - start + 1)) {
      xb[i] = comb[i,  : , 2 : (D + 2)] * beta;
    }
    return multi_normal_cholesky_lupdf(comb[ : ,  : , 1] | xb, lchol);
  }
}
data {
  int<lower=1> D;
  int<lower=1> N;
  array[N, D] int<lower=0, upper=1> y;
  // array[N] matrix[D, D + K] x;
  matrix[N * D, D + 1] x;
  int M; // grainsize passed to reduce_sum
}
transformed data {
  int<lower=0> N_pos;
  array[sum2d(y)] int<lower=1, upper=N> n_pos;
  array[size(n_pos)] int<lower=1, upper=D> d_pos;
  int<lower=0> N_neg;
  array[(N * D) - size(n_pos)] int<lower=1, upper=N> n_neg;
  array[size(n_neg)] int<lower=1, upper=D> d_neg;
  
  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1 : N) {
      for (d in 1 : D) {
        if (y[n, d] == 1) {
          n_pos[i] = n;
          d_pos[i] = d;
          i += 1;
        } else {
          n_neg[j] = n;
          d_neg[j] = d;
          j += 1;
        }
      }
    }
  }
}
parameters {
  vector[D + 1] beta;
  cholesky_factor_corr[D] L_Omega;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
}
transformed parameters {
  array[N] vector[D] z;
  profile("utilparsepos") {
    for (n in 1 : N_pos) {
      z[n_pos[n], d_pos[n]] = z_pos[n];
    }
  }
  profile("utilparseneg") {
    for (n in 1 : N_neg) {
      z[n_neg[n], d_neg[n]] = z_neg[n];
    }
  }
  array[N] matrix[D, D + 2] comb;
  profile("combine") {
    for (i in 1 : N) {
      comb[i] = append_col(z[i], x[((i - 1) * D + 1) : (i * D),  : ]);
    }
  }
}
model {
  profile("params") {
    L_Omega ~ lkj_corr_cholesky(10);
    beta ~ normal(-0.5, 0.5);
  }
  profile("util") {
    target += reduce_sum(partial_sum_lupdf, comb, M, beta, L_Omega, D);
  }
}
generated quantities {
  corr_matrix[D] Omega = multiply_lower_tri_self_transpose(L_Omega);
}

How big are N, D, and M?

M is just the grainsize parameter, D is 10, and N I’ve been varying between 1k and 100k. For higher N, resource utilization decreases significantly.

Is the indexing getting done right? It seems that you’re always indexing comb starting at 1, but shouldn’t it be starting at start?

So that’s what I figured. I asked about the grainsize parameter because I wanted to know which grainsize you’re actually using. But, if I’m understanding your code correctly, the length of z is N*D. If that’s right, it means that not only are you sampling up to a million parameters (N × D = 100,000 × 10) from improper prior distributions for every leapfrog step, but you’re also doing up to a million iterative steps to build z sequentially for every leapfrog step. And this happens before execution ever reaches the parallelized part of your code in the model block. I’m wondering whether the computational time for the sequential part of your code simply dwarfs the computational time for the parallelized part, so you never actually catch it in the parallelized regime to see it using more resources.

Thanks for the replies. @mhollanders The partial arrays that reduce_sum hands to the worker function are separate arrays, i.e., not “array views” as some libraries handle them, so indexing within the partial sum correctly starts at 1.

@Corey.Plate That’s a good point, but the “utilparsepos” and “utilparseneg” profile sections are actually two to three orders of magnitude less computationally intensive than the “util” and “params” sections. Can you elaborate on why the impropriety of the prior distribution increases computational complexity? That sounds really intriguing!

With regard to the input M, my understanding of the docs was that a grainsize of 1 would let the internal load scheduler handle things, and I did not see any significant variation with higher values.

As a side note: I’ve rewritten the model now using normal_id_glm_lupdf (by normalizing z and X with the Cholesky root), and that uses the GPU beautifully.
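In case it’s useful to anyone, here is a sketch of that reformulation (not my exact code; the zw/xw names are just for illustration, and note that the log-determinant term has to be added back because L_Omega is a parameter):

model {
  L_Omega ~ lkj_corr_cholesky(10);
  beta ~ normal(-0.5, 0.5);
  {
    vector[N * D] zw; // whitened latent utilities L_Omega \ z[n], stacked
    matrix[N * D, D + 1] xw; // whitened design blocks, one D-row block per observation
    for (n in 1 : N) {
      int r1 = (n - 1) * D + 1;
      int r2 = n * D;
      zw[r1 : r2] = mdivide_left_tri_low(L_Omega, z[n]);
      xw[r1 : r2,  : ] = mdivide_left_tri_low(L_Omega, x[r1 : r2,  : ]);
    }
    // adds the same amount to target as summing
    // multi_normal_cholesky(z[n] | x_n * beta, L_Omega) over n
    target += normal_id_glm_lupdf(zw | xw, 0, beta, 1)
              - N * sum(log(diagonal(L_Omega)));
  }
}

Since normal_id_glm has OpenCL support in the Stan Math backend, that is presumably why GPU utilization is so much better with this formulation.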

https://mc-stan.org/docs/reference-manual/blocks.html#gradient-calculation

“The amount of work done by the sampler does depend on the number of unconstrained parameters”

This is the section relevant to your improper priors on the components of z. Having one million such improper prior distribution draws may make the following statement (emphasis mine) untrue for your model:

“but this is usually dwarfed by the gradient calculations.”

In general, a conservative approach is to divide the length of the thing you are slicing (e.g., the number of observations) by the number of cores or threads (these mean different things on different machines) you’d like to use for each chain. Assuming roughly equal compute for each slice, each core then gets the largest chunk of work to compute with the least amount of copying. So, if you slice 100K observations and have two threads or cores available per chain, the grainsize would be 100K / 2 = 50K. But if compute time varies a lot across slices and you really need to optimize, perhaps keep halving that amount and re-timing until it stops improving.

That’s just one rough approach to consider, and in my experience it frequently works better than letting the internal scheduler figure it out on its own.
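If you want to make that explicit, you could pass the thread count in as data and compute the grainsize from it (just a sketch; n_threads_per_chain is an illustrative name, not something in the model above):

data {
  int<lower=1> N;
  int<lower=1> n_threads_per_chain; // threads you plan to give each chain
}
transformed data {
  // the heuristic above: length of the sliced container divided by the
  // number of threads, with a floor of 1
  int grainsize = max(N %/% n_threads_per_chain, 1);
}

and then pass grainsize as the third argument to reduce_sum instead of a fixed M.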


I understand your point that “proposing” new values for z may be more costly than evaluating its gradient, but I’m not sure whether I can follow on the impropriety. The model, as I understand it, is a direct representation of the corresponding Gibbs sampler, which, as far as I know, has (conditionally) conjugate and proper priors. This also holds with the z variables adhering to truncated normal distributions. Do you mean that specifically in the HMC context, where constrained variables need to be mapped to the unconstrained space, the constraints on z effect an impropriety of the priors on beta and sigma?

Since z_pos and z_neg have no sampling statements in your code, and are only either upper or lower bounded, they are uniform half-lines, which are improper priors.

"Examples of improper priors include:

The uniform distribution on an infinite interval (i.e., a half-line or the entire real line)."
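In Stan terms, that is the situation of a parameter declared with only a bound and never given a sampling statement (a minimal illustration, not your model):

parameters {
  // no sampling statement anywhere below: the implied prior is
  // uniform on (0, infinity), which is improper
  real<lower=0> u;
}
model {
  // nothing here mentions u
}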

Oh the prior on z (to the best of my knowledge) is the truncated normal, parametrized with beta and the Cholesky root. In that sense, the normal for beta and the chol(LKJ) are hierarchical priors. There is no distributional change from z_pos/z_neg to z, they are simply reorganized into one joint matrix.

The model block above is exactly equivalent to the original version in the docs:

model {
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  {
    array[N] vector[D] beta_x;
    for (n in 1:N) {
      beta_x[n] = beta * x[n];
    }
    z ~ multi_normal_cholesky(beta_x, L_Omega);
  }
}

This does not match the model in your original post, where z is not part of a sampling statement, which would default it to a uniform prior.

z is packed with the X matrix and passed to the partial_sum_lpdf function. The statement multi_normal_cholesky_lupdf(comb[ : , : , 1] | xb, lchol) then computes the likelihood contribution in parallel, which is passed back and added to target. The comb[:,:,1] statement extracts the portion of z that the individual worker is supposed to handle. Both versions recover the parameters of a simulated dataset well.

The equivalence of target += dist_lupdf(var | ...) and var ~ dist(...) I trust you are familiar with. If your general idea holds true in a different or broader context, I would very much appreciate your thoughts.
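Just so we mean the same thing, written out for a single observation (illustrative names; xb stands for the linear predictor as computed in partial_sum_lpdf above):

// either statement adds the same unnormalized log density to target
z[n] ~ multi_normal_cholesky(xb[n], L_Omega);
// ... is equivalent to ...
target += multi_normal_cholesky_lupdf(z[n] | xb[n], L_Omega);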

z_pos and z_neg still need to be drawn from uniform half-line distributions, notwithstanding that the likelihood function depends on the z parameters and the x data.

Wouldn’t your statement apply as well to the original version using z ~ multi_normal_cholesky(beta_x, L_Omega) ?

Do you mean to say that both the original and my version imply a uniform prior on z_pos and z_neg or is it that z ~ multi_normal... is not equivalent to target += ...? If the latter is the case, can you elaborate why you suppose that is?

z, the transformed parameter, would be constrained, but z_pos and z_neg would still be drawn from a uniform half-line (i.e., an improper prior) and, together, have the same number of elements as z.

I do not think this is accurate. If a distribution is imposed on a transformed or constrained parameter (as it is for z), then I see no reason why the original/unconstrained parameter would not be sampled from the same distribution subject to the Jacobian adjustment (computed via autodiff). If it really were implemented such that a uniform prior would be assumed for any transformed parameter that a distribution is imposed on directly, then surely the docs for a change of variable would mention this. I’m happy to change my mind about this if you can provide a reference to the docs or the backend implementation.

I do agree that if no distribution is given for a parameter, then a uniform prior is the default (although I’d also appreciate a reference to the docs for that), but I assume that you would not want to extend your statement to a structure like the one below, which in essence captures what happens in the model.

parameters{
  real x;
}
transformed parameters{
  real y = x;
}
model{
  y ~ normal();
}

There is no sensible reason why the Stan devs would not use the distribution induced on x via the one imposed on y to sample x.

@ssp3nc3r Sorry for the rewrite, but as in my previous four replies, the questions were more Socratic in nature, as I did not mean to be too upfront about this.

In this example, x is y, which is given an incomplete normal prior (did you mean std_normal()?). These are easy questions to test. Just run the code and graph x to confirm.

I’m not an expert on the Stan backend, and I do not know the answer to your question. However, theoretically, a likelihood imposes some constraint on the draws from a distribution, but the priors feeding into that likelihood can still be unbounded uniform distributions, which are still improper priors, regardless of the data-driven portion of the model. Unless there is some serious issue with your code that I have not found, I do not see why reduce_sum would not be working here, but I do suspect that the many unparallelized operations in your code are eclipsing it. Unfortunately, there is no deeper documentation for me to refer to, and I defer to the Stan devs at this point. Perhaps they can see something that I can’t.