Map_rect in the Stan language (MPI is coming)

We’re getting really close to releasing a version of Stan with multi-core support for parallel likelihood evaluations. There’s a code-complete, tested branch that works. I’m finishing up the doc now and should have a pull request very soon. I just wanted to share the first working example, which translates a simple fixed-size logistic regression

data {
  int y[12];
  real x[12];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(beta[1] + beta[2] * to_vector(x));
}

to the following fixed-size/fixed-shard-size mapped version:

functions {
  // logistic regression lpmf
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | beta[1] + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  // N = 12 data points
  int y[12];
  real x[12];
}
transformed data {
  // K = 3 shards
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
  vector[0] theta[3];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ normal(0, 1);
  target += sum(map_rect(lr, beta, theta, xs, ys));
}

As with the ODE integrator and the algebraic solver, dealing with the types and the dummy arguments needed to pack and unpack everything is clunky.

I’m already seeing that it’d be nice to have a version of map_rect that maps a function returning a single real value to avoid that final bit of syntactic cruft in the return, [lp]'. That should be very easy to build using a simple adaptor.
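In the meantime, one can get most of the way there at the user level by writing the shard function to return a real and wrapping it in a one-line packing function. A minimal sketch (lr_lp and the wrapper are just an illustration, not the built-in adaptor):

functions {
  // scalar-returning shard function: just the log likelihood for one shard
  real lr_lp(vector beta, vector theta, real[] x, int[] y) {
    return bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
  }
  // one-line wrapper that packs the scalar into the vector map_rect expects
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    return [lr_lp(beta, theta, x, y)]';
  }
}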


Really great to see the progress!

map_rect has been designed with hierarchical models in mind, so these types of models look a bit more natural.

Adding a map_rect_lpdf (or in this case map_rect_lpmf) would make sense and would even allow for some more tweaks, as the reduction step could then happen locally on the workers. This would come at the price of losing exact reproducibility, since floating-point arithmetic depends on the order of execution, which we would change… but to emphasize: the current map_rect gives exactly the same results, just a lot faster if the problem scales well.

The packing/unpacking business can be remedied a lot with some utility functions; a sketch of what such helpers might look like follows this list. Maybe this calls for a case study which has these in it. We also need to teach users what limits performance and what does not:

  • always try to return the lpdf or lpmf value if possible (this reduces backward communication)
  • avoid assigning the x_r or x_i data to variables inside the function; rather, subset the data directly. Otherwise one ends up promoting data to var variables, which is never a good idea.

And there is presumably more.
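For instance, here is a rough sketch of what such packing helpers could look like (shard_real and shard_int are made-up names, and this assumes equal shard sizes):

functions {
  // cut a vector into K consecutive shards of J elements each
  real[,] shard_real(vector v, int K, int J) {
    real out[K, J];
    for (k in 1:K)
      out[k] = to_array_1d(v[((k - 1) * J + 1):(k * J)]);
    return out;
  }
  // same for an integer array
  int[,] shard_int(int[] v, int K, int J) {
    int out[K, J];
    for (k in 1:K)
      out[k] = v[((k - 1) * J + 1):(k * J)];
    return out;
  }
}

With helpers like these, the packing loop in the transformed data block of the hierarchical example further down would reduce to x_r = shard_real(x, K, J); and x_i = shard_int(y, K, J);.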

BTW, while it will be a pain to “squeeze” a model through map_rect due to the packing/unpacking, one even gets speedups on a single core for large models, as the overall scalability is apparently improved by the evaluation scheme used in map_rect.


Great to see that Stan can soon parallelize computations!
Quick question: Could you explain what the function of theta is?

The thetas are job-specific parameters, which Bob did not use in this example, but map_rect expects an array of vectors of job-specific parameters as input. The array must have one entry per job, and there are 3 jobs in this example. As the model does not need job-specific parameters, each vector has length 0.

In a hierarchical model you would use theta to transfer the group-specific parameters if you parallelize over the groups of your hierarchical model, which usually makes a lot of sense.
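To make that concrete, here is a minimal sketch (a made-up model, not Bob’s code) of the first example with one intercept alpha[k] per shard passed to the workers through theta, while the slope stays shared:

functions {
  // theta[1] is this job's intercept; beta[1] is the slope shared by all jobs
  vector lr_hier(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | theta[1] + beta[1] * to_vector(x));
    return [lp]';
  }
}
data {
  int y[12];
  real x[12];
}
transformed data {
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
}
parameters {
  real alpha[3];    // one intercept per shard/group
  vector[1] beta;   // shared slope
}
model {
  vector[1] theta[3];
  for (k in 1:3)
    theta[k][1] = alpha[k];   // pack the job-specific parameters
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  target += sum(map_rect(lr_hier, beta, theta, xs, ys));
}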

And it’d just use ~ or target += internally? We’d then have to buffer up those target contributions, which wouldn’t be too hard—the basic implementation is an accumulator that then returns an efficient sum at the end.

Exactly why that’s the second example I coded up for the manual.

Here’s the basic version:

data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  matrix[K, 2] beta;
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (j in 1:2)
    beta[, j] ~ normal(mu[j], sigma[j]);
  y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}

And here’s the mapped version:

functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];        // unpack the shared hyperparameters
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);   // group-level prior for this shard's beta
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));  // shard likelihood
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));
}

I mapped the priors, but that’s not necessary—that could be done at the top level and it would cut down on communication but increase non-parallel calculation.
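For concreteness, here is a sketch of that alternative: the priors stay in the model block and only the likelihood is mapped, so nothing needs to be sent as shared parameters (bl_lik and phi are made-up names):

functions {
  // likelihood-only shard function; phi is an unused, empty shared-parameter vector
  vector bl_lik(vector phi, vector beta, real[] x, int[] y) {
    return [bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x))]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  vector[0] phi;   // nothing shared has to be shipped to the workers
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (k in 1:K)
    beta[k] ~ normal(mu, sigma);   // priors evaluated once at the top level
  target += sum(map_rect(bl_lik, phi, beta, x_r, x_i));
}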

This is also still hugely simplifying by assuming equal numbers of observations in each group and a single return value.
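One common way to relax the equal-group-size assumption is to pad every shard to the largest group size and pass the true number of observations as the first entry of x_i. Only a sketch of the shard function is shown here (bl_glm_ragged is a made-up name, and the transformed data block would have to do the padding):

functions {
  // ragged-shard sketch: y[1] holds the true number of observations, the rest is padding
  vector bl_glm_ragged(vector mu_sigma, vector beta,
                       real[] x, int[] y) {
    int n = y[1];
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y[2:(n + 1)]
                                   | beta[1] + beta[2] * to_vector(x[1:n]));
    return [lp + ll]';
  }
}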


I assume make + mpiexec is how one can test this example?

Maybe it’s already fixed, but there’s some runaway typesetting in the manual (pp. 332, 334) in this branch.

The language component of MPI has been merged already, so we’ll have to fix any manual errors in another PR.

Hi @wds15, @Bob_Carpenter, I’m not quite sure I understand the behavior of the test model in an MPI run. What’s the relationship between the workload distribution and the number of chains, the number of processes (mpiexec -n), and the number of shards?

The MPI code is meant to run within a single chain only. If you have W workers allocated and pass N jobs to map_rect, then each worker will receive a chunk of N/W jobs.

I am a bit surprised to see this many questions on MPI from your end… I thought you reviewed this a while ago (no?).

What are shards?

I guess I’m either confused by the cmdstan output or the nomenclature (turns out you didn’t use this shards term either). When I run mpiexec -n 9 on the test model, I see 9 chains in the output. What information does map_rect spit out to let a user know that they’re running with W workers on N jobs?

cmdstan is not yet prepared to run with MPI!

Currently there is no output anywhere (well, on the prototypes I used to play with I had debug output), but including this info in the cmdstan startup message is probably a good thing to do.

Somehow I took “MPI is ready” to mean ready for all Stan components…

Did I write that? What is ready to be reviewed is the MPI map_rect PR in stan-math. When that is in, cmdstan should be simple to address.

I hope that @Bob_Carpenter gets along with 1k lines of code…

I was referring to the title of this post.

Dear Bob,

Thanks for this working example of map_rect for a hierarchical model! I’m currently experimenting a lot with map_rect and I was wondering if we could still write the hierarchy for beta using non-centered parameterization?

Kind regards,

Douwe

Sure—just pass the transformed parameters or do the transform as part of the map.

Thanks for your quick reply! Indeed passing the transformed parameters to map_rect would work, but you will cut down on parallel computation in this case, as you mentioned. I was struggling, however, with doing the transform as part of the map. Taking your example of the hierarchical model where the contribution of beta to the log probability is given by normal_lpdf(beta | mu, sigma), how do I implement Matt’s trick in the function block? Should I write normal_lpdf(beta_raw .* sigma + mu | mu, sigma) and bernoulli_logit_lpmf(y | beta_raw[1] .* sigma[1] + mu[1] + (beta_raw[2] .* sigma[2] + mu[2]) * to_vector(x)) in the function block, where beta_raw ~ normal(0,1) is in the model block, and construct beta = beta_raw * sigma + mu in generated quantities, for example? Would this be correct? If yes, does it make sense to do it this way?

Yes, that’s the way to code it. It’s unclear to me why you’d scale beta_raw by sigma.

Also, if beta_raw is organized by group, then you can provide the prior in the function that’s getting parallelized, too. Or you can separately parallelize the prior if it’s expensive.

The generated quantities calculations are relatively cheap, as they are double-based, don’t involve gradients, and are only done once per iteration (as opposed to the log density and gradient, which are evaluated at each leapfrog step in the Hamiltonian simulation within an iteration).
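Putting the pieces together, here is a sketch of such a non-centered variant (bl_glm_nc is a made-up name; the prior on beta_raw is kept inside the map and beta is recovered in generated quantities):

functions {
  // beta_raw is the standardized group-level parameter; the shift/scale by
  // mu and sigma happens inside the mapped function
  vector bl_glm_nc(vector mu_sigma, vector beta_raw,
                   real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    vector[2] beta = mu + sigma .* beta_raw;    // non-centering transform
    real lp = normal_lpdf(beta_raw | 0, 1);     // prior on the raw scale
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta_raw[K];   // standardized group-level coefficients
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm_nc, append_row(mu, sigma),
                         beta_raw, x_r, x_i));
}
generated quantities {
  // recover the centered coefficients once per iteration
  vector[2] beta[K];
  for (k in 1:K)
    beta[k] = mu + sigma .* beta_raw[k];
}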

Thanks again for replying so fast. beta_raw is the parameter with a standard normal density. In your example, beta is normal with mean mu and scale sigma, so I construct beta in a non-centered fashion by scaling beta_raw by sigma and adding mu. Thus by writing normal_lpdf(beta_raw .* sigma + mu | mu, sigma), I intended to write your hierarchical example in non-centered parameterized form.