Map_rect in the Stan language (MPI is coming)

We’re getting really close to releasing a version of Stan with multi-core support for parallel likelihood evaluations. There’s a code-complete, tested branch that works. I’m finishing up the doc now and should have a pull request very soon. I just wanted to share the first working example, which translates a simple fixed-size logistic regression

data {
  int y[12];
  real x[12];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ normal(0, 1);
  y ~ bernoulli_logit(beta[1] + beta[2] * to_vector(x));
}

to the following fixed-size/fixed-shard-size mapped version:

functions {
  // logistic regression lpmf
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | beta[1] + to_vector(x) * beta[2]);
    return [lp]';
  }
}
data {
  // N = 12 data points
  int y[12];
  real x[12];
}
transformed data {
  // K = 3 shards
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
  vector[0] theta[3];
}
parameters {
  vector[2] beta;
}
model {
  beta ~ normal(0, 1);
  target += sum(map_rect(lr, beta, theta, xs, ys));
}

As with the ODE integrator and the algebraic solver, dealing with the types and the dummy arguments needed to pack and unpack everything is clunky.

I’m already seeing that it’d be nice to have a version of map_rect that maps a function returning a single real value to avoid that final bit of syntactic cruft in the return, [lp]'. That should be very easy to build using a simple adaptor.
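In the meantime, one can get most of the way there at the user level by writing the shard function to return a real and wrapping it in a one-line packing function. A minimal sketch (lr_lp and the wrapper are just an illustration, not the built-in adaptor):

functions {
  // scalar-returning shard function: just the log likelihood for one shard
  real lr_lp(vector beta, vector theta, real[] x, int[] y) {
    return bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
  }
  // one-line wrapper that packs the scalar into the vector map_rect expects
  vector lr(vector beta, vector theta, real[] x, int[] y) {
    return [lr_lp(beta, theta, x, y)]';
  }
}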


Really great to see the progress!

map_rect has been designed with hierarchical models in mind, so these types of models look a bit more natural.

Adding a map_rect_lpdf (or in this case map_rect_lpmf) would make sense and would even allow for some more tweaks, as the reduction step could then happen locally on the workers. This would come at the price of losing exact reproducibility, since floating-point arithmetic depends on the order of execution, which we would change… but to emphasize: the current map_rect gives exactly the same results, just a lot faster if the problem scales well.

The packing/unpacking business can be remedied a lot with some utility functions; a sketch of what such helpers might look like follows this list. Maybe this calls for a case study which has these in it. We also need to teach users what limits performance and what does not:

  • always try to return the lpdf or lpmf value if possible (this reduces backward communication)
  • avoid assigning the x_r or x_i data to variables inside the function; rather, subset the data directly. Otherwise one ends up promoting data to var variables, which is never a good idea.

And there is presumably more.
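For instance, here is a rough sketch of what such packing helpers could look like (shard_real and shard_int are made-up names, and this assumes equal shard sizes):

functions {
  // cut a vector into K consecutive shards of J elements each
  real[,] shard_real(vector v, int K, int J) {
    real out[K, J];
    for (k in 1:K)
      out[k] = to_array_1d(v[((k - 1) * J + 1):(k * J)]);
    return out;
  }
  // same for an integer array
  int[,] shard_int(int[] v, int K, int J) {
    int out[K, J];
    for (k in 1:K)
      out[k] = v[((k - 1) * J + 1):(k * J)];
    return out;
  }
}

With helpers like these, the packing loop in the transformed data block of the hierarchical example further down would reduce to x_r = shard_real(x, K, J); and x_i = shard_int(y, K, J);.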

BTW, while it will be a pain to “squeeze” a model through map_rect due to the packing/unpacking, one even gets speedups on a single core for large models, as the overall scalability is apparently improved by the evaluation scheme used in map_rect.


Great to see that Stan can soon parallelize computations!
Quick question: Could you explain what the function of theta is?

The thetas are job-specific parameters, which Bob did not use in this example, but map_rect expects an array of vectors of job-specific parameters as input. The array must have one entry per job, and there are 3 jobs in this example. As the model does not need job-specific parameters, each vector has length 0.

In a hierarchical model you would use theta to transfer the group-specific parameters if you parallelize over the groups of your hierarchical model, which usually makes a lot of sense.
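To make that concrete, here is a minimal sketch (a made-up model, not Bob’s code) of the first example with one intercept alpha[k] per shard passed to the workers through theta, while the slope stays shared:

functions {
  // theta[1] is this job's intercept; beta[1] is the slope shared by all jobs
  vector lr_hier(vector beta, vector theta, real[] x, int[] y) {
    real lp = bernoulli_logit_lpmf(y | theta[1] + beta[1] * to_vector(x));
    return [lp]';
  }
}
data {
  int y[12];
  real x[12];
}
transformed data {
  int ys[3, 4] = { y[1:4], y[5:8], y[9:12] };
  real xs[3, 4] = { x[1:4], x[5:8], x[9:12] };
}
parameters {
  real alpha[3];    // one intercept per shard/group
  vector[1] beta;   // shared slope
}
model {
  vector[1] theta[3];
  for (k in 1:3)
    theta[k][1] = alpha[k];   // pack the job-specific parameters
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  target += sum(map_rect(lr_hier, beta, theta, xs, ys));
}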

And it’d just use ~ or target += internally? We’d then have to buffer up those target contributions, which wouldn’t be too hard—the basic implementation is an accumulator that then returns an efficient sum at the end.

Exactly why that’s the second example I coded up for the manual.

Here’s the basic version:

data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  matrix[K, 2] beta;
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (j in 1:2)
    beta[, j] ~ normal(mu[j], sigma[j]);
  y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}

And here’s the mapped version:

functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];        // unpack the shared hyperparameters
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);   // group-level prior for this shard's beta
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));  // shard likelihood
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma),
                         beta, x_r, x_i));
}

I mapped the priors, but that’s not necessary—that could be done at the top level and it would cut down on communication but increase non-parallel calculation.
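For concreteness, here is a sketch of that alternative: the priors stay in the model block and only the likelihood is mapped, so nothing needs to be sent as shared parameters (bl_lik and phi are made-up names):

functions {
  // likelihood-only shard function; phi is an unused, empty shared-parameter vector
  vector bl_lik(vector phi, vector beta, real[] x, int[] y) {
    return [bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x))]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  vector[0] phi;   // nothing shared has to be shipped to the workers
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (k in 1:K)
    beta[k] ~ normal(mu, sigma);   // priors evaluated once at the top level
  target += sum(map_rect(bl_lik, phi, beta, x_r, x_i));
}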

This is also still hugely simplifying by assuming equal numbers of observations in each group and a single return value.
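One common way to relax the equal-group-size assumption is to pad every shard to the largest group size and pass the true number of observations as the first entry of x_i. Only a sketch of the shard function is shown here (bl_glm_ragged is a made-up name, and the transformed data block would have to do the padding):

functions {
  // ragged-shard sketch: y[1] holds the true number of observations, the rest is padding
  vector bl_glm_ragged(vector mu_sigma, vector beta,
                       real[] x, int[] y) {
    int n = y[1];
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y[2:(n + 1)]
                                   | beta[1] + beta[2] * to_vector(x[1:n]));
    return [lp + ll]';
  }
}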


I assume make + mpiexec is how one can test this example?

Maybe it’s already fixed, but there’s some runaway typesetting in the manual (pp. 332, 334) in this branch.

The language component of MPI has been merged already, so we’ll have to fix any manual errors in another PR.

Hi @wds15, @Bob_Carpenter, I’m not quite sure I understand the behavior of the test model in an MPI run. What’s the relationship between the workload distribution and the number of chains, the number of processes (mpiexec -n), and the number of shards?

The MPI code is meant to run within a single chain only. If you have W workers allocated and pass N jobs to map_rect, then each worker will receive a chunk of N/W jobs.

I am a bit surprised to see this many questions on MPI from your end… I thought you reviewed this a while ago (no?).

What are shards?

I guess I’m either confused by the cmdstan output or the nomenclature (turns out you didn’t use this shards term either). When I run mpiexec -n 9 on the test model, I see 9 chains in the output. What information does map_rect spit out to let a user know that they’re running with W workers on N jobs?

cmdstan is not yet prepared to run with MPI!

Currently there is no output anywhere (well, on the prototypes I used to play with I had debug output), but including this info in the cmdstan startup message is probably a good thing to do.

Somehow I took “MPI is ready” to mean ready for all Stan components…

Did I write that? What is ready to be reviewed is the MPI map_rect PR in stan-math. When that is in, cmdstan should be simple to address.

I hope that @Bob_Carpenter gets along with 1k lines of code…

I was referring to the title of this post.

Dear Bob,

Thanks for this working example of map_rect for a hierarchical model! I’m currently experimenting a lot with map_rect and I was wondering if we could still write the hierarchy for beta using non-centered parameterization?

Kind regards,

Douwe

Sure—just pass the transformed parameters or do the transform as part of the map.

Thanks for your quick reply! Indeed passing the transformed parameters to map_rect would work, but you will cut down on parallel computation in this case, as you mentioned. I was struggling, however, with doing the transform as part of the map. Taking your example of the hierarchical model where the contribution of beta to the log probability is given by normal_lpdf(beta | mu, sigma), how do I implement Matt’s trick in the function block? Should I write normal_lpdf(beta_raw .* sigma + mu | mu, sigma) and bernoulli_logit_lpmf(y | beta_raw[1] .* sigma[1] + mu[1] + (beta_raw[2] .* sigma[2] + mu[2]) * to_vector(x)) in the function block, where beta_raw ~ normal(0,1) is in the model block, and construct beta = beta_raw * sigma + mu in generated quantities, for example? Would this be correct? If yes, does it make sense to do it this way?

Yes, that’s the way to code it. It’s unclear to me why you’d scale beta_raw by sigma.

Also, if beta_raw is organized by group, then you can provide the prior in the function that’s getting parallelized, too. Or you can separately parallelize the prior if it’s expensive.

The generated quantities calculations are relatively cheap, as they are double-based, don’t involve gradients, and are only done once per iteration (as opposed to the log density and gradient, which are evaluated at each leapfrog step in the Hamiltonian simulation within an iteration).
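Putting the pieces together, here is a sketch of such a non-centered variant (bl_glm_nc is a made-up name; the prior on beta_raw is kept inside the map and beta is recovered in generated quantities):

functions {
  // beta_raw is the standardized group-level parameter; the shift/scale by
  // mu and sigma happens inside the mapped function
  vector bl_glm_nc(vector mu_sigma, vector beta_raw,
                   real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    vector[2] beta = mu + sigma .* beta_raw;    // non-centering transform
    real lp = normal_lpdf(beta_raw | 0, 1);     // prior on the raw scale
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = y[pos:end];
      pos += J;
    }
  }
}
parameters {
  vector[2] beta_raw[K];   // standardized group-level coefficients
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm_nc, append_row(mu, sigma),
                         beta_raw, x_r, x_i));
}
generated quantities {
  // recover the centered coefficients once per iteration
  vector[2] beta[K];
  for (k in 1:K)
    beta[k] = mu + sigma .* beta_raw[k];
}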

Thanks again for replying so fast. beta_raw is the parameter with a standard normal density. In your example, beta is normal with mean mu and scale sigma, so I construct beta in a non-centered fashion by scaling beta_raw by sigma and adding mu. Thus by writing normal_lpdf(beta_raw .* sigma + mu | mu, sigma), I intended to write your hierarchical example in non-centered parameterized form.