Practical questions on map_rect usage

I have few questions about the usage of map_rect (with threading, not MPI yes)

  1. Is it fair to say that the amout of shards should be = to the amount of threads for optimal performances (and not higher)?
  2. If I have my Makevars as
CXX14FLAGS += -pthread

Do the “thread flags” slow down NON thread models (models where map_rect is not present)?

  1. Should I have just one call of map_rect per model?

  2. global parameters are better processed outside map_rect in a vectorised fashion?

  3. Could I use map_rect for speeding up mixture models, that so far are not vectorised ?

  4. The data and paramaters should be spread equally across shards. Of course if N is a prime number this is not possible. In my case I fill the missing values (either in data or parameters) with 0 and track the size in an array. In this case some thread is doing more work than others. Is this likely to kill the performances? On this non squared design I see poor performances.


  1. This is a good start. If you are certain that each block always needs the same amount of CPU work, then this is probably the best thing to do… but if the workload varies from draw to draw between the shards in extreme ways, then this may not be ideal (it would be hard to handle this right now anyway).

  2. If you put -DSTAN_THREADS then you will always hurt your performance by about ~20% right now - even if you do not use map_rect. This will hopefully get better in the future… but for now you really want to make that choice of using threads only for those models where you use map_rect.

  3. That is probably the best strategy to have a single map_rect call. You can have multiple (even nested), but I would doubt that this helps in performance … but it may help you to code up the model easier.

  4. Anything which you can vectorise - do so! Giving up vectorisation for map_rect is a bad idea right now. I can only imagine that splitting huge expressions up into smaller slices can eventually make up for the price to pay for giving up some vectorisation.

  5. I haven’t thought about this, but probably yes (if each mixture model component is expensive to calculate)

  6. If you give a prime number of shards to map_rect it will automatically give the root node a little less work. I don’t think that you need to worry about his too much.

I hope this helps.


I guess the implicit alternative is to build many more shards than threads (?)

For now it does not make a difference as we chunk all shards into the number of threads you use. That will change hopefully at some point when we have more clever queuing. I am honestly surprised about the sharding question… I would rather concentrate on the model in most cases. If sharding would be very important then the model probably needs other tweaks…

The model with one thread behaves well, and run rather faster than using 8 threads. I’m trying to identify the problem, maybe due to the fact that not having a “square” data structure but rather a “tidy” one (the number of observation is not the same for each paramater). Tha’s why I was asking if I have N threads whats the best/safest/most logical number of shards.

Maybe there is some inefficiency I’m not seing in producing data structures

For completeness


	 vector lp_reduce( vector global_parameters , vector local_parameters , real[] xr , int[] xi ) {
	 	int M = xi[1];
	 	int N = xi[2];
	 	int G_per_shard = xi[3];
	 	int symbol_start[M+1] = xi[(3+1):(3+1+M)];
	 	int sample_idx[N] = xi[(3+1+M+1):(3+1+M+1+N-1)];
	 	int counts[N] = xi[(3+1+M+1+N):size(xi)];

	 	vector[G_per_shard] lambda_MPI = local_parameters[1:G_per_shard];
	 	vector[G_per_shard] sigma_MPI = local_parameters[(G_per_shard+1):(G_per_shard*2)];
	 	vector[G_per_shard] lp;

	 for(g in 1:G_per_shard){
	 	lp[g] =  neg_binomial_2_log_lpmf(counts[symbol_start[g]:symbol_start[g+1]] | lambda_MPI[g], sigma_MPI[g]);

    return [sum(lp)]';

data {
  int<lower=0> N;
  int<lower=0> M;
	int<lower=0> G;
	int<lower=0> S;
  int n_shards;
	int<lower=0> counts[n_shards, N];
	int<lower=0> symbol_start[n_shards, M+1];
	int<lower=0> sample_idx[n_shards, N];
	int<lower=0> G_per_shard[n_shards];
	int<lower=0> G_per_shard_idx[n_shards + 1];

transformed data {
  vector[0] global_parameters;
  real xr[n_shards, 0];

  int<lower=0> int_MPI[n_shards, 3+(M+1)+N+N];

  // Shards - MPI
  for ( i in 1:n_shards ) {
  int M_N_Gps[3];
  M_N_Gps[1] = M;
  M_N_Gps[2] = N;
  M_N_Gps[3] = G_per_shard[i];

  int_MPI[i,] = append_array(append_array(append_array(M_N_Gps, symbol_start[i]), sample_idx[i]), counts[i]);
parameters {
  // Overall properties of the data
  real lambda_mu; // So is compatible with logGamma prior
  real<lower=0> lambda_sigma;
  real lambda_skew;
  vector[S] exposure_rate;

  // Gene-wise properties of the data
  vector[G] lambda;
  vector<lower=0>[G] sigma_raw;

  // Signa linear model

  real<upper=0> sigma_slope;
  real sigma_intercept;

transformed parameters {
  // Sigma
  vector[G] sigma = 1.0 ./ sigma_raw;

// Shards - MPI
	vector[2*M] lambda_sigma_MPI[n_shards];
	for( i in 1:(n_shards) ) {

	  vector[ (M*2) - (G_per_shard[i]*2) ] buffer = rep_vector(0.0,(M*2) - (G_per_shard[i]*2));

		lambda_sigma_MPI[i] = append_row(append_row(
		), buffer);


model {

  // Overall properties of the data
int i = 1;
  lambda_mu ~ normal(0,2);
  lambda_sigma ~ normal(0,2);
  lambda_skew ~ normal(0,2);

  //sigma_raw ~ normal(0,1);
  exposure_rate ~ normal(0,1);
  sum(exposure_rate) ~ normal(0, 0.001 * S);

  sigma_intercept ~ normal(0,2);
  sigma_slope ~ normal(0,2);
  sigma_sigma ~ normal(0,2);

  // Gene-wise properties of the data
  // lambda ~ normal_or_gammaLog(lambda_mu, lambda_sigma, is_prior_asymetric);
  lambda ~  skew_normal(lambda_mu,lambda_sigma, lambda_skew);
  sigma_raw ~ lognormal(sigma_slope * lambda + sigma_intercept,sigma_sigma);

	// Gene-wise properties of the data
	target += sum( map_rect( lp_reduce , global_parameters , lambda_sigma_MPI , xr , int_MPI ) );