Map_rect, multithreaded on parameters only

grburgess · November 12, 2018, 3:08pm

I’m playing with using the the map_rect for a sum marginalization where there are no real data in the argument. This is effectively a sum over a bunch of latent parameters that I can easily break into shards and add the computed density onto the target at the end.

The results of the model are the same, but the model implementing map_rect with mutliple threads is much slower. I can post an example, but I’m curious if this behavior is expected when the operation is only on parameters. I’ve run all the toy models and seen the impressive speed up, so I’m guessing that this is just a model specific issue.

Cheers

wds15 · November 12, 2018, 3:22pm

No, this behavior is not per se expected. It really depends on how expensive each task is. It’s hard to say anything without seeing more detail. One thing which you always should make sure is to only return a single log-lik contribution from each job to cut down on the communication.

grburgess · November 12, 2018, 4:00pm

Cool. Well, without going into too much detail, the function to map is:

  vector mix_reduce(vector beta, vector theta, real [] xr, int [] xi) {


    int M_local = num_elements(theta)/3;
    real out = 0;
    
    vector[M_local] log_lum_latent_tilde;
    vector[M_local] flux_tilde;
    vector[M_local] z_tilde;
    vector[M_local] log_flux_latent_tilde;

    real Lambda = beta[1];
    real Lambda0 = beta[2];

    real flux_sigma = xr[1];
    real boundary = xr[2];
    real z_max = xr[3];
    
    log_lum_latent_tilde = theta[1:M_local];
    flux_tilde = theta[(M_local+1):2*M_local];
    z_tilde = theta[(2*M_local +1):3*M_local];
    
    log_flux_latent_tilde = log_lum_latent_tilde - 2*log10(z_tilde+1.)  - log10( 4 * pi());
    
    for (m in 1:M_local) {
      
      out += log_sum_exp(log(4*pi())
			  + 2*log(z_tilde[n])
			  + log(Lambda)
			  + normal_lpdf(log10(flux_tilde[n]) | log_flux_latent_tilde[n], flux_sigma) + log(0.434294/flux_tilde[n]),

			  log(Lambda0)
			  + uniform_lpdf(z_tilde[n]| 0, z_max ) 
			  + uniform_lpdf(flux_tilde[n]| 0, boundary ) 
			  );

    
    }

    return [out]';


  }

and is handled in the model block as:

 // Measurement model for auxiliary objects
  log_lum_latent_tilde ~ normal(mu, tau); 
  

  target += sum( map_rect( mix_reduce, beta , theta , xr , xi ) );

and I transform the parameters in the transform block:

transformed parameters {

  vector[N] log_flux_latent;
  vector[3*group_size] theta[n_shards];
  vector[2] beta;


  beta[1] = Lambda;
  beta[2] = Lambda0;
  
  
  for (n in 1:n_shards) {

    int j = 1 + (n-1)*group_size;
    int k = n*group_size;
    
    theta[n,1:group_size] = log_lum_latent_tilde[j:k];
    theta[n,(group_size+1):2*group_size] = flux_tilde[j:k];
    theta[n,(2*group_size+1):3*group_size] = z_tilde[j:k];

  }
  
  
  
}

There may be more efficient ways of doing this, but I’m still experimenting with this.

Topic		Replies	Views
Map_rect causing substantial slowdown; trying to understand how to fix Modeling cmdstanr , paralellization	13	870	July 24, 2020
Practical questions on map_rect usage General performance	4	793	February 26, 2019
Reduce_sum() using only one thread Interfaces	2	425	June 10, 2020
Multiple calls to map_rect Modeling performance	4	522	May 7, 2019
Catastrophic performance drop with map_rect Modeling paralellization	4	536	May 9, 2023

Map_rect, multithreaded on parameters only

Related topics