Map_rect, multithreaded on parameters only

I’m playing with using the the map_rect for a sum marginalization where there are no real data in the argument. This is effectively a sum over a bunch of latent parameters that I can easily break into shards and add the computed density onto the target at the end.

The results of the model are the same, but the model implementing map_rect with mutliple threads is much slower. I can post an example, but I’m curious if this behavior is expected when the operation is only on parameters. I’ve run all the toy models and seen the impressive speed up, so I’m guessing that this is just a model specific issue.

Cheers

No, this behavior is not per se expected. It really depends on how expensive each task is. It’s hard to say anything without seeing more detail. One thing which you always should make sure is to only return a single log-lik contribution from each job to cut down on the communication.

Cool. Well, without going into too much detail, the function to map is:

  vector mix_reduce(vector beta, vector theta, real [] xr, int [] xi) {


    int M_local = num_elements(theta)/3;
    real out = 0;
    
    vector[M_local] log_lum_latent_tilde;
    vector[M_local] flux_tilde;
    vector[M_local] z_tilde;
    vector[M_local] log_flux_latent_tilde;

    real Lambda = beta[1];
    real Lambda0 = beta[2];

    real flux_sigma = xr[1];
    real boundary = xr[2];
    real z_max = xr[3];
    
    log_lum_latent_tilde = theta[1:M_local];
    flux_tilde = theta[(M_local+1):2*M_local];
    z_tilde = theta[(2*M_local +1):3*M_local];
    
    log_flux_latent_tilde = log_lum_latent_tilde - 2*log10(z_tilde+1.)  - log10( 4 * pi());
    
    for (m in 1:M_local) {
      
      out += log_sum_exp(log(4*pi())
			  + 2*log(z_tilde[n])
			  + log(Lambda)
			  + normal_lpdf(log10(flux_tilde[n]) | log_flux_latent_tilde[n], flux_sigma) + log(0.434294/flux_tilde[n]),

			  log(Lambda0)
			  + uniform_lpdf(z_tilde[n]| 0, z_max ) 
			  + uniform_lpdf(flux_tilde[n]| 0, boundary ) 
			  );

    
    }

    return [out]';


  }

and is handled in the model block as:

 // Measurement model for auxiliary objects
  log_lum_latent_tilde ~ normal(mu, tau); 
  

  target += sum( map_rect( mix_reduce, beta , theta , xr , xi ) );


and I transform the parameters in the transform block:

transformed parameters {

  vector[N] log_flux_latent;
  vector[3*group_size] theta[n_shards];
  vector[2] beta;


  beta[1] = Lambda;
  beta[2] = Lambda0;
  
  
  for (n in 1:n_shards) {

    int j = 1 + (n-1)*group_size;
    int k = n*group_size;
    
    theta[n,1:group_size] = log_lum_latent_tilde[j:k];
    theta[n,(group_size+1):2*group_size] = flux_tilde[j:k];
    theta[n,(2*group_size+1):3*group_size] = z_tilde[j:k];

  }
  
  
  
}

There may be more efficient ways of doing this, but I’m still experimenting with this.