Efficient vector-matrix products for many matrices

I am fitting a model with many observations from several detectors. For each observation, a spectrum vector of size M is multiplied by the detector's response, a matrix of size MxN, producing a vector of size N that is then compared to the data.
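In notation (my shorthand: $\mathbf{s}_{n,m}$ for the spectrum, $R_{n,m}$ for the response, and $t_{n,m}$ for the exposure of interval $n$ and detector $m$), each expected-counts vector is

$$
\mathbf{c}_{n,m} = t_{n,m} \left( \mathbf{s}_{n,m}^{\top} R_{n,m} \right)^{\top}, \qquad \mathbf{s}_{n,m} \in \mathbb{R}^{M},\; R_{n,m} \in \mathbb{R}^{M \times N},
$$

with a different spectrum and a different response for every $(n, m)$ pair.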

In essence it looks like this (I've left out some details to highlight the main points):


int<lower=0> N_dets[N_intervals]; // number of detectors per data type
int<lower=0> N_chan[N_intervals, max(N_dets)]; // number of channels in each detector
int<lower=0> N_echan[N_intervals, max(N_dets)]; // number of energy-side channels in each detector

int grb_id[N_intervals]; // index to specify GRBs
int N_grbs; // number of GRBs

vector[max_n_echan] ebounds_hi[N_intervals, max(N_dets)]; // boundaries of the matrix for integration
vector[max_n_echan] ebounds_lo[N_intervals, max(N_dets)];

vector[max_n_chan] observed_counts[N_intervals, max(N_dets)]; // observed counts
vector[max_n_chan] background_counts[N_intervals, max(N_dets)]; // background counts
vector[max_n_chan] background_errors[N_intervals, max(N_dets)]; // error on the background

int idx_background_zero[N_intervals, max(N_dets), max_n_chan]; // where the background is zero
int idx_background_nonzero[N_intervals, max(N_dets), max_n_chan]; // where the background is non-zero
int N_bkg_zero[N_intervals, max(N_dets)]; // number of zero backgrounds
int N_bkg_nonzero[N_intervals, max(N_dets)]; // number of non-zero backgrounds

real exposure[N_intervals, max(N_dets)]; // exposure of the observation

matrix[max_n_echan, max_n_chan] response[N_intervals, max(N_dets)]; // response matrix per interval per detector



transformed parameters {

  real pre_calc[N_intervals, 4];
  vector[N_grbs] gamma;
  vector[N_grbs] delta;
  vector[N_intervals] log_energy_flux;
  vector[N_intervals] epeak;
  vector[max_n_chan] expected_model_counts[N_intervals, max(N_dets)];

  // non-centered parameterization
  gamma = gamma_mu + gamma_offset * gamma_sigma;
  delta = delta_mu + 52 + delta_offset * delta_sigma;

  // correlation between flux and epeak, treating GRBs as groups
  log_energy_flux = delta[grb_id] + gamma[grb_id] .* (log_epeak + log_zp1 - 2.) - log_dl2[grb_id];

  // compute the folded counts
  for (n in 1:N_intervals) {

    epeak[n] = 10^log_epeak[n];

    // pre-calculations are shared across detectors within an interval, so only do them once:
    // norm, ec, esplit, pre
    pre_calc[n, :] = band_precalculation(10^log_energy_flux[n], alpha[n], beta[n], epeak[n], emin, emax);

    for (m in 1:N_dets[n]) {

      // compute the latent spectrum that is folded through the response
      row_vector[N_echan[n, m]] latent_spectrum = to_row_vector(integral_flux(ebounds_lo[n, m, :N_echan[n, m]],
                                                                              ebounds_hi[n, m, :N_echan[n, m]],
                                                                              ebounds_add[n, m, :N_echan[n, m]],
                                                                              ebounds_half[n, m, :N_echan[n, m]],
                                                                              pre_calc[n, 1],
                                                                              pre_calc[n, 2],
                                                                              pre_calc[n, 3],
                                                                              alpha[n],
                                                                              beta[n],
                                                                              pre_calc[n, 4]));

      // expected counts for each interval/detector is a matrix-vector product
      expected_model_counts[n, m, :N_chan[n, m]] = latent_spectrum * response[n, m, :N_echan[n, m], :N_chan[n, m]]) * exposure[n, m])';
    }
  }
}




model {

  vector[N_total_channels] log_like; // storage for the log-likelihood
  int pos = 1;

  alpha ~ normal(-1, 0.5);
  beta ~ normal(-3, 1);
  log_epeak ~ normal(2., 1);

  gamma_mu ~ normal(1.5, 1);
  delta_mu ~ normal(0, 4);

  gamma_offset ~ std_normal();
  delta_offset ~ std_normal();

  gamma_sigma ~ normal(0, 5);
  delta_sigma ~ normal(0, 5);

  for (n in 1:N_intervals) {

    for (m in 1:N_dets[n]) {

      // compute the log-likelihood for each detector and interval
      log_like[pos : pos + N_channels_used[n, m] - 1] = pgstat(observed_counts[n, m, mask[n, m, :N_channels_used[n, m]]],
                                                               background_counts[n, m, mask[n, m, :N_channels_used[n, m]]],
                                                               background_errors[n, m, mask[n, m, :N_channels_used[n, m]]],
                                                               expected_model_counts[n, m, mask[n, m, :N_channels_used[n, m]]],
                                                               idx_background_zero[n, m, :N_bkg_zero[n, m]],
                                                               idx_background_nonzero[n, m, :N_bkg_nonzero[n, m]]);
      pos += N_channels_used[n, m];
    }
  }

  // sum at the end for full vectorization
  target += sum(log_like);
}


This works fine, but the many, many matrix-vector product operations are slow. One could imagine using map_rect to split the work up, but I suspect the packing and unpacking of all the data and responses would eat up any gain in speed. Is there a more efficient way to do this many vector-matrix products? It is not always clear how Stan handles these operations internally.
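For reference, a map_rect version would look roughly like the sketch below. This is a minimal toy, not the model above: it assumes a simple Poisson likelihood in place of pgstat, assumes equal (or padded) channel counts per shard, and the name shard_ll and the packing layout are made up for illustration.

functions {
  // log-likelihood for one shard (one interval/detector pair).
  // x_r packs the flattened M x N response followed by the exposure;
  // x_i packs M, N, and then the N observed counts.
  vector shard_ll(vector phi, vector theta, data array[] real x_r, data array[] int x_i) {
    int M = x_i[1];
    int N = x_i[2];
    matrix[M, N] R = to_matrix(x_r[1:(M * N)], M, N); // column-major unflatten
    real exposure = x_r[M * N + 1];
    row_vector[M] spectrum = theta[1:M]';             // shard-local latent spectrum
    vector[N] mu = (spectrum * R)' * exposure;        // the per-shard vector-matrix product
    return [poisson_lpmf(x_i[3:(N + 2)] | mu)]';      // toy Poisson stand-in for pgstat
  }
}

The model block would then call something like `target += sum(map_rect(shard_ll, phi, theta, x_r, x_i));`, with one entry of theta, x_r, and x_i per interval/detector pair. Whether the per-shard work (a single ~150x128 product) is large enough to amortize the packing overhead is exactly the open question.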

To give an idea of the size: each matrix is typically 150x128, and I have ~500 of them in the data, but I would really like to have even more. The matrices are only about 50-60% sparse, so there is no gain from using sparse math. Perhaps there is a clever way to vectorize this that I'm not immediately seeing?
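For scale, a rough count of the work those products imply (my arithmetic, from the sizes above):

$$
500 \times 150 \times 128 \approx 9.6 \times 10^{6}
$$

multiply-adds per log-density evaluation, with the reverse-mode gradient pass costing a comparable amount again on top of that.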

Is it this line?

expected_model_counts[n, m, :N_chan[n, m]] = latent_spectrum * response[n, m, :N_echan[n, m], :N_chan[n, m]]) * exposure[n, m])';

All the row vectors are different and all the matrices are different, so it seems unlikely there's a faster way to do it. The thing to look for in this sort of situation is shared operations, and it doesn't look like there are any here.
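To illustrate what a shared operation would buy you (hypothetical; it does not apply here, since every detector has its own response): if all K spectra were folded through one common response R, you could stack them and replace K vector-matrix products with a single matrix-matrix product:

// hypothetical: only valid if every detector shared the same response R
matrix[K, M] S;          // K latent spectra stacked as rows
matrix[M, N] R;          // one shared response
matrix[K, N] C = S * R;  // one matrix-matrix product instead of K vector-matrix products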

Is there a parenthesis mismatch in that line?

You should make sure the scalar exposure[n,m] is not being multiplied onto the matrix; that will be less efficient than multiplying it onto a vector.
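Concretely, a sketch of the two orderings (my version of the line above, with the parentheses balanced):

// scales every entry of the M x N matrix: ~M*N extra multiplies per call
expected_model_counts[n, m, :N_chan[n, m]] = (latent_spectrum * (response[n, m, :N_echan[n, m], :N_chan[n, m]] * exposure[n, m]))';

// scales only the length-M row vector before the product: ~M extra multiplies
expected_model_counts[n, m, :N_chan[n, m]] = ((exposure[n, m] * latent_spectrum) * response[n, m, :N_echan[n, m], :N_chan[n, m]])';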

Thanks! The parenthesis mismatch is a mistake from copying out of emacs. But it is as I feared… I'll just have to wait a long time!
