I am fitting a model where I have many observations from several detectors. For each observation, a spectrum vector of size M is multiplied by a detector’s respose which is a matrix of size MxN producing a vector of size N which is then compared to the data.

This essentially looks like this (I’ve left out some details to highlight the main points):

```
int<lower=0> N_dets[N_intervals]; // number of detectors per data type
int<lower=0> N_chan[N_intervals, max(N_dets)]; // number of channels in each detector
int<lower=0> N_echan[N_intervals, max(N_dets)]; // number of energy side channels in each detector
int grb_id[N_intervals]; // index to specify GRBs
int N_grbs; // number of GRBs
vector[max_n_echan] ebounds_hi[N_intervals, max(N_dets)]; // boundaries of the matrix for integration
vector[max_n_echan] ebounds_lo[N_intervals, max(N_dets)];
vector[max_n_chan] observed_counts[N_intervals, max(N_dets)]; // observed counts
vector[max_n_chan] background_counts[N_intervals, max(N_dets)]; // background counts
vector[max_n_chan] background_errors[N_intervals, max(N_dets)]; // error on the background
int idx_background_zero[N_intervals, max(N_dets), max_n_chan]; // where the background is zero
int idx_background_nonzero[N_intervals, max(N_dets), max_n_chan]; // where the background in non-zero
int N_bkg_zero[N_intervals,max(N_dets)]; // number of zero backgrounds
int N_bkg_nonzero[N_intervals,max(N_dets)]; // number of non-zero backgrounds
real exposure[N_intervals, max(N_dets)]; // exposure of the observation
matrix[max_n_echan, max_n_chan] response[N_intervals, max(N_dets)]; // response matrix per interval per detector
```

```
transformed parameters {
real pre_calc[N_intervals, 4];
vector[N_grbs] gamma;
vector[N_grbs] delta;
vector[N_intervals] log_energy_flux;
vector[N_intervals] epeak;
vector[max_n_chan] expected_model_counts[N_intervals, max(N_dets)];
// non-centered parameterization
gamma = gamma_mu + gamma_offset * gamma_sigma;
delta = delta_mu + 52 + delta_offset * delta_sigma;
// correlation between flux and epeak treating GRBs as groups
log_energy_flux = delta[grb_id] + gamma[grb_id] .* (log_epeak + log_zp1 - 2. ) - log_dl2[grb_id];
// compute the folded counts
for (n in 1:N_intervals) {
epeak[n] = 10^log_epeak[n];
// pre-calculations are shared across interval between detectors so only do it once
// norm, ec, epslit, pre
pre_calc[n, :] = band_precalculation(10^log_energy_flux[n], alpha[n], beta[n], epeak[n], emin, emax);
for (m in 1:N_dets[n]) {
// compute the latent spectrum that is folded thru the response
row_vector[N_echan[n, m]] latent_spectrum = to_row_vector(integral_flux(ebounds_lo[n, m, :N_echan[n, m]],
ebounds_hi[n, m, :N_echan[n, m]],
ebounds_add[n, m, :N_echan[n, m]],
ebounds_half[n, m, :N_echan[n, m]],
pre_calc[n,1],
pre_calc[n,2],
pre_calc[n,3],
alpha[n],
beta[n],
pre_calc[n,4])
// expected counts for each interval/detector is a matrix/ vector product
expected_model_counts[n,m,:N_chan[n,m]] = latent_spectrum * response[n, m,:N_echan[n,m],:N_chan[n,m]]) * exposure[n,m])';
}
}
}
```

```
model {
vector[N_total_channels] log_like; // storage for loglike
int pos = 1;
alpha ~ normal(-1,.5);
beta ~ normal(-3,1);
log_epeak ~ normal(2.,1);
gamma_mu ~ normal(1.5, 1);
delta_mu ~ normal(0,4);
gamma_offset ~ std_normal();
delta_offset ~ std_normal();
gamma_sigma ~ normal(0, 5);
delta_sigma ~ normal(0, 5);
for (n in 1:N_intervals) {
for (m in 1:N_dets[n]) {
// compute the loglike for each detector and interval
log_like[pos : pos + N_channels_used[n,m] - 1]= pgstat(observed_counts[n, m, mask[n,m,:N_channels_used[n,m]]],
background_counts[n, m, mask[n,m,:N_channels_used[n,m]]],
background_errors[n, m, mask[n,m,:N_channels_used[n,m]]],
expected_model_counts[n,m, mask[n,m,:N_channels_used[n,m]]],
idx_background_zero[n,m, :N_bkg_zero[n,m]],
idx_background_nonzero[n,m, :N_bkg_nonzero[n,m]] );
pos += N_channels_used[n,m];
}
}
// sum at the end for full vectorization
target += sum(log_like);
}
```

This works fine, but the many many matrix vector product operations are slow. One could imagine using map_rect to split this up, but I’m curious if the packing and unpacking of all the data and responses will probably lead to no gain in speed. I’m curious if there is a more efficient way to do so many vector/matrix product because it is not always clear how Stan handles these things efficiently.

To give an idea of size, each matrix is typically 150x128. I have ~500 of these in the data, but really want to have even more. The matrices are only about 50-60% sparse, so there is no gain in using a sparse math. Perhaps there is a clever way to vectorize that I’m not immediately seeing?