Log likelihood computation parallelization with Reduce_sum and map_rect in Cmdstan

Hello

I have a computationally expensive log likelihood calculation that uses an array of 2n big matrixes(16 rows and a large number L of columns) . The code is very slow I would like to parallelize the log likelihood calculation.

The first half (n matrixes) in the array is filled in the transformed data block( those matrixes do not depend on the model parameters). The second half of the array of matrixes is filled in the model block since those matrixes depend on the model parameters.

The L columns of the matrices are independent.

I would like to parallelize the process of filling up the second half of the array of matrices.

I checked the reduce_sum example(https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html)and the map_rect (26.2 Map-rect | Stan User’s Guide)

But the signature of the function f that receives reduce_sum is:

(array[] T x_subset, int start, int end, T1 s1, T2 s2, ...):real

return a real, and don’t allow to update some elements in the array x_subset ( in my case the second half of the array of matrices).

On the other hand, map_rect returns a vector instead of matrix block( I would like to use columns 1 to k, k+1 to 2k, 2k+1 to 3k and so on from the matrixes with indexes 1 to n to update the corresponding columns in the matrixes from n to 2n).

The code for the Stan model that I have is similar to this:

data{
int n;
int L;
......

}
transformed data{

matrix[16,L] array_matrices[2*n];

for(i in 1:n){
//fill the first n matrices 
    }
}
parameters{
  real<lower=0.01> gamma;
 
}
model{

 for( i in 1:L ){//columns of matrices
    for( j in (n+1):(2*n) ) {//number of matrices 
     //set column i in matrices from n+1 to 2n
          col(array_matrices[j],i) = gamma * ((col(array_matrices[j-n],i) .* (col(array_matrices[j-n], i)));
        }
     
        target += log(sum(col(array_matrices[2*n],i)));

}

My question is: Can I achieve this parallelization with Reduce_sum and map_rect functions or should I use an external C+ function?