Hello I am trying to model multiple sets of counts with:

```
data{
// Dimensions of input gene x cell matrix
int<lower=1> Ngenes;
int<lower=1> Ncells;
// Observed UMI counts for [Ngenes] genes across [Ncells] cells
array[Ngenes, Ncells] int counts;
// Logged total counts per-cell
vector<lower=0>[Ncells] ltotal_counts;
// Initial guestimates for the lfraction parameters
vector[Ngenes] init_lfraction;
}
parameters{
// Logged estimated fraction of counts taken up by a gene across cells
vector<lower=-100,upper=0>[Ngenes] lfraction;
// Inverse overdispersion factor
vector<lower=1e-15,upper=1e15>[Ngenes] theta;
}
model{
// Allocated vector for expected per-cell count
vector[Ncells] mu;
// Priors
lfraction ~ normal( init_lfraction , 5 ); // lfraction estimates are assumed to be nearby initial guestimate
theta ~ exponential( 0.1 ); // theta estimates past 1000 are unlikely
// Iterate over all genes in counts matrix
for (g in 1:Ngenes) {
// {mu = fraction * total_counts} is equivalent to {mu = exp( log(fraction) + log(total_counts) )}
mu = exp(lfraction[g] + ltotal_counts);
// Likelihood
counts[g ,] ~ neg_binomial_2(mu , theta[g]);
}
}
```

Currently I am testing the code with `counts`

being an input matrix of 5 by 2000(represented by a 2d-array) and sampling takes about 28.9 seconds on average for [1000 samples, 1500 warm-ups]. However, if I use the following code which just models one set of counts:

```
data{
// Number of cells
int<lower=1> Ncells;
// Observed UMI counts for 1 gene across [Ncells] cells
array[Ncells] int counts;
// Logged total counts per-cell
vector<lower=0>[Ncells] ltotal_counts;
// Initial guestimate for the lfraction parameter
real init_lfraction;
}
parameters{
// Logged estimated fraction of counts taken up by a gene across cells
real<lower=-100,upper=0> lfraction;
// Inverse overdispersion factor
real<lower=1e-15,upper=1e15> theta;
}
model{
// Allocate vector for expected per-cell count
vector[Ncells] mu;
// Priors
lfraction ~ normal( init_lfraction , 5 ); // lfraction estimates are assumed to be nearby initial guestimate
theta ~ exponential( 0.001 ); // theta estimates past 1000 are unlikely
// [mu = fraction * total_counts] is equivalent to [mu = exp( log(fraction) + log(total_counts) )]
mu = exp(lfraction + ltotal_counts);
// Likelihood
counts ~ neg_binomial_2(mu, theta);
}
```

and then loop over the matrix in R(I am using `cmdstanr`

), it takes 21.61 seconds on average instead with the same amount of sampling, which scales much better as extending to model more sets of counts(more genes in the counts matrix), the time difference between these two approaches further increases.

My questions are:

- Why does looping over the matrix in Stan take longer than doing the same in R?
- Is there any way to speed this up? I am hoping to model at least ~2000 genes and I think it’s either I parallelize looping over the matrix in R or parallelize looping over the matrix in Stan, I was hoping Stan would be faster.

Thank you for any help!