Modelling multiple sets of counts with neg_binomial_2, doesn't scale well with # of sets

Hello I am trying to model multiple sets of counts with:

data{
    // Dimensions of input gene x cell matrix
    int<lower=1> Ngenes;
    int<lower=1> Ncells;
    
    // Observed UMI counts for [Ngenes] genes across [Ncells] cells
    array[Ngenes, Ncells] int counts;
  
    // Logged total counts per-cell
    vector<lower=0>[Ncells] ltotal_counts;

    // Initial guestimates for the lfraction parameters
    vector[Ngenes] init_lfraction;
}
parameters{
    // Logged estimated fraction of counts taken up by a gene across cells
    vector<lower=-100,upper=0>[Ngenes] lfraction;

    // Inverse overdispersion factor
    vector<lower=1e-15,upper=1e15>[Ngenes] theta;
}
model{
    // Allocated vector for expected per-cell count
    vector[Ncells] mu;

    // Priors
    lfraction ~ normal( init_lfraction , 5 ); // lfraction estimates are assumed to be nearby initial guestimate
    theta ~ exponential( 0.1 ); // theta estimates past 1000 are unlikely
    
    // Iterate over all genes in counts matrix
    for (g in 1:Ngenes) {
        // {mu = fraction * total_counts} is equivalent to {mu = exp( log(fraction) + log(total_counts) )}
        mu = exp(lfraction[g] + ltotal_counts);

        // Likelihood
        counts[g ,] ~ neg_binomial_2(mu , theta[g]);
    }
}

Currently I am testing the code with counts being an input matrix of 5 by 2000(represented by a 2d-array) and sampling takes about 28.9 seconds on average for [1000 samples, 1500 warm-ups]. However, if I use the following code which just models one set of counts:

data{
    // Number of cells
    int<lower=1> Ncells;

    // Observed UMI counts for 1 gene across [Ncells] cells
    array[Ncells] int counts;

    // Logged total counts per-cell
    vector<lower=0>[Ncells] ltotal_counts;

    // Initial guestimate for the lfraction parameter
    real init_lfraction;
}
parameters{
    // Logged estimated fraction of counts taken up by a gene across cells
    real<lower=-100,upper=0> lfraction;

    // Inverse overdispersion factor
    real<lower=1e-15,upper=1e15> theta;
}
model{
    // Allocate vector for expected per-cell count
    vector[Ncells] mu;

    // Priors
    lfraction ~ normal( init_lfraction , 5 ); // lfraction estimates are assumed to be nearby initial guestimate
    theta ~ exponential( 0.001 ); // theta estimates past 1000 are unlikely

    // [mu = fraction * total_counts] is equivalent to [mu = exp( log(fraction) + log(total_counts) )]
    mu = exp(lfraction + ltotal_counts);

    // Likelihood
    counts ~ neg_binomial_2(mu, theta);
}

and then loop over the matrix in R(I am using cmdstanr), it takes 21.61 seconds on average instead with the same amount of sampling, which scales much better as extending to model more sets of counts(more genes in the counts matrix), the time difference between these two approaches further increases.

My questions are:

  1. Why does looping over the matrix in Stan take longer than doing the same in R?
  2. Is there any way to speed this up? I am hoping to model at least ~2000 genes and I think it’s either I parallelize looping over the matrix in R or parallelize looping over the matrix in Stan, I was hoping Stan would be faster.

Thank you for any help!

Can you clarify what you mean when you say that you loop over the matrix in R? My guess is that you are estimating a fundamentally different model by doing this, because in the full model the counts for each gene share the parameter mu. But without a full example in code I’m not certain of what you are doing.

Essentially each row of the matrix should be an independent negative binomial regression. So using the second script I do:

post1all <- lapply(1:5, function(i) {
  vec <- counts_matrix[i,]
  data1 <- list("Ncells" = length(vec),
                "counts" = vec,
                "ltotal_counts" = log(total_counts),
                "init_lfraction" = log(mean(vec / total_counts))
  )
  post1 <- model1$sample(data = data1, chains = 4, parallel_chains = 4, iter_sampling = 1000, iter_warmup = 1500)
})

In the first script instead mu should be overwritten whenever the model is looking at a different gene, which is the mu = exp(lfraction[g] + ltotal_counts); line.