Speeding up a hierarchical negative binomial regression with a huge number of groups

Hi everyone,

I was hoping to get some help specifying a model for some data from a multiplexed sequencing experiment. I can’t share the data but I’ll try to describe the setup concisely.

I have a collection of around 150 sequencing experiments where around 30k “barcode” sequences are counted. The barcodes are unevenly divided among ~2500 genes. Each barcode was counted in an “input” experiment to assess the concentration of each barcode initially present in the 150 experiments. I need to look for the effect of each gene globally, as well as for an interaction between gene and experiment. The model for each output count also needs to include a term to adjust for the total sequencing depth allocated to each experiment.

In short, the likelihood part of the model needs to be something like this:

counts ~ NB2_log(intercept + depth + input + gene + gene:experiment, \phi)

Ideally there would also be some adaptive regularizing priors on the regression coefficients, but right now I’m just trying to get a simple model to run.

The issue is that there are a lot of genes and even more gene:experiment interaction terms. gene + gene:experiment means roughly 400k regression coefficients. If I were to encode the mean function in the usual X\beta way, X would be roughly 30k rows x 400k columns. In the model I have below, I instead loop over the output observations and use indicator variables to pick out the right regression coefficients to use when modelling each observation.

In the past I’ve gotten a similar model to run by first fitting an MLE to each group, fitting a prior distribution to those MLEs, and then using that prior to model each group separately. This has the downside of not modelling uncertainty in the prior and not allowing for adaptive regularization, which I think might actually be quite important in this case.

Things I’ve tried using a small subset of the data:

  • Using compressed row storage to pass in a sparse representation of X and doing the multiplication X\beta with csr_matrix_times_vector (this worked out to be slower)
  • Making the standard deviations of the normal priors on the regression coefficients into parameters to be inferred (slower, and it made the priors too wide)
  • Within-chain parallelization with reduce_sum (helped, but not quite enough)
  • Using rstanarm::stan_glmer(sparse=TRUE) with a variety of prior options (negligible difference).

There are additional complications I’d like to layer into the model (using additional data that is pure noise, experiment-wise dispersion), but I’m still trying to get something that will run on the full dataset.

Are there any other ways to restructure / reparameterize this to be faster?

This is the simplest Stan model I’ve come up with so far:

data {
  int<lower=1> n_gene;
  int<lower=1> n_e; // number of experiments
  int<lower=1> n_bc;
  
  int<lower=1, upper = n_e> e_i[n_e*n_bc]; // indicator of which experiment each count belongs to
  int<lower=1, upper = n_gene> gene_i[n_e*n_bc]; //indicator of which gene each count belongs to
  int<lower=1, upper = n_bc> bc_i[n_e*n_bc]; // indicator of which barcode each count belongs to
  
  real<lower=0> e_depths[n_e]; // sequencing depths of each experiment
  
  int<lower=0> count_obs[n_e*n_bc];
  int<lower=0> input_obs[n_e*n_bc];
}

parameters {
  real intercept; 
  real gene[n_gene]; // global gene effects
  real gene_e[n_gene,n_e]; // matrix of gene:experiment interaction effects
  real<lower=0> phi; 
}

model {
  
  phi ~ gamma(2,2);
  gene ~ std_normal();
  for (i in 1:n_gene) {
    for (j in 1:n_e) {
      gene_e[i,j] ~ normal(0,.2);
    }
  }
  
  for (i in 1:(n_e*n_bc)){
    target += neg_binomial_2_log_lpmf(count_obs[i] | intercept + 
                                                     e_depths[e_i[i]] + 
                                                     log(input_obs[bc_i[i]]) + 
                                                     gene[gene_i[i]] + 
                                                     gene_e[gene_i[i], e_i[i]],
                                                     phi);
  }
}


This:

  for (i in 1:n_gene) {
    for (j in 1:n_e) {
      gene_e[i,j] ~ normal(0,.2);
    }
  }

Can be simplified/sped-up as:

to_vector(gene_e) ~ normal(0,.2);

Oops, needed to also make gene_e a matrix (rather than a 2D array). Also, you can vectorize the entire likelihood by using an indicator for the entry in the gene_e matrix when flattened (just remember that flattening occurs in column-major order). Here’s the result:

data {
  int<lower=1> n_gene;
  int<lower=1> n_e; // number of experiments
  int<lower=1> n_bc;
  
  int<lower=1, upper = n_e> e_i[n_e*n_bc]; // indicator of which experiment each count belongs to
  int<lower=1, upper = n_gene> gene_i[n_e*n_bc]; //indicator of which gene each count belongs to
  int<lower=1, upper = n_gene*n_e> gene_e_i[n_e*n_bc]; //indicator of which gene and experiment combination each count belongs to
  int<lower=1, upper = n_bc> bc_i[n_e*n_bc]; // indicator of which barcode each count belongs to
  
  vector<lower=0>[n_e] e_depths; // sequencing depths of each experiment
  
  int<lower=0> count_obs[n_e*n_bc];
  vector<lower=0>[n_e*n_bc] input_obs;
}

parameters {
  real intercept; 
  vector[n_gene] gene; // global gene effects
  matrix[n_gene,n_e] gene_e; // matrix of gene:experiment interaction effects
  real<lower=0> phi; 
}

model {
  
  phi ~ gamma(2,2);
  gene ~ std_normal();
  to_vector(gene_e) ~ normal(0,.2);

  count_obs ~ neg_binomial_2_log(
    intercept
    + e_depths[e_i]
    + log(input_obs[bc_i])
    + gene[gene_i]
    + to_vector(gene_e)[gene_e_i]
    , phi
   );
}
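For reference, gene_e_i above is assumed to be precomputed outside Stan. If it’s easier to build it inside the program, a minimal sketch in transformed data would use the column-major rule that entry (g, e) of the n_gene x n_e matrix lands at flat position (e - 1) * n_gene + g (gene_e_idx is just a new name so it doesn’t clash with the data item):

transformed data {
  // flat index into to_vector(gene_e) for each observation, built from the
  // existing gene and experiment indicators (column-major flattening)
  int gene_e_idx[n_e*n_bc];
  for (i in 1:(n_e*n_bc))
    gene_e_idx[i] = (e_i[i] - 1) * n_gene + gene_i[i];
}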

And if you want to use reduce_sum, here’s the equivalent target += variant to build from (I don’t know how to use reduce_sum yet; I only know that it needs the target += version — my untested guess at the reduce_sum wrapper itself follows the model below):

data {
  int<lower=1> n_gene;
  int<lower=1> n_e; // number of experiments
  int<lower=1> n_bc;
  
  int<lower=1, upper = n_e> e_i[n_e*n_bc]; // indicator of which experiment each count belongs to
  int<lower=1, upper = n_gene> gene_i[n_e*n_bc]; //indicator of which gene each count belongs to
  int<lower=1, upper = n_gene*n_e> gene_e_i[n_e*n_bc]; //indicator of which gene and experiment combination each count belongs to
  int<lower=1, upper = n_bc> bc_i[n_e*n_bc]; // indicator of which barcode each count belongs to
  
  vector<lower=0>[n_e] e_depths; // sequencing depths of each experiment
  
  int<lower=0> count_obs[n_e*n_bc];
  vector<lower=0>[n_e*n_bc] input_obs;
}

parameters {
  real intercept; 
  vector[n_gene] gene; // global gene effects
  matrix[n_gene,n_e] gene_e; // matrix of gene:experiment interaction effects
  real<lower=0> phi; 
}

model {
  
  phi ~ gamma(2,2);
  gene ~ std_normal();
  to_vector(gene_e) ~ normal(0,.2);

  target += neg_binomial_2_log_lupmf( count_obs |
    intercept
    + e_depths[e_i]
    + log(input_obs[bc_i])
    + gene[gene_i]
    + to_vector(gene_e)[gene_e_i]
    , phi
   );
}
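From skimming the parallelization chapter of the Stan User’s Guide, my untested guess at the reduce_sum version is below; the partial-sum signature follows the guide’s example as best I understand it, so someone who has actually used reduce_sum should double-check it:

functions {
  // log-likelihood contribution of one slice of the observations;
  // reduce_sum hands the function a slice of count_obs plus the start and
  // end positions of that slice in the full data
  real partial_sum_lpmf(int[] count_slice, int start, int end,
                        real intercept, vector e_depths, vector log_input_obs,
                        vector gene, vector gene_e_flat, real phi,
                        int[] e_i, int[] bc_i, int[] gene_i, int[] gene_e_i) {
    return neg_binomial_2_log_lupmf(count_slice |
      intercept
      + e_depths[e_i[start:end]]
      + log_input_obs[bc_i[start:end]]
      + gene[gene_i[start:end]]
      + gene_e_flat[gene_e_i[start:end]],
      phi);
  }
}

and then, assuming the data block gains an int<lower=1> grainsize and a transformed data vector log_input_obs = log(input_obs) is added, the model block becomes:

model {
  phi ~ gamma(2,2);
  gene ~ std_normal();
  to_vector(gene_e) ~ normal(0,.2);

  // grainsize controls the chunk size; 1 lets the scheduler choose automatically
  target += reduce_sum(partial_sum_lupmf, count_obs, grainsize,
                       intercept, e_depths, log_input_obs,
                       gene, to_vector(gene_e), phi,
                       e_i, bc_i, gene_i, gene_e_i);
}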

Some additional speedup (probably minor) is available by computing log(input_obs) in transformed data rather than in the model block.
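i.e. something along these lines (a minimal sketch reusing the data declarations above; log_input_obs is just a new name):

transformed data {
  // evaluate the log once when the data are read in, rather than on every
  // leapfrog step inside the model block
  vector[n_e*n_bc] log_input_obs = log(input_obs);
}

with log(input_obs[bc_i]) in the model block replaced by log_input_obs[bc_i].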


Thanks so much guys. Your suggestions made it 40% faster.


Take a look at those additive terms and check whether there’s any redundancy where you could compute the addition (or a subset of it) once, then index into the result to expand it to the length of the observations.

Ah, so I would just need to find the unique combinations of count_obs, input_obs, gene, and gene_e, pass those in along with the pre-computed number of occurrences of each combination, and multiply each term by its count. That’s a good idea. The list of unique outcomes is about 26% shorter than the full data table.
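Roughly what I have in mind, as an untested sketch: here n_unique, n_rep, and the *_u names would all be new data items holding the unique rows of the collapsed table and their multiplicities (with log_input_obs_u the log input count of each unique row; priors as in the model above, omitted here):

model {
  // priors on phi, gene, and gene_e as in the model above

  // linear predictor evaluated once per unique row of the collapsed table
  vector[n_unique] eta = intercept
                         + e_depths[e_i_u]
                         + log_input_obs_u
                         + gene[gene_i_u]
                         + to_vector(gene_e)[gene_e_i_u];

  // weight each unique row's log-likelihood by how many times that exact
  // combination occurred in the full data
  for (i in 1:n_unique)
    target += n_rep[i] * neg_binomial_2_log_lpmf(count_u[i] | eta[i], phi);
}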

Yup! Though I hadn’t been thinking of the count_obs variable too; I guess that’s kinda like the sufficient statistics trick in a binomial model.

I wonder if there might be further speed-ups from looking at the unique combinations of subsets of those variables too? For example, if there are far fewer unique levels of experiment and barcode, compute the sum for those separately, then index into the result when adding the gene effect.