Hi everyone,
I was hoping to get some help specifying a model for some data from a multiplexed sequencing experiment. I can’t share the data but I’ll try to describe the setup concisely.
I have a collection of around 150 sequencing experiments where around 30k “barcode” sequences are counted. The barcodes are unevenly divided among ~2500 genes. Each barcode was counted in an “input” experiment to assess the concentration of each barcode initially present in the 150 experiments. I need to look for the effect of each gene globally, as well as for an interaction between gene and experiment. The model for each output count also needs to include a term to adjust for the total sequencing depth allocated to each experiment.
In short, the likelihood part of the model needs to be something like this:
counts ~ NB2_log(intercept + depth + input + gene + gene:experiment, \phi)
Ideally there would also be some adaptive regularizing priors on the regression coefficients, but right now I’m just trying to get a simple model to run.
The issue is that there are a lot of genes and even more gene:experiment interaction terms. gene + gene:experiment
means roughly 400k regression coefficients. If I were to encode the mean function in the usual X\beta way, X would be roughly 30k rows x 400k columns. In the model I have below, I instead loop over the output observations and use indicator variables to pick out the right regression coefficients to use when modelling each observation.
In the past I’ve gotten a similar model to run by first fitting an MLE to each group, fitting a prior distribution to those MLE’s, then using that prior to model each group separately. This has the downside of not modelling uncertainty in the prior and not allowing for adaptive regularization, which I think might actually be quite important in this case.
Things I’ve tried using a small subset of the data:
- Using the compressed row storage to input a sparse representation of X and do the multiplication X\beta with
csr_matrix_times_vector
(worked out to be slower) - Made the standard deviations of the normal priors on the regression coefficients parameters to be inferred (slower, made the priors too wide)
- within-chain parallelization with
reduce_sum
(helped, but not quite enough) - Using
rstanarm::stan_glmer(sparse=TRUE)
and a variety of the prior options (negligible difference).
There are additional complications I’d like to layer into the model (using additional data that is pure noise, experiment-wise dispersion), but I’m still trying to get something that will run on the full dataset.
Are there any other ways to restructure / reparameterize this to be faster?
This is the simplest Stan model I’ve come up with so far:
data {
int<lower=1> n_gene;
int<lower=1> n_e; // number of experiments
int<lower=1> n_bc;
int<lower=1, upper = n_e> e_i[n_e*n_bc]; // indicator of which experiment each count belongs to
int<lower=1, upper = n_gene> gene_i[n_e*n_bc]; //indicator of which gene each count belongs to
int<lower=1, upper = n_bc> bc_i[n_e*n_bc]; // indicator of which barcode each count
real<lower=0> e_depths[n_e]; // sequencing depths of each experiment
int<lower=0> count_obs[n_e*n_bc];
int<lower=0> input_obs[n_e*n_bc];
}
parameters {
real intercept;
real gene[n_gene]; // global gene effects
real gene_e[n_gene,n_e]; // matrix of gene:experiment interaction effects
real<lower=0> phi;
}
model {
phi ~ gamma(2,2);
gene ~ std_normal();
for (i in 1:n_gene) {
for (j in 1:n_e) {
gene_e[i,j] ~ normal(0,.2);
}
}
for (i in 1:(n_e*n_bc)){
target += neg_binomial_2_log_lpmf(count_obs[i] | intercept +
e_depths[e_i[i]] +
log(input_obs[bc_i[i]]) +
gene[gene_i[i]] +
gene_e[gene_i[i], e_i[i]],
phi);
}
}