I am trying to fit the following hierarchical model, in which gene expression is estimated over many samples. Currently I have about 300 samples. When I run it with 12 genes as a test, it takes about 7 minutes on our server (32-core 2.4GHz CPU, 512GB RAM). But when I run it with the full data (about 8,000 genes) it takes prohibitively long (I had to kill it after it ran for a week).

I have already tried a non-centered reparametrization for the lognormal part, which was even a bit slower (on the 12-gene test) than the version I am showing below. Could you help me scale up this model? Any advice would be greatly appreciated.

data {
  int<lower=1> M;               // number of genes
  int<lower=1> N;               // number of samples
  real<lower=0> C[N];           // size factor per sample
  int<lower=0> X[M, N];         // observed count matrix
}
parameters {
  real<lower=0> lambda[M, N];
  real<lower=0> shape[M];
  real<lower=0> scale[M];
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> rate[M];
  real<lower=0> adj_lambda[M, N];
  rate = inv(scale);
  for (i in 1:M)
    for (j in 1:N)
      adj_lambda[i, j] = lambda[i, j] * C[j];
}
model {
  for (i in 1:M) {
    X[i] ~ poisson(adj_lambda[i]);
    lambda[i] ~ gamma(shape[i], rate[i]);
  }
  shape ~ gamma(0.01, 0.01);
  scale ~ lognormal(mu, sigma);
  sigma ~ cauchy(0, 5);
}

I also wonder what you think about plugging in point estimates for shape and scale, as in empirical Bayes, perhaps with some large variance.

for (i in 1:M) {
  X[i] ~ poisson(adj_lambda[i]);
  lambda[i] ~ gamma(shape[i], rate[i]);
}

I think you can write this as

X ~ poisson(adj_lambda);
lambda ~ gamma(shape, rate);

which can be faster even when there are no shared parameters.

It would be good to check the treedepth information. Your priors are such that it's possible you have a narrow region that needs a small step size, but you also have a Cauchy prior, which may produce a long tail requiring very long chains.

I think there will be a Stan release with parallelization this year; that could make this roughly 30 times faster on your server.

As Aki suggests, you need to vectorize your loop over the samples… this will dramatically speed things up when you scale up.

The prior on the shape… do you really need to make it an (eps, eps) gamma prior, which we know is not good? Maybe a lognormal on the shape works as well?

You should probably switch to the poisson_log density and calculate lambda in log space.

If you can, switch from the gamma hierarchical model to a normal one in log space. That should fare a lot better performance-wise… so you will at least get an answer in finite time.

If you go to a normal hierarchical model, you should also consider the non-centered parametrization if your data situation is sparse, which it sounds like it is.
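Combining these last three suggestions (poisson_log, a log-space normal hierarchy, and non-centering), a minimal sketch might look like the following. The variable names and the hyperpriors here are my own placeholders, not something from this thread:

data {
  int<lower=1> M;
  int<lower=1> N;
  row_vector<lower=0>[N] C;
  int<lower=0> X[M, N];
}
transformed data {
  row_vector[N] log_C = log(C);   // precompute the log offset once
}
parameters {
  matrix[M, N] z;                 // standardized log-rates (non-centered)
  vector[M] log_mu;               // per-gene mean log-expression
  vector<lower=0>[M] tau;         // per-gene sd of log-expression
}
model {
  to_vector(z) ~ normal(0, 1);
  log_mu ~ normal(0, 5);          // placeholder hyperprior
  tau ~ normal(0, 1);             // half-normal via the lower bound
  for (i in 1:M)
    X[i] ~ poisson_log(log_mu[i] + tau[i] * z[i] + log_C);
}

Here log(lambda) is log_mu[i] + tau[i] * z[i], so no exp() or log() of parameters is needed inside the likelihood, and the non-centered z keeps the posterior geometry friendlier when per-gene data are sparse.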

Thank you Aki! I've got the following error though.

No matches for:
int[,] ~ poisson(real[,])
...
require real scalar return type for probability function.
error in 'unknown file name' at line 27, column 29
-------------------------------------------------
25: }
26: model {
27: X ~ poisson(adj_lambda);
^
28: lambda ~ gamma(shape, rate);
-------------------------------------------------

Just for clarification: pooling is done across the N samples for lambda within each gene, and then across the M genes for scale. Each layer of the hierarchy is already vectorized in my original model specification. Am I missing something?

Don't use diag_matrix! It creates a ton of zeros on the AD stack, which slows things down. Looping is faster here.

However, I went too fast over your model, sorry. To vectorize it, what I would try is to cast X into an MN-long array. Then you also create an indexing int array of size MN. This index array holds N ones, then N twos, and so on; let's call it index. Then you replace the loop over M with vectorized statements that use shape[index] and rate[index].

Just try it… I don't know if the gamma is vectorized like this. Maybe just put a call with the right signature in the Stan model and check that it compiles. This would also help to make things faster.

Because I am partially pooling the lambdas, I ended up using your trick on the hyperparameters of the gamma, but it was slower (~13 minutes) than my original model (~7 minutes) on my 12-gene test set :( But thank you very much @wds15.

...
transformed data {
  int<lower=1> MN = M * N;
  int<lower=0> X_flat[MN];
  real<lower=0> C_flat[MN];
  int<lower=0> index[MN];
  for (i in 1:M) {
    for (j in 1:N) {
      X_flat[(i - 1) * N + j] = X[i, j];
      C_flat[(i - 1) * N + j] = C[j];
      index[(i - 1) * N + j] = i;
    }
  }
}
...
transformed parameters {
  ...
  real<lower=0> adj_lambda[MN];
  for (k in 1:MN)
    adj_lambda[k] = lambda[k] * C_flat[k];
  ...
}
model {
  X_flat ~ poisson(adj_lambda);
  lambda ~ gamma(shape[index], rate[index]);
  shape ~ lognormal(mu_shape, sigma_shape);
  scale ~ lognormal(mu_scale, sigma_scale);
  sigma_shape ~ cauchy(0, 5);
  sigma_scale ~ cauchy(0, 5);
}

A last idea: assuming you have many more genes than samples, you should probably reverse the order of the looping. So instead of looping over the many genes, you loop over the samples. Essentially you would need to transpose your data. Should you try this, I would be interested in the results, as gene data could cross my table as well.
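A sketch of what the transposed loop might look like (X_t, lambda_t, and the hyperpriors are my own names and placeholders). With ~8,000 genes and ~300 samples, the model block loops 300 times instead of 8,000, and because shape and rate are gene-indexed vectors, the gamma prior vectorizes over all genes inside each iteration:

data {
  int<lower=1> M;
  int<lower=1> N;
  real<lower=0> C[N];
  int<lower=0> X_t[N, M];          // transposed counts: sample-major
}
parameters {
  vector<lower=0>[M] lambda_t[N];  // one gene-length vector per sample
  vector<lower=0>[M] shape;
  vector<lower=0>[M] rate;
}
model {
  for (j in 1:N) {
    X_t[j] ~ poisson(lambda_t[j] * C[j]);  // vectorized over the M genes
    lambda_t[j] ~ gamma(shape, rate);      // gene-wise shapes and rates
  }
  // hyperpriors on shape and rate omitted; placeholders only
}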

No luck :( It seems I cannot use the "array of vectors" trick because poisson wants the left-hand side to be an integer array. I am still interested in making this run faster, as many people would benefit from the model. Thanks @wds15.

No matches for:
row vector ~ poisson(real[])
Available argument signatures for poisson:
int ~ poisson(real)
int ~ poisson(real[])
int ~ poisson(vector)
int ~ poisson(row vector)
int[] ~ poisson(real)
int[] ~ poisson(real[])
int[] ~ poisson(vector)
int[] ~ poisson(row vector)

That's not because of diag_matrix. We should've never used diag_matrix because it's so inefficient. With it in the manual, people try to use it. I think I at least added a warning in the last round of manual updates.

Thanks for your clarification @Bob_Carpenter. It would be great if you could comment on the following model (currently my fastest one) when you have a chance.

data {
  int<lower=1> M;               // number of genes
  int<lower=1> N;               // number of samples
  row_vector<lower=0>[N] C;     // size factors (offsets)
  int<lower=0> X[M, N];         // read counts
}
parameters {
  row_vector<lower=0>[N] lambda[M];
  real<lower=0> shape[M];
  real<lower=0> scale[M];
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> rate[M];
  row_vector<lower=0>[N] adj_lambda[M];
  rate = inv(scale);
  for (i in 1:M)
    adj_lambda[i] = lambda[i] .* C;
}
model {
  for (i in 1:M) {
    X[i] ~ poisson(adj_lambda[i]);
    lambda[i] ~ gamma(shape[i], rate[i]);
  }
  shape ~ gamma(0.01, 0.01);
  scale ~ lognormal(mu, sigma);
  sigma ~ cauchy(0, 5);
}

@wds15 Sorry, I use Python, so I am posting a data file of 128 genes only.

Here's the model I tried following your advice, @wds15. I transposed X and reconfigured index so that to_array_1d (which traverses a matrix in column-major order) loads genes.

Can you recode your problem using a negative binomial distribution instead of a gamma-Poisson mixture?
If you can, directly code the negative binomial function and use a look-up table for the log-gamma values, reusing it for the whole dataset.

So you could drop scale and cut down on the number of parameters.

You have one more level for the rate part of the model. Nothing wrong with that, but it might point to something you have not thought about. Maybe some parts of your model are too complex, or another is too simple?
prior --> mu/sigma --> rate --> lambda --> X: partial pooling on rate
prior --> shape --> lambda --> X: some regularisation but no partial pooling on shape.

The combination of the half-Cauchy and lognormal priors gives a very fat tail. I got >3% infinities when I tried to simulate that in R. You also have no prior on mu. I am not smart enough to see the implications of that, but tighter priors would help, I guess.

None of this will lead to large improvements though.

Thanks @stijn! I had to introduce scale because Stan only allows the gamma(shape, rate) parametrization, but I want partial pooling on scale rather than on rate. Could you share your intuition on what hyperprior distributions I should use for mu and sigma?

Rate ~ lognormal(-mu, sigma) implies partial pooling on scale. So even if you remove scale, the partial pooling on scale is still in your model.
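The identity behind this (if scale ~ lognormal(mu, sigma), then rate = 1/scale ~ lognormal(-mu, sigma), because log(1/scale) = -log(scale)) is easy to check numerically. A quick sketch in Python (the OP mentioned using Python; the specific mu and sigma values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7

# Draw scale ~ lognormal(mu, sigma) and take its reciprocal.
scale = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
rate = 1.0 / scale

# log(rate) = -log(scale), so log(rate) should look normal(-mu, sigma).
print(np.mean(np.log(rate)))   # should be close to -mu
print(np.std(np.log(rate)))    # should be close to sigma
```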

Tricky. But I will give it a try. Unless you have strong priors on the shape of the prior, I think you don't need mu, and I would set mu = 0. On sigma, I would use a student_t(df = 7). A student_t(df = 7, loc = 0, scale = 1) prior implies Pr(0.20 < rate < 5) = 87% and Pr(0.01 < rate < 100) = 99%. I cannot assess whether that is acceptable as a prior or not.

The best thing to do is actually simulate lambdas from the priors and verify whether they make sense.
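A minimal prior-predictive sketch along these lines in Python (the simulation mentioned above was in R; the distribution choices mirror the model in the thread, while the sample size, seed, and the 1e6 "huge draw" threshold are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# sigma ~ half-Cauchy(0, 5): absolute value of a scaled standard Cauchy
sigma = np.abs(5.0 * rng.standard_cauchy(n))
# scale ~ lognormal(0, sigma), taking mu = 0 as suggested above
scale = rng.lognormal(mean=0.0, sigma=sigma)
# shape ~ gamma(0.01, 0.01) as in the original model (rate parametrization)
shape = rng.gamma(shape=0.01, scale=1.0 / 0.01, size=n)
# lambda ~ gamma(shape, rate) with rate = 1 / scale
lam = rng.gamma(shape=shape, scale=scale)

# How often does the prior produce overflow or absurdly large draws?
frac_nonfinite = np.mean(~np.isfinite(lam))
frac_huge = np.mean(lam > 1e6)
print(frac_nonfinite, frac_huge)
```

If a noticeable fraction of draws is non-finite or enormous, that is a sign the hyperpriors need tightening before worrying about sampler speed.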