Hi all,
I’ve got some time set aside over the coming months to learn Stan and was thinking I would try and ‘translate’ DESeq2 because count data is what I spend most of my time on, and the area I’d like to put Stan to work in, and I enjoy learning by putting the concepts to use in projects.
Preliminary searching led me to this discussion (Bayesian approaches to differential gene expression | Statistical Modeling, Causal Inference, and Social Science) and I just wanted to commence by enquiring if anyone in the community is still pursuing any such projects whether it be with a bulk, single-cell or spatial focus?
Not actively pursuing but I have a basic implementation of the DESeq2 model (no predictors so far) in Stan at RNAseq-noise-model/negBinomial_deseq2.stan at master · stemangiola/RNAseq-noise-model · GitHub (there’s a bunch of noise due to our need to do kfold cross validation, so maybe you don’t want to start there, but sharing just in case).
Good luck with the project!
Thanks Martin, that will be a great place to start.
The model ppcseq uses a hierarchical (with shrinkage) NB regression model for differential analyses. I use that to identify outlier observations
Differential expression analysis (I would hardly call it modeling, but that’s beside the point) like what DESeq2 uses is essentially a (Generalized) Linear Model, the main features probably being:
- translating a generic metadata table into a design matrix for the regression;
- computing normalization factors for the samples;
- pre-computing dispersion parameters by borrowing information from other genes (since normally there are too few samples to estimate the Negative Binomial dispersion parameter from data together with all other parameters)
None of that really involves anything Bayesian, and while it may be possible it’s likely not worth building that into a Stan program. Translating the estimation part into Stan is just writing a GLM in Stan; there are plenty of examples and tutorials for doing that without spending time on non-Bayesian, non-Stan work.
I have “translated” some of DESeq2 for this work on a large gene expression data set, but as I mentioned this mostly consisted in rewriting the normalization computations in Python and writing a Hierarchical GLM (we needed to account for random effects of replicates and DESeq2 cannot do it). We didn’t have to bother with the pre-computed dispersion parameters because we had hundreds of samples so we could estimate it as any other parameter (which is good, because from a statistical standpoint precomputing anything related to variance parameters seems extremely sketchy to me). The data and code isn’t available yet, but will be sometime soon, hopefully.
I believe there are other packages that implement this kind of analysis using bayesian inference, but from a modeling perspective there’s probably nothing especially interesting compared to a generic GLM.
I’ll gladly accept the semantic correction that this isn’t modeling (as in using ODE to model gene expression). Perhaps I should have said implementing generalized linear models in Stan to analyse gene expression.
The aim for me with this project is not so much about a novel approach as much as just to take a familiar tool and apply that in Stan to get started with the language. I look forward to reading the linked paper and accompanying data and code, thank you for sharing.
Thank you for linking those Stefano, it has already proven helpful reading through the Stan source code.
Yes, it’s not really relevant, it’s just to emphasize the actual model (and therefore its Stan translation) consist of a generally straightforward program.
That said, it’s also true that working closely on a simple project may be more instructive than working superficially on several more complicated ones, as far as learning a language or approach goes.
So good luck, and welcome to the forum. This is the right place to ask questions about the language in practice.