Gene Expression Modelling

BayesianBegginer · September 29, 2022, 11:54am

Hi all,
I’ve got some time set aside over the coming months to learn Stan and was thinking I would try and ‘translate’ DESeq2 because count data is what I spend most of my time on, and the area I’d like to put Stan to work in, and I enjoy learning by putting the concepts to use in projects.
Preliminary searching led me to this discussion (Bayesian approaches to differential gene expression | Statistical Modeling, Causal Inference, and Social Science) and I just wanted to commence by enquiring if anyone in the community is still pursuing any such projects whether it be with a bulk, single-cell or spatial focus?

jd_c · September 29, 2022, 12:14pm

@martinmodrak @Bob_Carpenter @stemangiola

martinmodrak · September 29, 2022, 12:32pm

Not actively pursuing but I have a basic implementation of the DESeq2 model (no predictors so far) in Stan at RNAseq-noise-model/negBinomial_deseq2.stan at master · stemangiola/RNAseq-noise-model · GitHub (there’s a bunch of noise due to our need to do kfold cross validation, so maybe you don’t want to start there, but sharing just in case).

Good luck with the project!

BayesianBegginer · September 30, 2022, 1:44am

Thanks Martin, that will be a great place to start.

stemangiola · September 30, 2022, 7:22am

The model ppcseq uses a hierarchical (with shrinkage) NB regression model for differential analyses. I use that to identify outlier observations

caesoma · September 30, 2022, 9:12am

Differential expression analysis (I would hardly call it modeling, but that’s beside the point) like what DESeq2 uses is essentially a (Generalized) Linear Model, the main features probably being:

translating a generic metadata table into a design matrix for the regression;
computing normalization factors for the samples;
pre-computing dispersion parameters by borrowing information from other genes (since normally there are too few samples to estimate the Negative Binomial dispersion parameter from data together with all other parameters)

None of that really involves anything Bayesian, and while it may be possible it’s likely not worth building that into a Stan program. Translating the estimation part into Stan is just writing a GLM in Stan; there are plenty of examples and tutorials for doing that without spending time on non-Bayesian, non-Stan work.

I have “translated” some of DESeq2 for this work on a large gene expression data set, but as I mentioned this mostly consisted in rewriting the normalization computations in Python and writing a Hierarchical GLM (we needed to account for random effects of replicates and DESeq2 cannot do it). We didn’t have to bother with the pre-computed dispersion parameters because we had hundreds of samples so we could estimate it as any other parameter (which is good, because from a statistical standpoint precomputing anything related to variance parameters seems extremely sketchy to me). The data and code isn’t available yet, but will be sometime soon, hopefully.

I believe there are other packages that implement this kind of analysis using bayesian inference, but from a modeling perspective there’s probably nothing especially interesting compared to a generic GLM.

BayesianBegginer · October 1, 2022, 1:26am

I’ll gladly accept the semantic correction that this isn’t modeling (as in using ODE to model gene expression). Perhaps I should have said implementing generalized linear models in Stan to analyse gene expression.
The aim for me with this project is not so much about a novel approach as much as just to take a familiar tool and apply that in Stan to get started with the language. I look forward to reading the linked paper and accompanying data and code, thank you for sharing.

BayesianBegginer · October 1, 2022, 1:30am

Thank you for linking those Stefano, it has already proven helpful reading through the Stan source code.

caesoma · October 1, 2022, 10:37pm

Yes, it’s not really relevant, it’s just to emphasize the actual model (and therefore its Stan translation) consist of a generally straightforward program.

That said, it’s also true that working closely on a simple project may be more instructive than working superficially on several more complicated ones, as far as learning a language or approach goes.

So good luck, and welcome to the forum. This is the right place to ask questions about the language in practice.

Topic		Replies	Views
Published today (NAR GAB). Stan used for outlier detection for genetic analyses (RNA sequencing) Publicity rstan	2	576	March 4, 2021
Community feedback (acknowledged/authored) before submission to open archive (topic: posterior predictive check of linear models for differential RNA abundance) Publicity bioinformatics	9	845	April 9, 2020
Looking for first-author collaborator to complete sccomp (outlier-aware multinomial regression tool), based on ppcseq we have just published General	0	308	March 5, 2021
Call for abstracts: StanConnect 2021 - Biomedical Statistics Events stanconnect	18	1523	October 19, 2021
Bayesian Benchmarking 1.0 General	6	665	July 20, 2021

Gene Expression Modelling

Related topics