New paper, sum-constrained Beta-binomial: sccomp: Robust differential composition and variability analysis for single-cell data

It’s a pleasure to announce that our Stan compositional model, based on a sum-constrained Beta-binomial, is out in PNAS.

https://doi.org/10.1073/pnas.2203828120

We propose a sum-constrained Beta-binomial as a flexible compositional distribution that allows for missing values (e.g. outlier detection and exclusion).
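To make the idea concrete, here is a minimal Python sketch of such a likelihood, not the Stan code from the paper. It assumes a mean/precision parameterization (`alpha = mu * phi`, `beta = (1 - mu) * phi`), a softmax to make the group means sum to one, and a hypothetical `keep` mask that drops excluded observations from the likelihood. All function and argument names are illustrative.

```python
import math


def log_beta(a, b):
    """Log of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(y, n, mu, phi):
    """Log-probability of y out of n under a Beta-binomial with
    mean mu and precision phi (assumed parameterization)."""
    a, b = mu * phi, (1.0 - mu) * phi
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + log_beta(y + a, n - y + b) - log_beta(a, b)


def softmax(x):
    """Map unconstrained values to proportions that sum to one."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def sample_loglik(counts, logits, phis, keep=None):
    """Log-likelihood of one sample's cell-group counts.

    The means are tied together by the softmax (the sum constraint),
    but each group contributes its own Beta-binomial term, so a group
    flagged as an outlier (keep[k] is False) can simply be skipped.
    """
    n = sum(counts)  # assumption: exposure is the sample's total cell count
    mus = softmax(logits)
    if keep is None:
        keep = [True] * len(counts)
    return sum(
        beta_binomial_logpmf(y, n, mu, phi)
        for y, mu, phi, k in zip(counts, mus, phis, keep)
        if k
    )


# Usage: three cell groups, with the second one excluded as an outlier.
print(sample_loglik([30, 50, 20], [0.0, 0.5, -0.5], [10.0, 10.0, 10.0],
                    keep=[True, False, True]))
```

Because each group has its own term and its own precision, dropping one observation does not invalidate the others, which is harder to arrange with a joint Dirichlet-multinomial likelihood.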

We show that in many contexts the log mean and log overdispersion are associated, and that the Dirichlet-multinomial provides biased estimates of variability (Panel F).


Sum-constrained Beta-binomials modelling mean–variability associations are adequate for experimental data from 6 studies (SI Appendix, Table S1). (A) Study of the correlation between the proportion mean and variability (see Methods subsection Study of mean–variability association). The left facets show the association between mean and variability estimates without constraints on their relationship. The dotted line is the fit of robust linear modeling [rlm (26)]. The middle facets plot the rlm residuals versus fitted values with a lowess smoother superimposed. The facets on the right show a decrease in the size of the 95% credible intervals for all datasets. Only changes larger than 0.5 are shown (increase or decrease). (B) The four main steps of the sccomp algorithm (see Methods section Study of model adequacy to experimental data). (C) Example of the posterior-predictive check with the simulated data over the observed data [colorful boxplot, COVID-19 dataset EGAS00001004481 (4); blue boxplots]. The subset of cell groups showing a larger effect is visualized. The color code expresses the magnitude of the difference estimated by sccomp across biological conditions, critical and moderate. (D) Scatter plot of the observed versus simulated cell-group proportions for 6 datasets. Datasets are labeled by their numeric IDs (SI Appendix, Table S1). Each point corresponds to the proportion of a sample-cell group combination, and each line corresponds to a cell group. The slopes of fitted lines describe the match between observed and generated data for one group (paired by their ranks), which is expected to be 1 when the two distributions are the same. The dashed gray line represents a perfect linear match. (E) The distribution of slopes of the scatter plots (panel D). (F) Association between the slopes of the scatter plots of the observed (y axis) and each group’s estimated (x axis) proportion abundance. The sum-constrained Beta-binomial (scBb) and Dirichlet-multinomial (Dm) are compared. If the data simulated from the posterior predictive distribution are similar to the observed data, we expect a straight horizontal line with intercept 1.
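The rank-paired slope described for panels D to F can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used for the figure: it takes one cell group's observed and simulated proportions, pairs them by rank, and fits an ordinary least-squares slope with observed on the y axis. The function name `rank_paired_slope` is made up for this example.

```python
def rank_paired_slope(observed, simulated):
    """Least-squares slope of rank-paired observed (y) on simulated (x)
    proportions for a single cell group."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    x = sorted(simulated)  # sorting both vectors pairs them by rank
    y = sorted(observed)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("simulated proportions have zero variance")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Usage: a slope near 1 suggests the simulated proportions have the same
# spread as the observed ones; a slope above 1 suggests the simulation is
# less variable than the data.
observed = [0.10, 0.30, 0.20, 0.40]
simulated = [0.15, 0.25, 0.20, 0.30]
print(rank_paired_slope(observed, simulated))
```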

Very interesting, thanks for posting!

Also interesting to see rlm outperforms limma.

I wonder how far one could get with a Bayesian Beta-binomial model that is hierarchical but that addresses compositionality in a more naive way, or ignores it altogether. As far as I can tell from Table 1, nothing falls in this category?

Interesting question. I did not test this, but it would be interesting to do in the future.

Of course, the more categories you have, the less the compositionality matters. But, for example, I find myself having to test anything from two categories (immune vs. non-immune cells) to three categories to many categories (resolved immune cell types), and everything in between.
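As a rough illustration of that first point, consider a plain multinomial (a simplifying assumption, not the sccomp model). The correlation between two category counts is -sqrt(p_i p_j / ((1 - p_i)(1 - p_j))), which for K equally likely categories reduces to -1/(K - 1). The function name below is just for this sketch.

```python
import math


def multinomial_corr(p_i, p_j):
    """Correlation between two category counts of a multinomial
    with category probabilities p_i and p_j."""
    return -math.sqrt(p_i * p_j / ((1.0 - p_i) * (1.0 - p_j)))


# Dependence induced purely by the sum constraint, for K equal categories.
for k in (2, 3, 20):
    print(k, multinomial_corr(1.0 / k, 1.0 / k))
```

With two categories the counts are perfectly anti-correlated, while with twenty the induced correlation is already small, so ignoring compositionality costs far more in the immune vs. non-immune case than with finely resolved cell types.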
