It’s a pleasure to announce that our Stan compositional model, based on a sum-constrained Beta-binomial distribution, is out in PNAS:

https://doi.org/10.1073/pnas.2203828120

We propose a sum-constrained Beta-binomial as a flexible compositional distribution that allows for missing values (e.g. outlier detection and exclusion).
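As a rough illustration of how a likelihood of this kind can tolerate missing values, the sketch below evaluates a Beta-binomial log-likelihood for one sample and skips any cell group whose count has been excluded. The mean/precision parameterization (`mu`, `phi`), the softmax used to make the expected proportions sum to one, and all function names are assumptions made for this example; they are not taken from the sccomp Stan code.

```python
import math


def softmax(logits):
    """Map unconstrained values to proportions that sum to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_beta(a, b):
    """Log of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, mu, phi):
    """Beta-binomial log-pmf with mean mu (0 < mu < 1) and precision phi > 0."""
    a = mu * phi
    b = (1.0 - mu) * phi
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def sample_loglik(counts, total, logits, phis):
    """Log-likelihood of one sample; counts set to None are treated as excluded.

    Assumption: the sum constraint acts on the means (via softmax), and an
    excluded observation simply contributes nothing to the likelihood.
    """
    mus = softmax(logits)
    loglik = 0.0
    for k, mu, phi in zip(counts, mus, phis):
        if k is None:  # excluded (e.g. flagged as an outlier)
            continue
        loglik += beta_binomial_logpmf(k, total, mu, phi)
    return loglik


# Hypothetical sample: 10 cells in total, second cell group excluded.
example = sample_loglik([3, None, 5], 10, [0.0, 0.0, 0.0], [5.0, 5.0, 5.0])
```

Because each cell group has its own marginal likelihood term, dropping one term leaves the others well defined, which is what makes exclusion straightforward in this sketch.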

We show that, in many contexts, the log mean and the log overdispersion are associated, and that the Dirichlet-multinomial provides biased estimates of variability (Panel F).
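One simple way to look for such an association is to regress log overdispersion on log mean across cell groups. The sketch below does this with ordinary least squares, which is swapped in here for brevity (the figure caption mentions robust linear modeling, rlm). The function names are hypothetical and the input values are synthetic, generated from a known power law rather than taken from any dataset.

```python
import math


def ols_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def log_log_association(means, overdispersions):
    """Fit log(overdispersion) against log(mean); both inputs must be positive."""
    log_means = [math.log(m) for m in means]
    log_over = [math.log(v) for v in overdispersions]
    return ols_fit(log_means, log_over)


# Synthetic, noise-free values: overdispersion = 2 * mean^(-0.5).
means = [0.01, 0.05, 0.1, 0.2, 0.4]
overdispersions = [2.0 * m ** -0.5 for m in means]
slope, intercept = log_log_association(means, overdispersions)
```

On these synthetic values the fit recovers the exponent used to generate them; with real estimates, a slope clearly different from zero would indicate a mean–overdispersion association.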

Sum-constrained Beta-binomials modelling mean–variability associations are adequate for experimental data from 6 studies (*SI Appendix*, Table S1). (*A*) Study of the correlation between the proportion mean and variability (see *Methods* subsection *Study of mean–variability association*). The left facets refer to mean and variability estimates association without constraints on their relationship. The dotted line is the fit of robust linear modeling [rlm (26)]. The middle facets plot the rlm residuals versus fitted values with a lowess smoother superimposed. The facets on the right show a decrease in the size of the 95% credible intervals for all datasets. Only changes larger than 0.5 are shown (increase or decrease). (*B*) The four main steps of the sccomp algorithm (see *Methods* section Study of model adequacy to experimental data). (*C*) Example of the posterior-predictive check with the simulated data over the observed data [colorful boxplot, COVID-19 dataset EGAS00001004481 (4); blue boxplots]. The subset of cell groups showing a larger effect is visualized. The color code expressed the magnitude of the difference estimated by sccomp across biological conditions, critical and moderate. (*D*) Scatter plot of the observed versus simulated cell-group proportions for 6 datasets. Datasets are labeled by their numeric IDs (*SI Appendix*, Table S1). Each point corresponds to the proportion of a sample-cell group combination, and each line corresponds to a cell group. The slopes of fitted lines describe the match between observed and generated data for one group (paired by their ranks), which is expected to be 1 when the two distributions are the same. The dashed gray line represents a perfect linear match. (*E*) The distribution of slopes of the scatter plots (panel *D*). (*F*) Association between the slopes of the scatter plots of the observed (*y* axis) and each group’s estimated (*x* axis) proportion abundance. The sum-constrained Beta-binomial (scBb) and Dirichlet-multinomial (Dm) are compared. If the data simulated from the posterior predictive distribution are similar to the observed data, we expect a straight horizontal line with intercept 1.
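The rank-paired slope described for panel *D* can be sketched as follows: sort the observed and the simulated proportions of one cell group, pair them by rank, and fit a straight line. This is a minimal sketch under assumptions, not the paper's code: the function name is hypothetical, ordinary least squares is used for the fit, observed values are placed on the y axis and simulated values on the x axis, and both inputs are assumed to have the same number of samples.

```python
def rank_paired_slope(observed, simulated):
    """Slope of observed vs simulated proportions after pairing by rank.

    A slope near 1 suggests that the two distributions have a similar spread.
    """
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    xs = sorted(simulated)  # x axis: simulated proportions (assumed orientation)
    ys = sorted(observed)   # y axis: observed proportions
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical proportions for one cell group across three samples.
same_spread = rank_paired_slope([0.1, 0.3, 0.2], [0.2, 0.1, 0.3])
wider_observed = rank_paired_slope([0.2, 0.6, 0.4], [0.1, 0.2, 0.3])
```

In this toy input the first call pairs identical sorted values, while the second uses observed values that are twice as spread out as the simulated ones, which is the kind of mismatch a slope away from 1 is meant to flag.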