I am implementing a dirichlet_multinomial model (with 1K - 20K categories):
```
model {
  ...
  y[n] ~ dirichlet_multinomial( precision * softmax( X[n] * beta ) );
}
```
I have noticed that no single value of precision fits all the data well. That is, the abundant categories have much more variability than the model explains, and vice versa (the rare categories have less variability). The gene-wise overdispersion seems to be rather more complex than a single number (this is the reason why people model sequencing data with a negative binomial rather than a multinomial, even though the multinomial is the real final process that generates the data).
This leads to:
- if precision is a parameter, the most abundant categories push the precision down, so regressions on rare categories with positive slopes are clearly missed (red dots are the observed data, lines are the generated quantities);
- if I fix a higher precision, the abundant categories include false-positive regressions, and their variability is clearly underestimated (red dots are the observed data, lines are the generated quantities).
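This behaviour can be checked by simulation: for a Dirichlet-multinomial with total count n and concentration precision * pi, the marginal variance of category k is n * pi_k * (1 - pi_k) * (n + precision) / (1 + precision), so a single precision forces the *same* multiplicative overdispersion factor on every category, whatever its abundance. A quick Python sketch (all names and values here are illustrative, not taken from the model above):

```python
import numpy as np

rng = np.random.default_rng(0)
G, N, depth, precision = 50, 2000, 10_000, 30.0

pi = rng.dirichlet(np.ones(G))                 # true category proportions
# Dirichlet-multinomial draws: p_i ~ Dirichlet(precision * pi), y_i ~ Multinomial(depth, p_i)
p = rng.dirichlet(precision * pi, size=N)
y = np.array([rng.multinomial(depth, row) for row in p])

mult_var = depth * pi * (1 - pi)               # plain multinomial variance per category
emp_var = y.var(axis=0)                        # empirical Dirichlet-multinomial variance
ratio = emp_var / mult_var                     # per-category overdispersion factor
# theory: ratio ~ (depth + precision) / (1 + precision), identical for every category
print(np.median(ratio), (depth + precision) / (1 + precision))
```

So if the real data show overdispersion that grows (or shrinks) with abundance, no single precision can reproduce it.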
I think I need a more flexible implementation where I can set either:
- a precision/overdispersion proportional to the expected value, or
- an independent precision/overdispersion for every category.
Something like
```
y[n] ~ dirichlet_like_multinomial( vector[N] precision , simplex[N] theta );
```
where precision is either a function of the proportions (of the simplex), or a vector of independent values that simply happen to be correlated with the proportions.
(I hope it is not too crazy)
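One observation that may help: a per-category precision is, in effect, just the general Dirichlet-multinomial with an unconstrained concentration vector, alpha = precision .* theta, since the usual single-precision model is the special case alpha = precision * theta with scalar precision. A minimal Python sketch of the log-pmf (the function name and signature are mine, mirroring what a user-defined Stan `_lpmf` might look like):

```python
import numpy as np
from math import lgamma

def dirichlet_multinomial_lpmf(y, alpha):
    """Log-pmf of the Dirichlet-multinomial with concentration vector alpha.

    Passing alpha = precision_vector * theta gives a per-category precision;
    a scalar precision is the special case where precision_vector is constant.
    """
    n = y.sum()
    a0 = alpha.sum()
    out = lgamma(a0) + lgamma(n + 1) - lgamma(n + a0)
    for yk, ak in zip(y, alpha):
        out += lgamma(yk + ak) - lgamma(ak) - lgamma(yk + 1)
    return out
```

With alpha = (1, ..., 1) this reduces to a uniform distribution over compositions, which gives a quick sanity check.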
The generative model I am thinking of:

biology →
gene_expression_expected_value(gene 1…G, person 1) →
NB_2(…, overdispersion) →
softmax( all genes ) →
multinomial() →
observed_counts_person_1
and a regression is built over many persons, each characterised by a value of a covariate of interest,
with overdispersion = (or ~) function(gene_expression_expected_value), i.e. the overdispersion depends on the value of gene_expression_expected_value.
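The chain above can be sketched as a forward simulation. This is only an illustration: the mean-dispersion link phi = f(mu) below is an assumed form (chosen so that abundant genes are *more* overdispersed, matching the observation above), NB_2 is drawn via its Gamma-Poisson mixture, and the "softmax" step is realised as a plain normalisation of the latent abundances (softmax over log-abundances is the same operation).

```python
import numpy as np

rng = np.random.default_rng(1)
G, depth = 1000, 100_000

# hypothetical biology: expected expression per gene, and an assumed
# mean-dependent overdispersion link (abundant genes more overdispersed)
mu = rng.lognormal(mean=2.0, sigma=1.5, size=G)
phi = 5.0 / np.sqrt(mu)                      # overdispersion = function(expected value)

# NB_2(mu, phi) via its Gamma-Poisson mixture: Var = mu + mu^2 / phi
lam = rng.gamma(shape=phi, scale=mu / phi)
abundance = rng.poisson(lam).astype(float)   # latent per-gene abundance, person 1

p = abundance / abundance.sum()              # normalisation (softmax over log-abundances)
counts = rng.multinomial(depth, p)           # observed sequencing counts, person 1
```

Repeating this per person, with mu driven by the covariate through a regression, would generate data whose category-level overdispersion varies with abundance, which is exactly what a single-precision Dirichlet-multinomial cannot capture.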