Hierarchical overdispersed count models


I’m new to Stan and Bayesian statistics. I’m trying to create a hierarchical model to regress thousands of variables K in a proteomics experiment where I have dozens of donors N.

Each observation in matrix k is well modeled as a binomial, with the corresponding number of attempts in vector n. The predictor variable for each donor is lor.

The end goal is to regress each variable, account for overdispersion, pool information across variables and estimate differences and fold changes across extremes of the predictor variable.

My first attempt is below, and seems to work well. I’m looking for:

  • Alternatives for modeling overdispersion
  • Suggestions on how to model covariance
  • Any other criticism
data {
  int<lower=1> K;
  int<lower=1> N;
  vector[N] lor;
  int<lower=0> k[K,N];
  int<lower=0> n[N];
parameters {
  real mua;
  real mub;
  real<lower=0> sga;
  real<lower=0> sgb;
  vector[K] alpha;
  vector[K] beta;
model {
  for (i in 1:K) {
    alpha[i] ~ normal(mua, sga);
    beta[i] ~ normal(mub, sgb);
    for (j in 1:N) {
      k[i,j] ~ binomial(n[j], inv_logit(alpha[i] + beta[i] * lor[j]));
generated quantities {
  vector[K] minp;
  vector[K] maxp;
  vector[K] diff;
  vector[K] fold;
  minp = inv_logit(alpha + beta * min(lor));
  maxp = inv_logit(alpha + beta * max(lor));
  diff = maxp - minp;
  fold = maxp ./ minp;
1 Like

Replying to myself on model covariance, I am planning to employ the usual LKJ prior for each correlation matrix i \in 1:K. However, I would be interested in hearing how people usually model the population level prior on \eta to connect it to each LKJ draw.

Furthermore, I am interested in hearing whether there are any interesting alternatives to my overdispersed binomial formulation, where I am simply modeling overdispersion through the latent intercept. I know one alternative is to use a reparametrized beta-binomial into scale and dispersion.

1 Like

I might have missed something about your model, but wouldn’t there be a single correlation matrix and then each pair of (\alpha_i, \beta_i) is drawn from a multivariate normal distribution with this single correlation matrix times the standard deviations?

It doesn’t appear that you are modelling overdispersion, you are just having a different binomial distribution for each group of observations. The Beta-binomial is definitely a sensible choice.

Since it appears the number of trials is shared by all categories, wouldn’t your data be better represented as multinomial (or dirichlet-multinomial to model dispersion)? Note that this would imply that you have one less degree of freedom and need to treat one category as reference (or enforce some other constraint to make the model identified - see e.g. Multinomial logistic regression - Wikipedia for some background.

Best of luck with your model!

1 Like