Estimating parameters of a negative-binomial mixture model

I have tried your suggestion for a true \mu_1 = 20, \mu_2 = 19 and sd = 3. You were right. Adding a prior

mu[2] ~ normal(20, 10);

to the model really did help. Probably this particular prior is too narrow for my application, but at least I should be able to find a sensible upper bound, which should help.

I did not do that before because I had assumed that there should be enough data to pull the estimate towards the true value. I observed an estimate for mu[2] of 1e+153 even for very large amounts of data and am still a bit confused about why the data are not able to overwhelm the uniform prior.

Two other observations:

  • the n_eff is rather low: for 2000 total draws I get an n_eff of around 135 for the parameters, and the mixing looks a bit bumpy.
  • The posterior distribution for mu[1] becomes bimodal. One peak is at around 19, the other is at around 15. The posterior mean is 16.4. So now the lower mu[1] is underestimated.

Thank you for your help! :)

That’s because the two means are so close to each other (a difference of 1) relative to the sd (= 3) that a single normal distribution can describe the whole data set very well, and the other mixture component then runs off (the first mixture component could just as well have run off to minus infinity, leaving the second to cover the data). That’s why it’s so important to be super careful with uniform and especially improper priors.
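To make that concrete, here is a minimal sketch (my own code and variable names, not the model from this thread) of the kind of two-component normal mixture being discussed, with weakly informative priors on both means; the ordered declaration is an additional standard trick to keep one component from running off and to avoid label switching:

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;  // mixing proportion
  ordered[2] mu;                 // mu[1] <= mu[2] by construction
  real<lower=0> sigma;
}
model {
  mu ~ normal(20, 10);           // weakly informative, keeps both means in a plausible range
  sigma ~ normal(0, 5);
  theta ~ beta(2, 2);
  for (i in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[i] | mu[2], sigma),
                      normal_lpdf(y[i] | mu[1], sigma));
}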


For the same reason as before, I would argue that your toy example is not really a case where one would need a mixture distribution to describe the data; it is also a case where the data cannot usefully inform a model about the mixture components.

Ah that is a very useful point. Thank you for that.

Do you know whether there is a way to handle that well? If there is really only one underlying distribution, then ideally I would just want both components to be estimated the same. In my case, however, I observe a bias in the estimates.

And would you have an idea why the negative-binomial model works with N \pi s (one for each row) but not with a single \pi? I don’t think it’s because the distributions are not separable, as the sampling fails even for very large values of \phi (i.e. variance ≈ mean) and very different mean values.
Thanks a lot!

I’ll answer with another question: have you played with a Poisson mixture model instead of an NB with outrageous shape? When the shape parameter of an NB is that high, the parameter space becomes a desolate flatland (there is no way to distinguish phi = 50 from phi = 100). So why not make it simpler first and see if that works? Maybe it will expose an unexpected error in your code.
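For what it’s worth, a minimal sketch of the marginalized Poisson mixture being suggested might look like this (my own data block and variable names, not code from this thread):

data {
  int<lower=1> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> pi;     // probability of the higher-rate component
  positive_ordered[2] lambda;    // ordered rates to avoid label switching
}
model {
  lambda ~ normal(10, 10);       // weakly informative; adjust to your counts
  for (i in 1:N)
    target += log_mix(pi,
                      poisson_lpmf(y[i] | lambda[2]),
                      poisson_lpmf(y[i] | lambda[1]));
}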

Following your advice I have tried a Poisson mixture model and observed the following:

  • the sampling worked fine, the Poisson rates were reasonably estimated, and the n_eff and R_hat looked fine
  • the \pi was consistently estimated too low (0.63 instead of 0.75, i.e. too many observations were deemed to come from the distribution with the smaller Poisson rate), but the real \pi was still within 1 sd of the estimated value (sd = 0.15)

I have also tried a negative-binomial model where I just hard-coded phi = 100 into the Stan model like this:

contributions[1] = log(pi) + neg_binomial_2_lpmf(y[i,j] | mu[2], 100);
contributions[2] = log(1 - pi) + neg_binomial_2_lpmf(y[i,j] | mu[1], 100);

In my understanding this should be very close to a Poisson model. However, here I also got lots of divergent transitions. For very different \mu s the estimation worked somewhat reasonably, even though the chains looked a bit terrifying. For c(20, 16) the Poisson model worked but the negative-binomial model with a hard-coded phi of 100 failed.
Hmm…

That’s a useful test. I do not know why the NB version with high phi (which indeed should behave similarly to Poisson) is being fickle. I’m guessing that it’s somehow harder for Stan to approximate the HMC dynamics with the NB distribution at high phi, which maybe has to do with the NB (log-)density involving the (log-)gamma function applied to phi (if I’m not mistaken)? I haven’t read the whole thread, but have you tried reducing the stepsize / upping the target acceptance rate? If yes, better to ask someone like @betanalpha who may be better able to help here.


I agree with @LucC – it’s probably the gamma function calculation blowing up (https://en.wikipedia.org/wiki/Negative_binomial_distribution#Extension_to_real-valued_r – r is related to phi here). The gamma functions are numerically finicky, I think.

I’d guess you’re gonna have to roll with the Poisson approximation LucC talked about, or a normal, or something else.
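For reference, in Stan’s mean/dispersion parameterization the pmf that both replies above are pointing at is

p(y \mid \mu, \phi) = \frac{\Gamma(y + \phi)}{\Gamma(\phi)\, y!} \left(\frac{\mu}{\mu + \phi}\right)^{y} \left(\frac{\phi}{\mu + \phi}\right)^{\phi},

so the log-pmf contains \log\Gamma(y + \phi) - \log\Gamma(\phi). For \phi much larger than the observed counts this difference, and its derivative with respect to \phi, barely change as \phi moves, which fits the “desolate flatland” picture above; whether that near-flatness or outright numerical trouble in the gamma-function terms is what triggers the divergences here is a guess.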

By this you mean specifying something like this?

control = list(adapt_delta = 0.99)

This unfortunately did not alleviate the problem… :(

Ah, OK, I see, even though I am not fully grasping why that is different for one \pi instead of many \pi s.

Is there a way to reintroduce the overdispersion? What I assume should happen if I take a Poisson instead of a negative-binomial mixture model is that my posterior membership probabilities should be estimated accurately, but that the intervals for the posterior membership probabilities are too narrow. Is that what you would expect as well?

@LucC do you have an intuition why that might be the case? Or whether it is something to worry about at all?

Thank you very much!
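As an aside on the overdispersion question above (this is a textbook fact, not something discussed in the thread): the negative binomial is exactly a gamma-Poisson mixture,

y \mid u \sim \text{Poisson}(\mu u), \quad u \sim \text{Gamma}(\phi, \phi) \;\Rightarrow\; y \sim \text{NegBinomial2}(\mu, \phi),

so in principle a Poisson likelihood with a per-observation gamma-distributed rate multiplier reintroduces the overdispersion; whether that parameterization samples any better here than the direct NB one is untested.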

I assume that with lots of \pi s the parameters don’t land that far from their individual means, and maybe the numerics end up easier. But I agree it’s strange, and it’s not obvious that this explanation is correct.

@LucC might know; I do not. Give this a try: generate data for various dispersions and see what happens, or maybe put together a posterior predictive check to diagnose whether you’ve underestimated the dispersion or whatnot.
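A hedged sketch of that posterior predictive check, assuming a fitted two-component Poisson mixture whose draws are named pi, lambda1 and lambda2 (hypothetical names, not from the code in this thread), with N observations in the vector y:

draws <- rstan::extract(fit)            # 'fit' is the assumed stanfit object
S <- length(draws$pi)                   # number of posterior draws
y_rep <- sapply(seq_len(S), function(s) {
  z <- rbinom(N, 1, draws$pi[s])        # simulate component membership
  rpois(N, ifelse(z == 1, draws$lambda2[s], draws$lambda1[s]))
})
## variance-to-mean ratio of the replicated data sets vs. the observed data
quantile(apply(y_rep, 2, var) / apply(y_rep, 2, mean), c(0.05, 0.5, 0.95))
var(y) / mean(y)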


Again, I apologise for not having read all the contents of the thread, but could it be that you didn’t specify sensible priors for the means of the two mixture components (as was the issue with your normal toy model)? If one or both of the mixture components are weakly identified in terms of the mean, and the estimated mean of the higher mixture component is too high (for instance), that could decrease the estimated probability of membership for that component.


The probabilities themselves would not be estimated correctly by a Poisson mixture if the true data-generating process were NB (with a reasonable value of phi). The tails of the mixture components largely determine the group-membership probabilities, and those tails can be very different for Poisson and NB distributions.
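To see how different those tails can be, a quick illustration with made-up numbers (not data from this thread):

mu <- 10
dpois(40, lambda = mu)             # ~ 5e-13: essentially impossible under a Poisson with mean 10
dnbinom(40, mu = mu, size = 2)     # ~ 8e-4: quite plausible under an NB with the same mean and phi = 2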


You are right. I have tried a Poisson mixture model and the means were estimated incorrectly, so I assume the posterior membership probabilities are also wrong. This is probably not an avenue worth pursuing.

I have tried reparameterizing the model. I am not sure why, but it works much better now.
I did the following: I generated data that looks like this:

| gene     | cond A | cond A | cond A | cond B | cond B | cond B |
|----------|--------|--------|--------|--------|--------|--------|
| gene1    | 4      | 18     | 9      | 63     | 87     | 103    |
| gene2    | 79     | 163    | 144    | 74     | 64     | 47     |
| ...      | .      | .      | .      | .      | .      | .      |
| gene60   | 15     | 18     | 32     | 4      | 1      | 0      |

In the transformed data block I cut the matrix in half and stored condition A (called input) and condition B (called ip) in two separate matrices with equal dimensions.
I then estimated a (row-wise) model in which the observations of condition A follow a negative-binomial distribution with one common \mu and one common \phi. The rows of condition B are distributed with mean \mu \cdot \delta0 and dispersion \phi, or with mean \mu \cdot \delta1 and dispersion \phi, where 0 < \delta0 < 1 < \delta1.
For this model, the estimation works pretty well.

I successively added more parameters and in the end allowed every row to have a different \mu_i, \delta0_i, \delta1_i, \phi_i (they have a common prior, however).
The estimation also seemed reasonable, with one exception: the \delta0_i were all estimated to be around 0.55 - 0.75 and the \delta1_i were all estimated to be between 6 and 8, more or less regardless of the posterior membership probability. I assume this has to do with the fact that I have to estimate one \delta0_i and one \delta1_i for each row even though in reality only one of them is observed. However, I am not sure why they all default to the values they do (i.e. 6 - 8), as the prior for \delta1 has its maximum at 4 (even though it is rather flat), and I would assume the estimates should center there if there is little information. (Edit: I think this has to do with the fact that I truncate the distribution, so the actual prior mean is higher than 4 once everything below 1 is cut off.)
Would you have an idea how to formulate the model so as to avoid this problem?

edit:

Maybe the above is not actually a problem, and the better question is: can I just compute my posterior estimate for \delta_i as the posterior-probability-weighted average \pi_i \cdot \delta1_i + (1-\pi_i) \cdot \delta0_i? (See the sketch after the full code below.)

Thank you very much!

full code:

generatemixedcounts <- function(n = 6, N = 50, phi = 2, mu = 10, deltaeffect = 4, 
                                meth_prob = 0.25){
  ## baseline NB counts for all N genes and n samples
  m <- matrix(NA, nrow = N, ncol = n)
  m[] <- rnbinom(n = n*N, mu = mu, size = phi)

  ## second half of the samples belongs to condition B ("ip")
  conditions <- rep(c(FALSE, TRUE), each = n/2)

  ## randomly pick the rows that get the delta effect in condition B
  select <- sample(c(TRUE, FALSE), size = N, replace = TRUE,
                   prob = c(meth_prob, 1 - meth_prob))
  m[select, conditions] <- rnbinom(n = sum(select) * sum(conditions), 
                                   mu = mu * deltaeffect, size = phi)
  
  return(list(m = m, select = select, conditions = conditions))
}


toy <- generatemixedcounts(N = 60, mu = 10, deltaeffect = 4, n = 6, phi = 2)

standata <- list(y = toy$m, N = nrow(toy$m), n = ncol(toy$m), 
                 ip = toy$conditions)
truepi <- sum(toy$select)/60; truepi


write("
data {
  int<lower=0> N;
  int<lower=0> n;
  int<lower=0> y[N,n];
  int<lower = 0, upper = 1> ip[n];
}
transformed data {
  // idea is to make one matrix for every condition
  int m = n/2; //number of samples per condition input/ip
  int k = 1; int l = 1;
  int<lower=0> y_input[N,m];
  int<lower=0> y_ip[N,m];
  for (j in 1:n){
    if (ip[j] == 0){
      for (i in 1:N){
        y_input[i,k] = y[i,j];
      }
      k+=1;
    } else {
      for (i in 1:N){
        y_ip[i,l] = y[i,j];
      }
      l+=1;
    }
  }
}

parameters {
  real <lower=0, upper=1> pi;
  real <lower = 0> mu;
  real<lower=0> phi;
  real <lower = 0, upper = 1> delta0;
  real <lower = 1> delta1;
}

model {
  vector[2] contributions;

  for (j in 1:m){
    for (i in 1:N) {
      //distribution of y_input
      target += neg_binomial_2_lpmf(y_input[i,j] | mu, phi);

      //distribution of y_ip
      contributions[1] = log(pi) + neg_binomial_2_lpmf(y_ip[i,j] | mu * delta1, phi);
      contributions[2] = log(1 - pi) + neg_binomial_2_lpmf(y_ip[i,j] | mu * delta0, phi);
      target += log_sum_exp(contributions);
    }
  }
  delta1 ~ normal(4, 10); 
  delta0 ~ normal(1, 1);
  mu ~ normal(10,2);
  phi ~ normal(2, 0.8);
}

// posterior membership probabilities
generated quantities {
  real pi_post[N];
  for (i in 1:N) {
    pi_post[i] = exp(neg_binomial_2_lpmf(y_ip[i,] | mu * delta1, phi)) * pi
                 / (exp(neg_binomial_2_lpmf(y_ip[i,] | mu * delta1, phi)) * pi
                    + exp(neg_binomial_2_lpmf(y_ip[i,] | mu * delta0, phi)) * (1 - pi));
  }
}

", file = "./mixturenegbin_delta_reduced.stan")


write("
data {
  int<lower=0> N;
  int<lower=0> n;
  int<lower=0> y[N,n];
  int<lower = 0, upper = 1> ip[n];
}
transformed data {
  int m = n/2; //number of samples per condition input/ip
  int k = 1; int l = 1;
  int<lower=0> y_input[N,m];
  int<lower=0> y_ip[N,m];
  for (j in 1:n){
    if (ip[j] == 0){
      for (i in 1:N){
        y_input[i,k] = y[i,j];
      }
      k+=1;
    } else {
      for (i in 1:N){
        y_ip[i,l] = y[i,j];
      }
      l+=1;
    }
  }
}

parameters {
  real <lower=0, upper=1> pi;
  real <lower = 0> mu[N];
  real<lower=0> phi[N];
  real <lower = 0, upper = 1> delta0[N];
  real <lower = 1> delta1[N];
}

model {
  vector[2] contributions;

  for (j in 1:m){
    for (i in 1:N) {
      //distribution of y_input
      target += neg_binomial_2_lpmf(y_input[i,j] | mu[i], phi[i]);

      //distribution of y_ip
      contributions[1] = log(pi) + neg_binomial_2_lpmf(y_ip[i,j] | mu[i] * delta1[i], phi[i]);
      contributions[2] = log(1 - pi) + neg_binomial_2_lpmf(y_ip[i,j] | mu[i] * delta0[i], phi[i]);
      target += log_sum_exp(contributions);
    }
  }
  delta1 ~ normal(4, 10); 
  delta0 ~ normal(1, 1);
  mu ~ normal(10,2);
  phi ~ normal(2, 0.8);
}

// posterior membership probabilities
generated quantities {
  real pi_post[N];
  for (i in 1:N) {
    pi_post[i] = exp(neg_binomial_2_lpmf(y_ip[i,] | mu[i] * delta1[i], phi[i])) * pi
                 / (exp(neg_binomial_2_lpmf(y_ip[i,] | mu[i] * delta1[i], phi[i])) * pi
                    + exp(neg_binomial_2_lpmf(y_ip[i,] | mu[i] * delta0[i], phi[i])) * (1 - pi));
  }
}

", file = "./mixturenegbin_delta_full.stan")


## Bayesian Estimation
stanfit_onepi_reduced <- rstan::stan(file = "./mixturenegbin_delta_reduced.stan",
    data = standata,
    iter = 1000, warmup = 500, thin = 1, control = list(adapt_delta = 0.99))
stanfit_onepi_reduced

stanfit_onepi_full <- rstan::stan(file = "./mixturenegbin_delta_full.stan",
    data = standata,
    iter = 1000, warmup = 500, thin = 1, control = list(adapt_delta = 0.99))
stanfit_onepi_full
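Regarding the question in the edit above, a hedged sketch of the posterior-probability-weighted \delta_i, using the draws from the full model (parameter names as in the posted Stan code; whether this is the right summary is exactly the open question):

draws <- rstan::extract(stanfit_onepi_full)
## iterations x N matrices; weight each row's delta1/delta0 by its membership probability
delta_post <- draws$pi_post * draws$delta1 + (1 - draws$pi_post) * draws$delta0
colMeans(delta_post)   # one posterior-mean delta per row/gene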

It’s been a long time (sorry!) so I might be off base, but just looking at this I agree I don’t see why the delta thing you’re describing is a problem.

The standard deviations on those priors are huge, too. The estimates of \delta_1 being away from 4 don’t seem weird at all.
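For instance, the prior actually implied for \delta_1 by the constraint real<lower = 1> delta1 together with delta1 ~ normal(4, 10) is a normal(4, 10) truncated below at 1, whose mean is nowhere near 4:

a <- (1 - 4) / 10                     # standardized truncation point
4 + 10 * dnorm(a) / (1 - pnorm(a))    # mean of the truncated prior, roughly 10.2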

Did you figure this out in the end? If not, what is \delta_i here?