# Estimating parameters of a negative-binomial mixture model

I am trying to fit a mixture model of two negative binomial distributions. I would like to estimate the parameters of those distributions. Also, for every single observation, I want to know from which of the two possible distributions that observation comes.

As far as I can tell the sampling works fine. However I am not able to estimate the \mu parameters of the models correctly (irrespective of the amount of data provided). For comparison I also tried \texttt{optimizing()}, with similar results. \texttt{optimizing()} gave me results like (34, 5), Stan gave (27, 7) instead of (20, 5). Do you have any ideas how to fix this? Thank you.

library(rstan)
options(mc.cores = parallel::detectCores() - 2)
rstan_options(auto_write = TRUE)

generatemixedcounts <- function(n = 6, N = 50, phi = 1, mu1 = 20, mu2 = 5){
## select some of the entries of the matrix
m <- matrix(NA, nrow = N, ncol = n)
m[] <- rnbinom(n = n*N, mu = mu1, size = phi)
select <- sample(c(T,F), size = n * N, replace = T, prob = c(0.25, 0.75))
tmp <- rnorm(n = sum(select), mean = 0, sd = 1)

## assign rnbinom with different mu
m[select] <- rnbinom(n = sum(select), mu = mu2, size = phi)
return(m)
}

toy <- generatemixedcounts(N = 60, mu1 = 20, mu2 = 5)
standata <- list(y = toy, N = nrow(toy), n = ncol(toy))

write("

data {
int<lower=0> N;
int<lower=0> n;
int<lower=0> y[N,n];
}
parameters {
matrix <lower=0, upper=1>[N,n] pi;
ordered[2] mu;
real<lower=0> phi;
}
model {
vector[2] contributions;
for (j in 1:n){
for (i in 1:N) {
contributions[1] = log(pi[i,j]) + neg_binomial_2_lpmf(y[i,j] | mu[2], phi);
contributions[2] = log( (1 - pi[i,j]) ) + neg_binomial_2_lpmf(y[i,j] | mu[1], phi);

target += log_sum_exp(contributions);

}
}
}

", file = "./mixturenegbin.stan")

## MAP Estimation
model <- stan_model(file = "./mixturenegbin.stan")
optimizing(model,
data = standata,
iter = 1000, seed = 1)

## Bayesian Estimation
stanfit <- rstan::stan(file = "./mixturenegbin.stan",
data = standata,
iter = 1000, warmup = 500, thin = 1, seed = 1)

stanfit

library(shinystan)
launch_shinystan(stanfit)


I think since you’re adding a parameter for every bit of data (you have N * n parameters to go with N * n data), I don’t think you will necessarily get nice limiting behavior with bunches and bunches of data.

For N = 6000 I was getting optimization results that were sometimes like c(17, 17) and sometimes like c(6, 37).

I do not think setting seeds is a good idea during model development. It’s too easy to get fixated on the result of one random initialization or whatnot.

I didn’t look at plots, but maybe vary the difficulty of the classification problem (by separating the groups more or less) and see what happens.

Also, is the N/n split a placeholder for a more complicated model? It looks like there’s no significance of that split in the code right now.

1 Like

Thank you very much for your reply and your help.
Omitting the seed is indeed good advice.

Yes the N/n split is a placeholder. In the end I want to analyze the counts by row. following your suggestion I have now tried the following: Instead of one \pi[i,j] per observation I have now one \pi[i] per row (and set n = 3 which is the worst case I would need to analyze). This seems to work a little better, but not perfect. I observed the following:

• for n = 3 and N = 3000 the larger \mu is still consistently overestimated quite a bit (something in the range of [24, 27]).
• the lower \mu tends to be underestimated (mostly in the range of [2, 4.5]).
• setting n = 3 or n = 6 or n = 18 does not seem to make a large difference (I reduced N accordingly). I had the impression that the larger \mu was closer to 20 with large n, but the smaller \mu was more off (something like 2.5).
• reducing the variance of the distributions (i.e. increasing phi) makes the sampler eventually converge to the correct values.
• phi is usually overestimated by a bit (i.e. the variance is underestimated)

What troubles me the most is the consistency of the estimation errors as this seems to indicate a systematic problem I cannot quite grasp.
Would you have any advice on how to work with that? Thank you very much.

The new full code is:

library(rstan)
options(mc.cores = parallel::detectCores() - 2)
rstan_options(auto_write = TRUE)

generatemixedcounts <- function(n = 6, N = 50, phi = 1, mu1 = 20, mu2 = 5){
## select some of the entries of the matrix
m <- matrix(NA, nrow = N, ncol = n)
m[] <- rnbinom(n = n*N, mu = mu1, size = phi)
select <- sample(c(T,F), size = N, replace = T, prob = c(0.25, 0.75))

## assign rnbinom with different mu
m[select,] <- rnbinom(n = sum(select)*n, mu = mu2, size = phi)
return(m)
}

toy <- generatemixedcounts(N = 500, mu1 = 20, mu2 = 5, n = 18, phi = 1)
standata <- list(y = toy, N = nrow(toy), n = ncol(toy))

write("
data {
int<lower=0> N;
int<lower=0> n;
int<lower=0> y[N,n];
}
parameters {
vector <lower=0, upper=1>[N] pi;
positive_ordered[2] mu;
real<lower=0> phi;
}
model {
vector[2] contributions;
for (j in 1:n){
for (i in 1:N) {
contributions[1] = log(pi[i]) + neg_binomial_2_lpmf(y[i,j] | mu[2], phi);
contributions[2] = log( (1 - pi[i]) ) + neg_binomial_2_lpmf(y[i,j] | mu[1], phi);

target += log_sum_exp(contributions);

}
}
}

", file = "./mixturenegbin.stan")

## MAP Estimation
model <- stan_model(file = "./mixturenegbin.stan")
optimizing(model,
data = standata,
iter = 1000)

## Bayesian Estimation
stanfit <- rstan::stan(file = "./mixturenegbin.stan",
data = standata,
iter = 1000, warmup = 500, thin = 1)
stanfit

library(shinystan)
launch_shinystan(stanfit)


Before working on the mixture of two negative binomials I had tried a zero inflated negative binomial model. In this case, the mu was estimated well, even when adding another parameter for every observation. So I assume it works for simpler models, but not this this one…

(Specifically, I had changed the following, compared to the original code in my first post:

in the function:

m[select] <- 0


in the parameters block:

positive_ordered[1] mu;


in the model block:

contributions[1] = log(pi[i,j]) + normal_lpdf(y[i,j] | 0, 0.00001);


)

I just took a look at some histograms of the data. I don’t think there’s any real chance of separating the groups until phi is at least 100 or larger.

Have a look at hist(toy) for various values of phi and try to separate the data into two groups yourself.

1 Like

Following your suggestion I also had another look at the densities.

generatemixedcounts2 <- function(n = 6, N = 50, phi = 1, mu1 = 20, mu2 = 5){
m <- matrix(NA, nrow = N, ncol = n)
m[] <- rnbinom(n = n*N, mu = mu1, size = phi)
select <- sample(c(T,F), size = N, replace = T, prob = c(0.25, 0.75))
m[select,] <- rnbinom(n = sum(select)*n, mu = mu2, size = phi)
return(list(m = m, select = select))
}

toy2 <- generatemixedcounts2(N = 600, mu1 = 20, mu2 = 5, n = 6, phi = 1)

hist(toy2$m, breaks = 100, freq = F) lines(density(toy2$m))
dselect <- density(toy2$m[toy2$select,])
lines(x = dselect$x, y = (dselect$y * 0.25), col = "green")
dnotselect <- density(toy2$m[!toy2$select,])
lines(x = dnotselect$x, y = (dnotselect$y * 0.75), col = "red")

## ML estimates - separation by hand - works well even for phi = 1
vec <- as.vector(toy2$m[toy2$select,])
vec2 <- as.vector(toy2$m[!toy2$select,])

MASS::fitdistr(vec, "Negative Binomial")
MASS::fitdistr(vec2, "Negative Binomial")



Separating the two distributions worked reasonably well already for phi = 8 (visually with a bimodal density as well as with the Stan model above). But I agree that it seems quite difficult to distinguish them for smaller phi. If I separate the data correctly by hand (vec and vec2), the parameters are well estimated even for small phi.

If the data can hardly be separated my intuitive assumption would have been that

• either estimating the parameters for the two distributions works well and it is just the \hat{\pi} that has a high variance or
• the estimated parameters for the two distributions would be drawn towards a common parameter, i.e. something like \hat{\mu}_1 = 9, \hat{\mu}_2 = 14 (just as if I had pooled the data and estimated one distribution).

However I don’t understand why adding a mixture component results in the estimated \mu s being drawn away from each other (e.g. getting an estimated \hat{\mu}_2 = 27 that is much too large and incompatible with either of the two distributions I put in to generate the data).

Thanks a lot for your help!

I agree. We’re not recovering the correct parameters here and that’s no fun.

Have you tried running some fits with either difference \hat{\pi} s? That’s the mixing parameter right? So maybe an even, 50-50 split? Or maybe running the inference you’re currently running but with that parameter fixed to the value used to generate data?

I concur based on gut feeling from working a lot with NB distributions. When mixed, the result can often be approximated well enough with a single NB distribution. Unless the shapes of the two underlying distributions are wildly different.

Also, a NB distribution with shape 100 can be practically considered a poisson distribution in most cases (unless mean is high).

1 Like

I have tried different values for the mixture when generating the data, i.e. tried different values for prob in

select <- sample(c(T,F), size = N, replace = T, prob = c(0.25, 0.75))


This did not seem to make a difference (which is in some way good, I guess, as I don’t know the mixture proportions beforehand).

By that you mean to somehow change the prior in the model? or hard coding the probabilities like this?

contributions[1] = log(0.75) + neg_binomial_2_lpmf(y[i,j] | mu[2], phi);
contributions[2] = log( 0.25 ) + neg_binomial_2_lpmf(y[i,j] | mu[1], phi);


Hard coding the probabilities seems to alleviate the problem to quite some extent. As hard-coding is somewhat equivalent to only having one \pi instead of one \pi per row, this also seems to have to do with the way that the mixture components introduce some bias. I am still puzzled how that happens.

Some of my general confusion with mixture models has to do with the posterior membership probabilities. The reason I introduced one \pi[i] for every row is that I am interested in the posterior probability (distribution) that a certain row comes from distribution 1 or distribution 2. As far as I know this is not equivalent to the posterior distribution of the individual \pi[i] that Stan automatically returns for every parameter. Insofar I am not sure whether I actually need one \pi[i] per row, or how to correctly implement it in Stan.

Thank you very much for your continued support!

#### Correction

Writing this I have discovered a mistake in the original function that generated the data and corrected it in the previous posts. Instead of

m[select,] <- rnbinom(n = sum(select), mu = mu2, size = phi)


it must be

m[select,] <- rnbinom(n = sum(select)*n, mu = mu2, size = phi)


But I think that doesn’t change the inference much.

Oh, this must be it. So we’re generating the data with one \pi and then fitting a bunch. I assume if you only fit one it works? That’d make me feel a lot better at least haha. Good catch.

So I think the thing you want to compute is group membership, which we’re not sampling here.

So for every output y_n we have a distribution on group membership g_n. So things can either be group 1 or 2, so g_n \in [1, 2].

So the thing we want is g_n. If you pretend you’re gonna write a Gibbs sampler for g_n I think you can derive the equations you want. So we write down the distribution of an individual g_n assuming everything else in the problem is fixed and get:

p(g_n | Y, \mu, \pi) \propto p(g_n, Y, \mu, \pi)

The thing above is true cause we’ve assumed Y and \theta are fixed.

p(g_n | Y, \mu, \pi) \propto p(Y | \mu, g_n) p(g_n | \pi) p(\mu, \pi)

The thing above this is true cause we can break out things into likelihood/prior

p(g_n | Y, \mu, \pi) \propto p(Y_n | \mu, g_n) p(g_n | \pi) p(\mu, \pi)

And this is true cause the parts of the likelihood that depend on Y_m, m \ne n are not a function of g_n (so they’ll cancel out in normalization).

And since g_n only takes two values you can just compute p(g_n = 1 | Y, \mu, \pi) and p(g_n = 2 | Y, \mu, \pi) and normalize that distribution and output the two probabilities.

You’d do this in your generated quantities, so this will have the effect of integrating over all the different \theta s.

I doubt I said that very clearly, so lemme know if it doesn’t make sense.

In the final problem, each row is multiple samples from the same distribution? So like a group is assigned for each row, not each row and column?

Edit: Edited to fix representation of likelihoods
Edited again: Fixing derivation mistakes

1 Like

Thanks a lot for your help, you are making me very happy. I hope I’m getting there, but I am not yet quite able to code that correctly in Stan.

From what you wrote together with what I have read in the Stan user guide (https://mc-stan.org/docs/2_19/stan-users-guide/summing-out-the-responsibility-parameter.html) I have tried to come up with this:

P(g_n | y_n, \mu_1, \mu_2, phi, \pi) = \frac{P(y_n | g_n = 1, \mu_1, phi) \cdot p(g_n = 1 | \pi)}{P(y_n | g_n = 1, \mu_1, phi) \cdot \pi + P(y_n | g_n = 2, \mu_2, phi) \cdot (1 - \pi)}

(with p(g_n = 1 | \pi) = \pi and p(g_n = 2 | \pi) = 1 - \pi
and coded this as

generated quantities {
real pi_post[N];
for (i in 1:N) {
pi_post[i] = neg_binomial_2_lpmf(y[i,] | mu[2], phi) * pi /
(neg_binomial_2_lpmf(y[i,] | mu[2], phi) * pi +
neg_binomial_2_lpmf(y[i,] | mu[1], phi) * (1 - pi));
}
}


However I don’t think the values I get for pi_post are correct. How can I correctly implement your suggestions in the generated quantities block?

### Some more background information

To explain a bit what I am actually trying to achieve: The application comes from biology. Samples are run through a sequencing-machine that detects gene sequences and counts how often a particular sequence is found by the machine.
In the end you get a count matrix with rows corresponding to genes and columns corresponding to replicates different conditions. Something like this:

| gene     | cond A | cond A | cond A | cond B | cond B | cond B |
|----------|--------|--------|--------|--------|--------|--------|
| gene1    | 4      | 18     | 9      | 63     | 87     | 103    |
| gene2    | 79     | 163    | 144    | 74     | 64     | 47     |
| ...      | .      | .      | .      | .      | .      | .      |
| gene1000 | 15     | 18     | 32     | 4      | 1      | 0      |


for every row I have to fit a different \mu and phi. Within one row the phi is assumed to be constant, but the \mu can vary per condition. I want to express this in terms of a parameter \delta_i = \frac{\mu_{2i}}{\mu_{1i}}. This parameter \delta_i can then come either from a distribution that is restricted to \delta < 1 or one restricted to \delta > 1. In the end I am interested in the probability that \delta_i > 1, i.e. the probability that \delta_i comes from the second distribution.

The reason I am taking this mixture model approach instead of just computing different \mu s and look at the posterior distribution of the fraction \delta is that there is a third possibility: Especially in condition B there can be lots of zeros with entire lines frequently just having zeros. As I have two \mu s and one phi per row to estimate I need to share information between the rows through priors, which would be messed up by the zero inflation. I therefore thought about having a three component mixture model.

• with probability \pi the observations y_i in condition B are y_i \sim NB(\mu_{1i} * \delta_i, phi_i) with \delta >1
• with probability (1- \pi) * z_i the observations y_i in condition B are y_i \sim NB(\mu_{1i} * \delta_i, phi_i) with \delta < 1
• with probability (1- \pi) * (1- z_i) the observations y_i in condition B come from a distribution with a point mass in zero.

In the end I am mainly interested in the posterior values for \pi

Remember neg_binomial_2_lpmf(y[i,] | mu[2], phi) is a log pmf. Pass those through exp to get the actual probabilities.

Also, in the previous post I shoulda done what you did broken out the likelihood into p(Y | g_n, \mu) p(g_n | \pi) p(\mu, \pi). I’m gonna go steal that and edit the post if anyone else comes along. Props for fixing that!

Since condition A and condition B are so different, does it not make sense to model them separately? Do you know what measurements are condition A and which ones are condition B?

#### membership probabilities

So in principle this should give me the posterior membership probabilities, right?

generated quantities {
real pi_post[N];
for (i in 1:N) {
pi_post[i] = exp(neg_binomial_2_lpmf(y[i,] | mu[2], phi)) * pi / (exp(neg_binomial_2_lpmf(y[i,] | mu[2], phi)) * pi + exp(neg_binomial_2_lpmf(y[i,] | mu[1], phi)) * (1 - pi));
}
}


In your derivation, is \theta = (\mu, \phi)? In contrast to what you have written, I have no term similar to p(\mu, \theta). Is my solution still correct?

#### sampling

The sampling with just one \pi seems to be more problematic than I thought previously. I am not sure whether I changed anything, but now I regularly get results like (20,8), (19, 11) for the \mu s (instead of (20,5)) if I fix \pi = 0.75. So now the smaller \mu seems affected.

If I don’t fix \pi but have it estimated as a parameter, than I get tons of divergent transitions. Even in the scenario with the fixed \pi I have the feeling that the effective sample size is rather low for all parameters.

Here again the full code:

generatemixedcounts <- function(n = 6, N = 50, phi = 1, mu1 = 20, mu2 = 5){
## select some of the entries of the matrix
m <- matrix(NA, nrow = N, ncol = n)
m[] <- rnbinom(n = n*N, mu = mu1, size = phi)
select <- sample(c(T,F), size = N, replace = T, prob = c(0.25, 0.75))

## assign rnbinom with different mu
m[select, ] <- rnbinom(n = sum(select) * n, mu = mu2, size = phi)
return(m)
}

toy <- generatemixedcounts(N = 60, mu1 = 20, mu2 = 5, n = 6, phi = 1)
standata <- list(y = toy, N = nrow(toy), n = ncol(toy))

write("
data {
int<lower=0> N;
int<lower=0> n;
int<lower=0> y[N,n];
}
parameters {
real <lower=0, upper=1> pi;
positive_ordered[2] mu;
real<lower=0> phi;
}
model {
// real pi = 0.75;
vector[2] contributions;
for (j in 1:n){
for (i in 1:N) {
contributions[1] = log(pi) + neg_binomial_2_lpmf(y[i,j] | mu[2], phi);
contributions[2] = log(1 - pi) + neg_binomial_2_lpmf(y[i,j] | mu[1], phi);

target += log_sum_exp(contributions);

}
}
}

// generated quantities {
//     real pi = 0.25;
//     real pi_post[N];
//     for (i in 1:N) {
//         pi_post[i] = exp(neg_binomial_2_lpmf(y[i,] | mu[2], phi)) * pi / (exp(neg_binomial_2_lpmf(y[i,] | mu[2], phi)) * pi + exp(neg_binomial_2_lpmf(y[i,] | mu[1], phi)) * (1 - pi));
//     }
// }

", file = "./mixturenegbin.stan")

stanfit <- rstan::stan(file = "./mixturenegbin.stan",
data = standata,
iter = 1000, warmup = 500, thin = 1)
stanfit


Thanks a lot!

Yes I know which ones are A and B. Since the \delta can vary freely I essentially estimate the means separately. I could also estimate different size parameters \phi but in biology they mostly assume the \phi to be constant.

Yeah the generated quantities looks correct. Yeah I meant \theta = (\mu, \phi) (all the parameters). There were typos in my derivation, fixing them lol. Thanks.

That is interesting. Does control(adapt_delta = 0.99) do anything for the divergences?

Awesome, thanks a lot for that!

Unfortunately the chains get stuck and don’t really mix at all. As of now I am not experienced enough with sampler diagnostics to actually find out where the problem comes from, but I am trying to read up :)

Lolol, oh man, I’m nervous – I feel like we’re about to learn something.

What you’re doing looks right to me, but when the sampler fails it’s usually not for no reason. The thing I’d try is smaller data and fit normals. If that works, crank up the data and see if anything funny happens.

If that works, switch back to small data and drop in the negative binomials.

oh yes I have that feeling, too… :-D

I have tried out fitting normals and made the following observations:

• fitting normals that are very different works well in terms of estimating the \mu s and the standard deviation.
• sampling was much much faster than with the negative binomial models
• if the distributions are too similar, i.e. \mu_1 = 20, \mu_2 = 19 or \mu_2 = 18, then the lower \mu is estimated to be in between both values, the higher \mu gets a value like 1e+153. I assume this is due to the fact that I have specified an ordered vector \mu and the lower \mu pushes the upper one away. But this is just a guess. Is there another way to avoid label switching that I could try instead of an ordered vector \mu to test my hypothesis?
• even if the distributions are quite different I frequently get messages like “The following variables have undefined values: pi_post[1]…” for my pi_post. Some of the pi_post have values for n_eff, R_hat, s.e. mean. But mostly those values are NaN. All pi_post mean values are either 1 or 0 (which I assume is correct as the distributions are very very different).

Here is the full code:

library(rstan)
options(mc.cores = parallel::detectCores() - 2)
rstan_options(auto_write = TRUE)

generatemixedcounts <- function(n = 6, N = 50, phi = 1, mu1 = 20, mu2 = 5, normal = F){
m <- matrix(NA, nrow = N, ncol = n)

## select some of the entries of the matrix
select <- sample(c(T,F), size = N, replace = T, prob = c(0.25, 0.75))

if (normal == T){
m[] <- rnorm(n = n*N, mean = mu1, sd = phi)
m[select, ] <- rnorm(n = n*sum(select), mean = mu2, sd = phi)
} else {
m[] <- rnbinom(n = n*N, mu = mu1, size = phi)
m[select, ] <- rnbinom(n = sum(select) * n, mu = mu2, size = phi)
}
return(m)
}

toy <- generatemixedcounts(N = 60, mu1 = 20, mu2 = 8, n = 6, phi = 3, normal = T)
standata <- list(y = toy, N = nrow(toy), n = ncol(toy))

write("
data {
int<lower=0> N;
int<lower=0> n;
real y[N,n];
}
parameters {
real <lower=0, upper=1> pi;
positive_ordered[2] mu;
real<lower=0> phi;
}
model {
vector[2] contributions;
for (j in 1:n){
for (i in 1:N) {
contributions[1] = log(pi) + normal_lpdf(y[i,j] | mu[2], phi);
contributions[2] = log(1 - pi) + normal_lpdf(y[i,j] | mu[1], phi);

target += log_sum_exp(contributions);

}
}
}

// posterior membership probabilities
generated quantities {
real pi_post[N];
for (i in 1:N) {
pi_post[i] = exp(normal_lpdf(y[i,] | mu[2], phi)) * pi / (exp(normal_lpdf(y[i,] | mu[2], phi)) * pi + exp(normal_lpdf(y[i,] | mu[1], phi)) * (1 - pi));
}
}

", file = "./mixturenorm_onepi.stan")

## MAP Estimation
# model <- stan_model(file = "./mixturenorm_onepi.stan")
# optimizing(model, data = standata, iter = 1000)

## Bayesian Estimation
stanfit_onepi <- rstan::stan(file = "./mixturenorm_onepi.stan",
data = standata,
iter = 1000, warmup = 500, thin = 1)
stanfit_onepi

library(shinystan)
launch_shinystan(stanfit_onepi)


Thank you very much!! :-)

you mean simply running the model for e.g. 20 observations? trying N = 3 and n = 6 did not make a difference :(

Trying the same for negative binomials failed unfortunately. With only one \pi, transitions diverge even with \phi = 100.

The estimate for the higher mu in your ordered vector is running off to infinity because you did not specify a prior for the (components of the) ordered vector mu, which implies a uniform prior. I bet you get very different model behaviour when using proper priors for mu (although it will still be hard to estimate a difference bnyween the means if they are close).