Sum of binomials when only the sum is observed

I’m trying to model a process in which I observe a sum of two binomials:

y~binomial(alpha*N,p_1)+binomial((1-alpha)*N,p_2)


Such that y is observed, N is known and I’m trying to infer alpha,p_1 and p_2 (I am aware that this isn’t “legitimate” Stan code, but I thought it’s a good way to describe the assumed data-generating process). I have informative priors for p_1 and p_2 (they should differ by an order of magnitude), and I hope this would help to deal with identifiability issues.

Is there any way to code something similar to this in Stan? Currently I’m using a poisson approximation to each, and then the sum is just another poisson… but I’m not sure this is a very good approximation here (both because N is not very large, and because of possible correlations between them), and want to see if there’s “something better”.

1 Like

Ideally you should find an analytical solution/approximation, of the sum of two random variables

For example, if p is constant you have

y~binomial(alpha*N + (1-alpha)*N , p)


if p_1 != p_2 then you do the same exercise but the math will be more complicated I guess. You can look at the method of moments. @martinmodrak has done it for negative binomials.

Thanks for the reply. Since I assume the p-s are quite different, I couldn’t find any simple analytical solution. The most relevant approximation I found was this R package:
https://cran.r-project.org/web/packages/sinib/index.html
Which is based on this paper:

Not really sure what to do with it, though… I assumed integrating an external package within the Stan code isn’t possible.

Well, ideally you should find a easy analytical approximation that needs 2 lines of code. Again look into approximation by method of moments. Otherwise, convert the R code into stan code and use it as function within your model.

Just to be clear: The data generating process you assume is that:

1. N is chosen and known to you
2. N_1 is chosen uniformly at random from \{1,2,...,N\}, you don’t observe N_1
3. p_1 and p_2 are chosen randomly and you don’t observe them
4. Y_1 \sim Binomial(N_1,p_1) and Y_2 \sim Binomial(N - N_1, p_2) is drawn and you don’t observe them
5. you observe Y = Y_1 + Y_2 and need to infer N_1,p_1 and p_2

If this is the case, N_1 is a discrete parameter with a possibly large domain and those tend to be trouble. Approximation strategies would probably differ based on how large is N (for small N exact solution might be feasible) and whether you can constrain p_1 and p_2 to not be extreme (close to 0 or 1).

I’ve recently implemented the saddlepoint approximation (derived in the paper you mention) for a different density and it totally can be done in Stan using the algebra_solver, but it requires some (not very deep) understanding of the math and a bit of fiddling with Stan. It also tends to be slow. Finally, it doesn’t absolve you of the need to treat N_1 as a discrete parameter. Knowing N_1 would make the problem way easier.

However, I struggle to find a natural way you could observe such data - could you elaborate more on what exactly does your data represent and what inferences are you trying to make? Maybe with more eyes on this we can find some additional structure that would let you use easier response distribution?

I would be surprised if this worked particularly well. I guess that among simple approximations, you’d be better of with normal distribution.

I actually doubt method of moments would be useful here - at least not unless you can assume normal approximation or something similarly strong. For method of moments you need to have a good candidate family of distributions and I struggle to think of a good candidate family that would work with small N, p_1 \rightarrow 1 and p_2 \simeq 0.5.

1 Like

That’s pretty accurate, but I wouldn’t say N_1 is chosen uniformly at random.

Sure. I have N subjects, and I observed Y events. My hypothesis is that my N subjects are actually two different groups combined N=\alpha N+(1-\alpha)N, and that each of these groups has its own event probability p_1,p_2. I’m interested in inferring both the “ratio factor” \alpha and the probabilities themselves. Does that makes sense?

That is however IMHO a different data generating process, which would go as follows:

1. For each subject, choose with probability \alpha whether they are part of group 1 or group 2
2. Each subject than has a probability p_1 or p_2 (depending on the group) to suffer an event
3. You observe the total number of subjects and total number of events

If this is the case, then for each individual subject, the probability of observing an event is q = \alpha p_1 + (1 - \alpha)p_2 and therefore the total number of events is Y \sim Binomial(N, q) and you have no way to distinguish the individual contributions to q.

The important difference is that in the DGP from my previous posts, the probabilities that subjects are from group 1 are not independent (N_1 determines maximum number of subjects per each group)

I see. I’m not 100% sure I understand all the subtleties here, but I don’t think sampling from Binomial(N,q) in the right way to go for me, as it somehow ignores the fact that the size of one group bounds the possible sampling values from the other group… is this what you meant by:

?

I fear Martin is on the money that the sum of events does not give you enough information to recover the separate parameters. The DGP for the events sounds like a mixture of two bernouilli distributions with probabilities p1 and p2. If you would have access to the individual events, you could try to model them as a mixture model. Intuitively, if p_1 = .99 and p_2 = .001, an observation with an event is likely from group 1 and an observation without an event is likely from group 2.

Unfortunately, you only have access to the sum of events which is the mean of the mixture times N. If you have one observation from group 1 and one observation from group 2, your data point is 1 (with N = 2). Such an observation is (almost) as likely to come from a binomial with q = .5 and N = 2 (This is Martin’s point).

I thought the mixture of two groups might lead to more variability in the number of events. While this works for a mixture of normals (I think), it does not work for the mixture of bernouilli’s. I suspect because whether you have a single group or a mixture of two groups the outcome is binary either way.

Proof by R

alpha = .5
p = c(.9999, .0001)
N = 15

mixture <- function(){
probs = sample(x = p, size = N, replace = TRUE,
prob = c(alpha, 1 - alpha))
return(sum(rbinom(N, 1, probs)))
}
test1 <- rbinom(1e5, N, alpha * p[1] + (1 - alpha) * p[2])
test2 <- replicate(1e5, mixture())
mean(test1)
mean(test2)
sd(test1)
sd(test2)

> mean(test1)
[1] 7.50716

> mean(test2)
[1] 7.4982

> sd(test1)
[1] 1.935119

> sd(test2)
[1] 1.935101

1 Like

But isn’t this an identifyiability issue, that could possibly be solved by informative enough priors?

The problem is that in this case, the posterior will be dominated by the prior. More precisely the data only tell you about q, so combinations of \alpha, p_1, p_2 that are not consistent with the posterior q will be ruled out, but among the rest, the way q is divided amont p_1 and p_2 will be exactly as specified in the prior distribution.

Will using some hierarchical structure, which “ties” some of the variables to other observed quatities but “breaks” their mixing, help in this case? Or is this simply impossible to begin with?

As an example - is this situation any better (x,M are observed, p_3 is a parameter)?

y~binomial(alpha*N,p_1)+binomial((1-alpha)*N,p_2)
x~binomial(alpha*M,p_3)


IMHO, this could in principle help - if you have other data to constrain the values. Note that however, you need at least two further constraints, as you need to infer 3 variables.

It would IMHO also help, if you could think again about the data-generating process - I’ve shown two variants above, is one a match? Or is there something different going on?

A related point: the construct binomial(alpha*N,p_1) doesn’t IMHO make sense, so you should clarify what do you mean by this. The first parameter to binomial (number of trials) is an integer, so saying N_1 is the number of trials in the first group you either have:

1. \alpha is a discrete parameter that takes values 0,\frac{1}{N}, \frac{2}{N}, ...,1 with some arbitrary distribution, i.e. N_1 is a discrete parameter with integer values and the same distribution as \alpha
2. \alpha is a continous parameter that determines the probability for each trial separately that the trial will belong to the first group, making N_1 binomially distributed

I think that the variant in which N_1 is binomially distributed indeed makes more sense (and is also consistent with Stan’s type requirements for binomials). So, rephrasing your original post, I think this is a good description of the DGP:

1. N is chosen and known.
2. N_1∼Binomial(N,\alpha)
3. p_1 and p_2 are chosen randomly and I don’t observe them.
4. Y_1∼Binomial(N_1,p_1) and Y_2∼Binomial(N−N_1,p_2) are drawn and I don’t observe them.
5. I observe Y=Y_1+Y_2 and I need infer \alpha,p_1,p_2

But in this case N_1∼Binomial(N,\alpha) is the same as choosing for each subject individually to which component they belong, and (as I discussed above) you just get

Y \sim Binomial(N, q) \\ q = \alpha p_1 + (1 - \alpha)p_2

And therefore Y only informs q and not its decomposition.

Hmm… once again I have the feeling that I don’t fully understand the difference between the two DGPs. As we discussed above, I’ll try to use hierarchical relations between different quantities to improve inference.

Thanks!

What @martinmodrak is telling you is that you can’t infer what you want to infer. You can include some priors and get an output consistent with multiple data sources (or in this case data and priors about p_1/etc…). Is that what you want?

One of the pitfalls with Bayesian modeling is that people are sometimes led to believe that it’s possible to infer things from data that you don’t have any information about (see ecology and trying to infer immigration/emigration/death/reproduction from single-population count data). So you should be clear about whether your are ok with producing a picture of your parameters that’s consistent with your data and priors or if you really expect to infer all those unidentified parameters from your data. If it’s the second case then you need more data in addition to the data you have.

This part I got, I just didn’t understand why it’s possible in one DGP and impossible in the other.

Yes, this is what I’m focusing on at the moment…

The simple answer is that “because the math is different”, but that is not very satisfactory :-) My hunch (but can’t prove this) is that N_1 \sim Binomial(...) is actually the only case where the information about the composition is erased.

For example, let’s see what happens when N_1 \sim Uniform(...), we’ll assume that p_1 \leq p_2:

EDIT: (there were bugs in the initial version)

r_choosing_N_first <- function(n_samples, N, p1, p2) {
N1 = sample(0:N, n_samples, replace = TRUE)
Y1 = rbinom(n_samples, N1, p1)
Y2 = rbinom(n_samples, N - N1, p2)
Y1 + Y2
}

hist(r_choosing_N_first(1e5, 10, 0.5, 0.5), breaks = -1:10)
hist(r_choosing_N_first(1e5, 10, 0.3, 0.9), breaks = -1:10)
hist(r_choosing_N_first(1e5, 10, 0.1, 0.9), breaks = -1:10)
hist(r_choosing_N_first(1e5, 10, 0.01, 0.99), breaks = -1:10)


When p_1 and p_2 are equal, the resulting distribution is binomial

But when they are not, the distribution becomes more and more different from binomial (it’s just imposible to have such a widely dispersed binomial):

With extreme disbalance the distribution approaches uniform:

So in this case, you get that the average of the two paremeters influences the mean of the distribution while their difference influences the variance and so both can (at least in principle) be inferred.

1 Like

This is interesting - it means that in the case where N_1 is sampled from a binomial, the parameters are not identifiable, but if it’s sampled from a uniform, they are… Is this related to conjugation in any way? It seems like there’s something unique about the “binomial within binomial” situation, and that other distributions over N_1 will result in (at least some) information about the parameters…