Continuous Binomial: further questions

What follows isn’t exactly a question so much as a request for help understanding various approaches to modeling binomial-like things. I apologize if starting a new thread is not appropriate. I’m trying to synthesize the content of a few threads and it seemed simpler to start a new one.

I recently asked a question about a version of the “continuous” binomial, in particular about having an RNG for a density which, for real N, k, and p, is proportional to p^k(1-p)^{N-k}.

@martinmodrak and @simonbrauer were very helpful and I have that all working, as far as it goes.

But I realize that I’m not sure that the proposed density, \frac{\Gamma(N+1)}{\Gamma(k+1)\Gamma(N-k+1)}p^k(1-p)^{N-k} (for possibly non-integral N and k), is actually a density, much less a normalized one. And if it is, I’m not sure what it means, as in what data-generating process it might represent. There is an oft-cited paper with a suggested continuous generalization of the binomial, and that does not appear to be the same as what I am trying.
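For concreteness, here is roughly what I mean, written as a user-defined Stan function (a minimal sketch; the function name is mine and, per my question, it carries no normalizing term):

```stan
functions {
  // Log of Gamma(N+1) / (Gamma(k+1) * Gamma(N-k+1)) * p^k * (1-p)^(N-k)
  // for real-valued N and k with 0 <= k <= N and 0 < p < 1.
  // Possibly not a normalized density; that is exactly my question.
  real real_binomial_lpdf(real k, real N, real p) {
    return lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
           + k * log(p) + (N - k) * log1m(p);
  }
}
```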

In another thread here, @betanalpha says that the distribution proposed in that paper is difficult to implement/use in Stan.

My situation seemed straightforward at first. I have survey data with demographic information and a binary outcome (voter turnout, in my case) which I want to aggregate somewhat (collapsing some demographic categories, etc.) before modeling the binary outcome as predicted by the various categories. Following this Ghitza & Gelman paper, I use the survey weights to derive new counts of “trials” and “successes” for the aggregate cells. The paper suggests (on page 764, the 4th page of the PDF) using something like the density I wrote above: “The resulting n^*_j, y^∗_j will not in general be integers, but we handle this by simply using the binomial likelihood function with non-integer data, which works fine in practice (and is in fact simply the weighted log-likelihood approach to fitting generalized linear models with weighted data). This falls in the category of quasi-likelihood methods (Wedderburn 1974).”
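Concretely, as I understand the suggestion, the quasi-likelihood is just the binomial log likelihood evaluated at the non-integer counts. A minimal sketch of the model block (variable names are mine; the binomial-coefficient term is constant given the data, so it can be dropped):

```stan
model {
  // p[j] is the modeled turnout probability for aggregated cell j;
  // n_star[j] and y_star[j] are the (non-integer) weighted trial and
  // success counts derived from the survey weights.
  for (j in 1:J) {
    target += y_star[j] * log(p[j]) + (n_star[j] - y_star[j]) * log1m(p[j]);
  }
}
```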

What does “work fine in practice” mean here?

For instance, though the model looks fine in various ways when I use this (no divergences, no high Pareto k values from loo), the PPC shows that the fit is sort of interpolating between all the 0 and 1 counts in an unsurprising but odd way:

and the ppc_loo_pit_qq plot is, perhaps predictably, odd:

Another possibility which is sometimes suggested is to model the proportion of successes to trials via a beta distribution. Stan’s “beta_proportion” functions are already set up to handle this case, where the proportion and variance (here I am using a binomial-like variance) are known. So I’ve implemented this model as well; a sketch of the likelihood is below. The trouble for me there, I think, is that I have many, many cells with 1 trial and thus either 0 or 1 success. This leads to a posterior predictive check which looks like:

and the loo_pit_qq chart is similar to the binomial-like one, though perhaps even less promising.
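For reference, the core of my beta_proportion model is roughly the following sketch (names are mine; kappa[j] is the precision I derive from the binomial-like variance):

```stan
model {
  // theta[j] = y_star[j] / n_star[j] is the observed proportion in cell j,
  // mu[j] the modeled mean and kappa[j] a precision. Note that theta must
  // lie strictly in (0, 1), which is exactly where the cells with a single
  // trial (and hence 0 or 1 successes) cause trouble.
  for (j in 1:J) {
    theta[j] ~ beta_proportion(mu[j], kappa[j]);
  }
}
```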

The two models yield somewhat different results in terms of the alphas for the various categories, and I’m a bit lost as to how I should choose between them, or whether I should approach this some other way entirely, one which might be an altogether better match for the data.

Any thoughts/suggestions would be much appreciated!


A couple observations that may explain some of the issues:

It is definitely not normalized: in a quick numerical integration check with Wolfram, the integral over the domain goes to 0 as p \to 0 or as p \to 1, while it is close to 1 when p = \frac{1}{2}.
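Heuristically, you can see the p \to 0 behaviour directly: near k = 0 the Gamma-function coefficient and the (1-p)^{N-k} factor are both roughly 1, so the integral behaves like \int_0^\infty e^{k \log p}\,dk = -1/\log p, which goes to 0 as p \to 0 (and symmetrically for p \to 1).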

You are using continuous predictions and comparing them to discrete observations; this is bound to produce artifacts (e.g., the spiked density). I think you want to round/bin all the values and do a discrete PPC (e.g., ppc_bar).

I am very slightly less skeptical than Mike about this: the PDF from the paper can be evaluated with a single call to the numerical integrator (the incomplete beta function is defined as an integral, and for well-behaved functions the derivative of the integral is the integral of the derivative). This would be computationally expensive, but the integrand does not look so scary, and it might be feasible. Most likely not worth the effort for your use case, but I am putting this out there.

That looks like a correct assessment… You can have a zero-one-inflated Beta distribution (basically, have separate parameters for the proportion of zeroes and ones), but that’s kind of hacky…
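A minimal, untested sketch of what I mean (parameter names are mine; alpha0 and alpha1 are the probabilities of an exact zero and an exact one, constrained so that alpha0 + alpha1 < 1):

```stan
model {
  for (j in 1:J) {
    if (theta[j] == 0) {
      target += log(alpha0);                 // point mass at zero
    } else if (theta[j] == 1) {
      target += log(alpha1);                 // point mass at one
    } else {
      target += log1m(alpha0 + alpha1)       // continuous part
                + beta_proportion_lpdf(theta[j] | mu[j], kappa[j]);
    }
  }
}
```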

Unfortunately, I don’t have a good, complete solution to the issue.

Hope that helps you make some additional progress.


This is helpful, thanks! Why would you not think it worth the effort in my case? More broadly, why does it matter if something is properly normalized?

I think I can proceed without figuring all this out for now. The fits work well enough for my purposes. But it is disconcerting that, e.g., my pseudo-continuous-binomial and the beta-proportion models yield different estimates, ones that fall outside each other’s 95% CIs. So eventually I’d like to understand what is being traded off with these various choices.

I think it is not worth the effort because even if the overall idea works well (which is not a given and may require some fiddling with numerics), it is IMHO likely to make the model runtime prohibitive for even modestly sized datasets (note that you would need a numerical integration call per data point for every log-density evaluation).

Normalization constants don’t matter as long as they are constant (i.e. don’t depend on parameters). When the normalization depends on parameters (as is the case for the continuous binomial version we discussed), you might get incorrect answers. To be specific, since the normalization constant should go to infinity as p \to 0 or p \to 1, ignoring the term will make the model treat the data as stronger evidence against extreme values of p than it should, and thus the posterior mass assigned to extreme p will be lower than it should be.
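If you really wanted to include the p-dependent normalization, an untested sketch of what it could look like with integrate_1d (names are mine; the data, transformed data and parameters blocks are omitted, and x_i_dummy would be an empty int array declared as data):

```stan
functions {
  // Unnormalized continuous-binomial density as a function of k;
  // p is passed via theta (so gradients w.r.t. p are handled),
  // N is passed via x_r.
  real cb_integrand(real k, real kc, array[] real theta,
                    array[] real x_r, array[] int x_i) {
    real p = theta[1];
    real N = x_r[1];
    return exp(lchoose(N, k) + k * log(p) + (N - k) * log1m(p));
  }
}
model {
  for (j in 1:J) {
    // Z(p[j], n_star[j]): integral of the unnormalized density
    // over k in [0, n_star[j]]
    real log_Z = log(integrate_1d(cb_integrand, 0.0, n_star[j],
                                  {p[j]}, {n_star[j]}, x_i_dummy));
    target += lchoose(n_star[j], y_star[j])
              + y_star[j] * log(p[j])
              + (n_star[j] - y_star[j]) * log1m(p[j])
              - log_Z;
  }
}
```

That is one quadrature call per cell for every log-density evaluation, which is why I doubt it is worth it in your case.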

One way to figure this out without doing a lot of math is to use simulated data. You pick known values of the model parameters, simulate the whole population, then simulate the survey, compute survey weights the same way they were computed for your original data, and then fit your model(s). Whichever model is a better approximation should give estimates closer to the values you picked for your simulation. You can either try this informally with a couple of hand-picked values, or you can be more thorough and do SBC or a similar check.
