What follows isn’t exactly a question so much as a request for help understanding various approaches to modeling binomial-like things. I apologize if starting a new thread is not appropriate. I’m trying to synthesize the content of a few threads and it seemed simpler to start a new one.
I recently asked a question about a version of the “continuous” binomial, in particular about having a RNG for a density which for real N, k, and p is proportional to p^k(1-p)^{N-k}.
@martinmodrak and @simonbrauer were very helpful and I have that all working, as far as it goes.
But I realize that I’m not sure that the proposed density, \frac{\Gamma(N+1)}{\Gamma(k+1)\Gamma(N-k+1)}p^k(1-p)^{N-k} (for possibly non-integral N and k) is actually a density, much less a normalized one. And if it is, I’m not sure what it means, as in what data generating process it might represent. There is a oft-cited paper with a suggested continuous generalization of the Binomial and that does not appear to be the same as what I am trying.
In another thread here, @betanalpha says that the distribution proposed in that paper is difficult to implement/use in stan.
My situation seemed straightforward at first. I have survey data with demographic information and a binary outcome (voter turnout, in my case) which I want to aggregate somewhat (collapsing some demographic categories, etc.) before modeling the binary outcome as predicted by the various categories. Following this Ghitza & Gelman paper, I use the survey weights to derive new counts of “trials” and “successes” for the aggregate cells. The paper suggests (on page 764, the 4th page of the PDF) using something like the density I wrote above: “The resulting n^*_j, y^∗_j will not in general be integers, but we handle this by simply using the binomial likelihood function with non-integer data, which works fine in practice (and is in fact simply the weighted log-likelihood approach to fitting generalized linear models with weighted data). This falls in the category of quasi-likelihood methods (Wedderburn 1974).”
What does “work fine in practice” mean here?
For instance, though when I use this the model looks fine in various ways–no divergences, elpd loo has no high pareto ks–the PPC shows that the fit is sort of interpolating between all the 0 and 1 counts in an unsurprising but odd way:
and the ppc_loo_pit_qq is, perhaps predictably, odd:
Another possibility which is sometimes suggested is to model the proportions of successes to trials via a beta distribution. Stan’s “beta_proportion” functions are already set-up to handle this case, where the proportion and variance (here I am using a binomial-like variance) are known. So I’ve implemented this model as well. The trouble for me there, I think, is that I have many many cells with 1 trial and thus either 0 or 1 success. This leads to a posterior predictive check which looks like:
and a loo_pit_qq chart is similar to the binomial-like, though perhaps even less promising.
The two models yield somewhat different results in terms of the alphas for various categories and I’m a bit lost as to how I should choose between them or if I should approach this some other way entirely, one which might be an altogether better match for the data.
Any thoughts/suggestions would be much appreciated!