In my situation, post-processing in R isn’t an option and casting to integer isn’t going to be a good approximation (the number of trials is often quite small) so I am wondering if anyone has any pointers for generating “continuous” binomially (binomial but with a non-integer number of trials) distributed random numbers? Or just a good starting place for how I would code up the “inverse CDF” method as indicated in the linked thread?
You should be able to run a relatively simple Rejection sampling scheme. If f(x) is the (possibly unnormalized) density of the distribution you want to sample from, the core idea is:
Find a continuous proposal distribution G with the normalized density g(x) close in shape to the target, that you can easily sample from - in this case a normal distribution with matching mean + variance could work quite well. In many cases, uniform distribution with the correct range will also work quite well (and is very simple to work with).
Find a constant M, such that \forall x: f(x) \leq Mg(x)
Sample c \sim G and u \sim Uniform(0, Mg(c)) if u < f(c) then c is your sample. Otherwise repeat this step.
That sounds mildly suspicious. If you are trying to approximate a binomial distribution where the number of trials is itself unknown, that if the number of trials is small, this might be a pretty poor approximation. (there might be other uses of the continuous extension of binomial that I am unaware of and then this may not apply).
Note that if the number of trials is small, a model explicitly marginalizing the number of trials will probably be quite efficient and will give the same results as a model treating the number of trials as an integer parameter…
Do you know of an example of rejection sampling implemented in stan? I can likely figure it out but a working example always helps. I’d like to generate the ppc draws in stan itself since I am not working in R or python.
“That sounds mildly suspicious” Can you expand on that?
My situation is a survey which I am aggregating somewhat. For example, the survey might have 10 age categories and I am aggregating to 5, etc. For the aggregation, I am using survey weights, leading to non-integer counts. Each cell of the aggregated survey data thus has a non-integer “number of people” with a non-integer number of successes (people who voted, in this case). Different cells have different numbers of people, ranging from ~1 to ~30. I’m not sure if that’s “small” in the sense you mean.
And can you point to an example or place to explain what it would mean to marginalize the number of trials? I understand that means to integrate over them in some way but I don’t think I quite see what that means explicitly. Would this lead to something like a beta distribution in the proportion of voters? Then what happens to the differences in variance coming from the different numbers of people in the cells?
So this is a different situation than I thought - if the counts are fractional, but known, than this might be less of a problem than if they are treated as unknown. Also marginalization makes only sense in the case the counts would be unknown, so it is not relevant to your case.
I am not an expert on surveys, but I would also expect this situation to be handled well without need for either fractional counts or fractional successes. Do I understand correctly, that you have a dataset with integer counts and integer successes which you then post-process yourself? (that’s substantially different than having just the post-processed numbers). If yes, then I’d expect the case to be well handled by weighting the likelihood contributions, i.e.:
target += w * binomal_lpmf(...);
And for PP checks you would then mimic the sampling process you actually expect. E.g. if you had just two groups (A,B) with N_A people (in the total population) being A and N_B being B, you would then simulate the responses for the whole population (or large enough population) and the simulate taking a sample of the same size as your survey, with selection weighted by the group weights. This would mean you could get different cell counts in the posterior predictive simulation than in your datasset, but the PPC would then focus on the observed proportions, which you could compare directly…
But I am not 100% sure this makes sense for you - I repeat I am no expert on surveys and also that in the case you describe the continous approximation is likely less of an issue.
Yes!..And no. Yes I do the post-processing myself. But the counts and successes are non-integral in different ways. That is, I don’t just have cell weights. There’s also a design-effect correction, which shrinks the number of people without also shrinking the number of successes. Though maybe I am doing that bit wrong? Anyway, for now, it’s not just a cell weight.
Regardless, focusing on proportions in the ppc might be helpful here.
And even what’s confusing here is helpful and eventually clarifying! Thank you.
Could you elaborate how this works when you collapse the data? It seems that if you shrink the number of people but not the number of successes, then you end up braising your estimates. Is this something that you could easily make some fake data to demonstrate the process?
@martinmodrak 's suggestions make sense to me, but sounds like it doesn’t quote match what you’re thinking.
Thanks, @simonbrauer for pointing this out! I explained incorrectly. There are two things going on: one is that the ratio of successes to trials for the aggregate cells is computed via a weighted average over the cells being aggregated. The cells being aggregated are then treated as a single cell with an effective sample size. That effective sample size is computed by finding the sum (s), mean (m) and variance (v) of the weights and setting effective sample size = \frac{s}{1 +v/m^2}.
So, for example, imagine we have
weight
surveyed
voted
1.2
3
2
0.9
5
3
1.1
4
2
we would aggregate to a new cell with proportion of success = [1.2(2/3) + 0.9(3/5) + 1.1(2/4)]/(1.2 + 0.9 +1.1) = 0.59
and effective sample size = 3.2 / (1 + 0.0023) = 3.19.
So I would end up with trials = 3.19 and successes = 0.59 * 3.19 = 1.89.
In particular, I think this preserves the (weighted) estimates of success. And does not, I don’t think, leave me with a way to treat these new cells as weighted binomial cells or the original cells as weighted binomial cells. So I need to use some continuous extension of the binomial or, as I am pondering now, model the proportions via a beta distribution with shape parameters determined by the modeled probability and given effective sample size. Which might be the same thing?
(BTW: this is all following this Ghitza & Gelman paper, p. 5. Though I might not be doing it right! In particular, I think there is a typo/mistake in the paper. They choose as aggregate number of successes the weighted ratio times the sum of the weights rather than the effective sample size, which, since the effective sample size is used as the aggregate number of trials, would lead to changing the estimates. I am (maybe mistakenly!) assuming this is an error. Or perhaps I am misunderstanding more deeply.)
Will think this over. In the meantime, I believe s should actually be the sum of the sample sizes. Note that your denominator is the design effect (equation 6 in the linked paper), so s is equivalent to n_j in the equation just below in the text.
Oh! You’re right! The n_j in the paper is the unweighted sum.
And oddly, when I look at my code, I had that part correct. Sorry for the confusion above.
So, we agree, I think. s above should be the sum of the number of sample sizes. And the definition of effective sample size (ESS) given above is correct once that change is understood.
Where the paper confuses me though, is in what to use for the number of successes. What would make sense to me, is to compute the weighted proportion of successes and then multiply that ratio by the ESS. But that’s not what happens in the paper. There, the successes are set to the weighted proportion of successes multiplied by the sum of sample sizes (s rather than ESS).
This seems like it would raise the probability of success across the board, as well as risk having aggregate cells where the number of successes exceed the number of trials.
Do you know of a good source for this way of handling weights when partially aggregating? I did a bunch of searching when first trying to figure this out but could not find anything helpful.
Either way, as the paper also indicates, this requires use of something like a continuous Binomial. The paper says they just use the Binomial density evaluated at non-integer counts. But they don’t say if/how they managed a RNG.
My current attempt is to use a beta_rng with shape parameters chosen to have the correct mean and variance. Which might be fine but I am trying to understand this all better and see if there’s a more straightforward or correct way.
I checked the published version of the paper and confirmed that the equations are the same as the working draft that you linked. So either it is a typo or there is something we are not understanding.
I’m sure there is a reasonable way to generate random draws from such a density. The problem is that you/Ghitza and Gelman are use a pseudo-posterior, which does not describe the actual data-generating process. So there isn’t a natural link between how the data were actually generated and how you’re modelling the data. This is, by my reading, one of the core challenges in Bayesian approaches to weighted survey data. See Survey weighted regression, Are complex surveys feasible in brms?, and (Pseudo) Bayesian Inference for Complex Survey Data.
That is incredibly helpful! Both confirmation that the paper is confusing (or incorrect) and the very-clear description of the disconnect between model and DGP. I’ve read an assortment of things alluding to this issue but in such technical language that I don’t think I quite understood it as simply as this.
I’ll read these the threads you linked though I think I might also need to muddle forward with this (or similar) model/pseudo-posterior, so any more insight into the best way to generate random draws consistent with modeled density would be appreciated!
To whit: my current rng is, as mentioned above, a beta_rng with shape parameters set to match the mean and variance of the continuous-binomial. I’m not sure this is the best choice but I’m somewhat at a loss for how to figure that out.
Got rejection sampling working, from a uniform distribution for now:
}
real scalar_real_binomial_logit_rng(data real trials, real lp) {
real p = inv_logit(lp);
real k = trials * p;
real maxB = exp(lchoose(trials, k) + k * log(p) + (trials - k) * log1m(p));
real proposal = uniform_rng(0.0, trials);
while (uniform_rng(0.0, maxB) >= exp(lchoose(trials, proposal) + proposal
* log(p) + (trials - proposal)
* log1m(p))) {
proposal = uniform_rng(0.0, trials);
}
return proposal;
}
and it seems to be working fine. For my application this is not noticeably slower than the discrete binomial_rng but that’s likely because I have some memory/IO bottlenecks when building/writing the generated_quantities and then when merging those with the original samples to build things shinystan can use.
I plan to figure out how to sample from a suitable Normal at some point and see if that’s faster…