Binomial with very large numbers of trials

A quick question (I hope):

Is it possible for counts to be too large for modeling via binomial_logit? I am looking at election data and I had previously been modeling surveys with 60,000 total participants. Now I am trying to look at election results themselves (eventually hoping to merge the two). The election results (US presidential voting by state) have millions of voters and votes and the log densities I am getting are on the order of -3e-8.

I’m not sure if it’s an artifact of shinystan, but in all the plots of log-posterior vs. a parameter, the log-posterior seems to be fixed to a specific value (-254534000).

I’ve now thought through whether this is a bug in so many different ways that I’m lost. On the one hand, if I imagine approximating things with a normal, these densities seem much too large in magnitude. OTOH, if I think of them as probability mass in a binomial with hundreds of millions of trials, they don’t.

There are other signs that the model isn’t working: the chains aren’t mixing, lots of parameters have r-hat > 1.01, the autocorrelations look suspicious. And some signs that it is: PPCheck looks reasonable, residuals of 3000-4000 votes (out of millions) in each state.

So, my best attempt at concrete questions:

  1. Is this log-posterior a sign of a problem?
  2. If not, it’s going to make any meta-analysis with surveys difficult because the numbers of trials are so different. Is there a “right” way to handle that?


Bad r-hats and no mixing in lp__ are definitely indicative of a problem. I think if you’re looking for a more detailed answer, you’ll get more if you can post your Stan model. It’s possible that you’ve written things in a manner that’s prone to underflow and that could be easily fixed.

For a completely different angle, note that in the limit of large trials and low probability the binomial distribution becomes a Poisson. Perhaps the model could be usefully reframed in terms of Poisson distributions.


(Edit: removed unneccesary things from model)

Thanks for the reply @jsocolar! Code is below. I am fitting two things at once, voter turnout and preference. Each binomial_logit sampling statement produces a contribution to the log-posterior on the order of -1e8.

The data has large numbers of trials and probabilities between .3 and .8, so I think more normal than Poisson, right?

I’m thinking of shifting that way but then it will be more confusing to try and merge with the smaller surveys since I think there will be a relative weight issue or something.


data {
  int<lower=1> J_State;
  int<lower=1> N_ElectionsT;
  int<lower=1> ElectionsT_State[N_ElectionsT];
  int<lower=1> N_ElectionsP;
  int<lower=1> ElectionsP_State[N_ElectionsP];
  int<lower=0> CVAP_Elex[N_ElectionsT];
  int<lower=0> Voted_Elex[N_ElectionsT];
  int K_DM;
  matrix[N_ElectionsT,K_DM] DM_ElectionsT;
  int DM_Sex;
  int Sex_DM_Index;
  int DM_Education;
  int Education_DM_Index;
  int DM_Race;
  int Race_DM_Index;
  int DM_Density;
  int Density_DM_Index;
  int DM_WhiteNonGrad;
  int WhiteNonGrad_DM_Index;
  matrix[N_ElectionsP,K_DM] DM_ElectionsP;
  int<lower=0> VotesInRace_Elections[N_ElectionsP];
  int<lower=0> DVotesInRace_Elections[N_ElectionsP];
transformed data {
  int constIndex_ElectionsT[N_ElectionsT] = rep_array(0,N_ElectionsT);
  int constIndex_ElectionsP[N_ElectionsP] = rep_array(0,N_ElectionsP);
  vector[K_DM] mean_DM_ElectionsT = rep_vector(0,K_DM);
  matrix[N_ElectionsT,K_DM] centered_DM_ElectionsT;
  for (k in 1:K_DM) {
    mean_DM_ElectionsT[k] = mean(DM_ElectionsT[,k]);
    centered_DM_ElectionsT[,k] = DM_ElectionsT[,k] - mean_DM_ElectionsT[k];
  vector[K_DM] mean_DM_ElectionsP = rep_vector(0,K_DM);
  matrix[N_ElectionsP,K_DM] centered_DM_ElectionsP;
  for (k in 1:K_DM) {
    mean_DM_ElectionsP[k] = mean(DM_ElectionsP[,k]);
    centered_DM_ElectionsP[,k] = DM_ElectionsP[,k] - mean_DM_ElectionsP[k];
parameters {
  vector[J_State] alpha_rawT;
  real mu_alphaT;
  real<lower=0> sigma_alphaT;
  vector[K_DM] muT;
  vector<lower=0>[K_DM] tauT;
  cholesky_factor_corr[K_DM] LT;
  matrix[K_DM,J_State] betaT_raw;
  vector[J_State] alpha_rawP;
  real mu_alphaP;
  real<lower=0> sigma_alphaP;
  vector[K_DM] muP;
  vector<lower=0>[K_DM] tauP;
  cholesky_factor_corr[K_DM] LP;
  matrix[K_DM,J_State] betaP_raw;
transformed parameters {
  matrix[K_DM,J_State] betaT = rep_matrix(muT,J_State) + diag_pre_multiply(tauT,LT) * betaT_raw;
  vector[J_State] alphaT = mu_alphaT + sigma_alphaT * alpha_rawT;
  matrix[K_DM,J_State] betaP = rep_matrix(muP,J_State) + diag_pre_multiply(tauP,LP) * betaP_raw;
  vector[J_State] alphaP = mu_alphaP + sigma_alphaP * alpha_rawP;
model {
  to_vector(betaT_raw) ~ normal(0,1);
  alpha_rawT ~ normal(0,1);
  mu_alphaT ~ normal(0,1);
  sigma_alphaT ~ normal(0,1);
  muT ~ normal(0,1);
  tauT ~ normal(0,1);
  LT ~ lkj_corr_cholesky(4.0);
  to_vector(betaP_raw) ~ normal(0,1);
  alpha_rawP ~ normal(0,1);
  mu_alphaP ~ normal(0,1);
  sigma_alphaP ~ normal(0,1);
  muP ~ normal(0,1);
  tauP ~ normal(0,1);
  LP ~ lkj_corr_cholesky(4.0);
  vector[N_ElectionsT] voteDataBetaT_v;
  for (n in 1:N_ElectionsT) {
    voteDataBetaT_v[n] = alphaT[ElectionsT_State[n]] + dot_product(centered_DM_ElectionsT[n],betaT[,ElectionsT_State[n]]);
  print("before T target=",target());
  Voted_Elex ~ binomial_logit(CVAP_Elex,voteDataBetaT_v);
  vector[N_ElectionsP] voteDataBetaP_v;
  for (n in 1:N_ElectionsP) {
    voteDataBetaP_v[n] = alphaP[ElectionsP_State[n]] + dot_product(centered_DM_ElectionsP[n],betaP[,ElectionsP_State[n]]);
  print("before P target=",target());
  DVotesInRace_Elections ~ binomial_logit(VotesInRace_Elections,voteDataBetaP_v);
  print("after P target=",target());

Something is definitely wrong but I can’t see what. With such large numbers of trials, I can see how the binomial density is a big negative number when we are far away from a “good” fit. And I could imagine an underflow issue there, if stan got stuck because the number were too big.
But stan finds it’s way to a pretty good fit, where the parameters for each state give binomial probabilities which are very close (within 2e-4) to exactly right. And each state only has on the order of 1e7 voters (trials). So once we’re that close, I think the density should be on the order of -5000 or so (using a normal approximation to the binomial to estimate the density and considering all 50 states).
But whether we are far away from the solution or not, when I print the target after applying the sampling statement, I go from something reasonable from the priors (-1000 or so) to something like -1e8. And the only thing happening in between is the Voted_Elex ~ binomial_logit(CVAP_Elex,voteDataBetaT_v); line.

I took one evaluation of binomial_logit(CVAP_Elex,voteDataBetaT_v), got all the numbers --copied the vector of trials and successes from the json and the voteDataBetaT_v values from the output of a print statement. I put it all in a spreadsheet and estimated the log-density via the normal approximation. Each one of the 51 states (50 + DC) has a log-density of about -2 or -3. The probabilities implied by the values in voteDataBetaT_v are very close. So the contribution to the overall density should be around -200. Instead I am getting -1e8. I’m sort of running out of things to look at. I didn’t think it was a bug before but now maybe?

Update 2: Looks more like a bug? Some kind of underflow in binomial_logit for large N?
I replaced binomial_logit with normal (after computing the right means and standard deviations from voteDataBetaT_v and the vector of the number of trials. And now it works fine.
But this is not a good solution for me since I will have a mix of data with large N and small N so I’d like binomial_logit to work…or a something that can mix them depending on the value of N? I don’t know how to write that and definitely not how to write it efficiently…

Stan only needs the probability density up to proportionality. I think the numbers you’re seeing are because binomial sampling statements don’t bother computing the binomial coefficient.

p\left(n|N,p\right) = \binom{N}{n}p^n\left(1-p\right)^{N-n} \propto p^n\left(1-p\right)^{N-n}

Omitting the constant is equivalent to reinterpreting the binomial as N i.i.d. Bernoulli trials. It doesn’t change the posterior because the events are independent either way but does change the likelihood because now different orderings of successes and failures are considered distinct outcomes.

You can compute the exact binomial likelihood using a target increment instead of a sampling statement

// n ~ binomial_logit(N, logit_p);
target += binomial_logit(n | N, logit_p);

It shouldn’t affect the samples unless there’s a bug somewhere.

Stan allows you to use normal control flow as long as it depends only on the data, for example

for (n in 1:N_ElectionsP) {
  if (VotesInRace_Elections[n] < 1000) {
    DVotesInRace_Elections[n] ~ binomial_logit(VotesInRace_Elections[n],voteDataBetaP_v[n]);
  } else {
    DVotesInRace_Elections[n] ~ normal(VotesInRace_Elections[n]*...);
1 Like

Thank you so much, @nhuurre !

I had gotten partway to that on my own, in the sense that I’d realized that the p^nq^{N-n} bit could get very small and then, after looking at the normal approximation, realized that the N\choose n bit is what “fixes” that. But I hadn’t yet connected that back to the (well-documented!) fact that Stan drops constants in the “lupmf” functions and thus when using the “~” sampling notation.

Quick note: In your fix above–which works! thank you!–it should be target += binomial_logit_lpmf(...) rather than target += binomial_logit(...)

I think the documentation could make this clearer in two ways:

  • In all the places where the dropping of constants is mentioned, it could be explained that it’s constants with respect to the parameters of the model. I realize that should be clear but maybe it’s worth the extra words? Or a footnote/link to an example? Somehow in my head constants were things like \sqrt(2\pi) not like N\choose n, though in retrospect, I get it.
  • It might also be worth adding an explicit note on the page about the binomial and binomial_logit, just explaining that for large numbers of trials, these are prone to underflow if the constants are dropped. Or maybe that’s general to a few discrete distributions and worth mentioning in all of them?

Where is a good place to suggest that/start that conversation?

And thanks again. I gave up on this yesterday with some frustration: I understood the problem somewhat but could not see a simple way to fix it. You not only answered the question, but explained your answer clearly and showed how to fix it and how I might have approached it if there was no simple fix. That was all very very helpful.