# Dispersion and binomial vs beta-binomial

I have census data: counts of people in different locations in the US with various characteristics, including sex, age, race/ethnicity, and education level. I am trying to model education level, as a binary (college graduate or non-college graduate), using US state, sex, age, and race/ethnicity as predictors.

My first attempt was using binomial_logit with a hierarchical state-level intercept and non-hierarchical intercepts for various combinations of the other three and their possible interactions.

All versions of that result in models that are quite under-dispersed compared to the data. The fits are “bad” in the LOO sense (more than a third of the Pareto-k values are “very bad”), and posterior predictive checks show a matching (median) mean but modeled standard deviations that are much too low. There are no divergences or tree-depth issues.

So I tried a beta-binomial, with shape parameters a_k = \phi_k \mu_k and b_k = \phi_k (1 - \mu_k) and the same parameterization of the mean probability \mu_k as in the binomial, where k indexes whatever combination of sex, race/ethnicity, age and interactions are in that version of the model. As suggested in another thread, I parameterize the dispersion for each category as \phi_k = \frac{1}{1 - \theta_k} with \theta_k \sim \textrm{beta}(\alpha, \beta).
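For concreteness, here is a quick numerical check of what this parameterization implies (my own sketch in Python; the specific n, \mu, and \theta values are made up). With a_k = \phi_k \mu_k and b_k = \phi_k (1 - \mu_k), the beta-binomial keeps the binomial mean n\mu but inflates the variance by the factor (\phi + n)/(\phi + 1), so \theta \to 1 (\phi \to \infty) recovers the binomial:

```python
import numpy as np
from scipy.stats import betabinom, binom

# Illustrative values only; in the model, mu_k comes from the linear predictor.
n, mu, theta = 50, 0.3, 0.9
phi = 1.0 / (1.0 - theta)          # dispersion; note phi > 1 always, theta -> 1 gives phi -> inf
a, b = phi * mu, phi * (1.0 - mu)  # beta-binomial shape parameters

bb = betabinom(n, a, b)
bi = binom(n, mu)

print(bb.mean(), bi.mean())                        # both equal n * mu
print(bb.var() / bi.var(), (phi + n) / (phi + 1))  # variance inflation factor
```

One thing this makes visible: since \phi = 1/(1-\theta) is bounded below by 1, even \theta \to 0 still leaves substantial over-dispersion for large n, which may be part of why the fits overshoot.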

When the \theta prior is reasonably wide, these fits are much better in the LOO sense but now my model is over-dispersed and the (median) modeled mean is somewhat higher than the mean of the data. Here again, there are no divergences or other problematic diagnostics except sometimes a parameter or two with low effective sample size.

I can use \alpha and \beta to force the beta-binomial to be very binomial-like, but I get bad LOO results before I get the right dispersion or mean.

All of which makes me think this isn’t quite the right model! But I’m not sure what to try next or how to re-parameterize. Any ideas would be most welcome!

Andrew Gelman has a great example in which he goes through augmenting a binomial likelihood in stages.

The writeup is great by itself, but you might want to look closely at the section Fitting the New Model to Data, where he adds an independent error term to each observation. You might attempt that to eliminate some of the dispersion issues you see (and also look at his conclusions regarding this “hack” and his subsequent ideas for improving it).
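As a toy illustration of why that error term soaks up extra dispersion, here is the generative side of the idea simulated in Python rather than Stan (all numbers are made up): each observation gets its own noise term on the logit scale, and the resulting counts are far more dispersed than a plain binomial with the same average probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: counts of size n, a common logit-scale intercept,
# and an independent normal error per observation.
n, mu_logit, sigma = 100, -0.5, 0.8
n_obs = 20000

eps = rng.normal(0.0, sigma, n_obs)           # observation-level error term
p = 1.0 / (1.0 + np.exp(-(mu_logit + eps)))   # per-observation success probability
y = rng.binomial(n, p)                        # over-dispersed counts

p_bar = p.mean()
binom_var = n * p_bar * (1.0 - p_bar)         # variance a plain binomial would give
print(y.var(), binom_var)                     # empirical variance is much larger
```

The marginal variance picks up an extra n^2 Var(p) term on top of the binomial component, which is exactly the slack the hierarchical error term buys you.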

This is all to say, your issues with binomial are shared by others!

Thanks for this suggestion!

I’m happy to share a distribution that I’m working on. Your mileage may vary (it is still a work in progress, and it’s possible that the beta-binomial handles over-dispersion better), but maybe it will be helpful.

I call it the “scaled binomial distribution” (x \sim \text{Scaled Binomial}(n,p,s)) because it can scale x \sim \text{Binomial}(n,p) such that it looks like x \sim \text{Binomial}(n \times s,p) for any positive s. For example,

• \text{Scaled Binomial}(20, 0.25, 4) looks like \text{Binomial}(80, 0.25)
• \text{Scaled Binomial}(20, 0.25, 0.25) looks like \text{Binomial}(5, 0.25)
• \text{Scaled Binomial}(20, 0.25, 1) looks like \text{Binomial}(20, 0.25)

The relevant functions are below and in the attached files. Best of luck!

functions {
  // Probability vector over counts 0..n for x ~ Scaled Binomial(n, p, s).
  vector scaled_binomial_prob(int n, real p, real s) {
    real log_odds = log(p) - log(1 - p);
    real n_real = n;
    vector[n] n_seq;
    // Declared filled with zeros so the first element (count 0) stays 0;
    // a bare `raw_scaler1 = 0;` would not compile (scalar assigned to vector).
    vector[n + 1] raw_scaler1 = rep_vector(0, n + 1);
    vector[n + 1] raw_scaler2 = rep_vector(0, n + 1);
    matrix[n + 1, n + 1] multi_mat = rep_matrix(0, n + 1, n + 1);

    for (i in 1:n) {
      n_seq[i] = i;
    }

    raw_scaler1[2:(n + 1)] = lgamma(s * (n_seq - 1) + 1) - lgamma(s * n_seq + 1);
    raw_scaler2[2:(n + 1)] = lgamma(s * (n_real - n_seq + 1) + 1)
                             - lgamma(s * (n_real - n_seq) + 1);

    // Lower-triangular matrix of ones; multiplying by it takes cumulative sums.
    for (n1 in 1:(n + 1)) {
      for (n2 in 1:n1) {
        multi_mat[n1, n2] = 1;
      }
    }

    return softmax(multi_mat * (s * log_odds + raw_scaler1 + raw_scaler2));
  }

  // Log pmf for a single observation.
  real scaled_binomial_lpmf(int x, int n, real p, real s) {
    vector[n + 1] probs = scaled_binomial_prob(n, p, s);
    return log(probs[x + 1]);
  }

  // Vectorized log pmf: one probability per observation, shared n and s.
  // The scalers and the cumulative-sum matrix are computed once, outside the loop.
  real scaled_binomial_lpmf(array[] int x, int n, vector p, real s) {
    int nobs = num_elements(x);
    vector[nobs] log_odds = log(p) - log(1 - p);
    real n_real = n;
    vector[n] n_seq;
    vector[n + 1] raw_scaler1 = rep_vector(0, n + 1);
    vector[n + 1] raw_scaler2 = rep_vector(0, n + 1);
    matrix[n + 1, n + 1] multi_mat = rep_matrix(0, n + 1, n + 1);
    vector[nobs] lp;

    for (i in 1:n) {
      n_seq[i] = i;
    }

    raw_scaler1[2:(n + 1)] = lgamma(s * (n_seq - 1) + 1) - lgamma(s * n_seq + 1);
    raw_scaler2[2:(n + 1)] = lgamma(s * (n_real - n_seq + 1) + 1)
                             - lgamma(s * (n_real - n_seq) + 1);

    for (n1 in 1:(n + 1)) {
      for (n2 in 1:n1) {
        multi_mat[n1, n2] = 1;
      }
    }

    for (i in 1:nobs) {
      vector[n + 1] probs = softmax(multi_mat * (s * log_odds[i] + raw_scaler1 + raw_scaler2));
      lp[i] = log(probs[x[i] + 1]);
    }

    return sum(lp);
  }
}
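For anyone who wants to check the scaling claims outside Stan, here is a NumPy port of the pmf (my own translation, not part of the attached files). The lower-triangular matrix multiply becomes np.cumsum, and for integer s the lgamma terms telescope so that Scaled Binomial(n, p, s) at x is proportional to Binomial(s*n, p) evaluated at s*x:

```python
import numpy as np
from scipy.special import gammaln, softmax
from scipy.stats import binom

def scaled_binomial_pmf(n, p, s):
    """Port of scaled_binomial_prob: pmf over counts 0..n."""
    log_odds = np.log(p) - np.log1p(-p)
    k = np.arange(1, n + 1)
    rs1 = np.concatenate([[0.0], gammaln(s * (k - 1) + 1) - gammaln(s * k + 1)])
    rs2 = np.concatenate([[0.0], gammaln(s * (n - k + 1) + 1) - gammaln(s * (n - k) + 1)])
    # The lower-triangular matrix multiply in the Stan code is a cumulative sum.
    return softmax(np.cumsum(s * log_odds + rs1 + rs2))

# s = 1 recovers the ordinary binomial exactly.
print(np.allclose(scaled_binomial_pmf(20, 0.25, 1), binom.pmf(np.arange(21), 20, 0.25)))

# For integer s, the pmf at x matches Binomial(s*n, p) at s*x, renormalized.
target = binom.pmf(4 * np.arange(21), 80, 0.25)
print(np.allclose(scaled_binomial_pmf(20, 0.25, 4), target / target.sum()))
```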


scaled binomial preds.stan (3.5 KB)
Scaled Binomial.R (6.5 KB)
scaled binomial.stan (3.6 KB)


That’s interesting! Did you do that because you wanted to work with relatively small n, so just scaling n itself wouldn’t work because you’d end up with non-integer values? I often have the large-n case, so I do something similar just by using integer division and accepting the small errors that come from that. Is there an underlying way of thinking about it, as with the beta-binomial, where it’s multiple Bernoulli trials but with p drawn from a beta distribution rather than fixed?

It might also be interesting to graph scaled_binomial along with beta_binomial having the same mean and variance.
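One way to set up that comparison is to moment-match the beta-binomial to whatever mean and variance the other distribution gives you. A small sketch of the matching step (the function name and target moments here are my own, for illustration): using Var = n\mu(1-\mu)(1 + (n-1)\rho) with \rho = 1/(a+b+1), you can solve for (a, b) from a target mean and variance.

```python
import numpy as np
from scipy.stats import betabinom

def match_betabinom(n, mean, var):
    """Beta-binomial (a, b) with the given mean and variance (when feasible)."""
    mu = mean / n
    base = n * mu * (1.0 - mu)            # binomial variance at the same mean
    rho = (var / base - 1.0) / (n - 1.0)  # intra-class correlation; must lie in (0, 1)
    s = 1.0 / rho - 1.0                   # s = a + b
    return mu * s, (1.0 - mu) * s

# Made-up target moments for illustration.
a, b = match_betabinom(20, 5.0, 6.0)
d = betabinom(20, a, b)
print(d.mean(), d.var())  # recovers mean 5.0 and variance 6.0
```

Plotting this matched beta-binomial against the scaled binomial with the same first two moments would show where the shapes diverge, e.g. in the tails.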

Thanks. I might try this!