Dispersion and binomial vs beta-binomial

I have census data: counts of people in different locations in the US with various characteristics, including sex, age, race/ethnicity, and education level. I am trying to model education level, as a binary (college graduate or non-college graduate), using US state, sex, age, and race/ethnicity as predictors.

My first attempt used binomial_logit with a hierarchical state-level intercept and non-hierarchical intercepts for various combinations of the other three predictors and their possible interactions.

All versions of that result in models that are quite under-dispersed compared to the data. The fits are “bad” in the LOO sense (more than a third of the Pareto-k values are “very bad”), and posterior predictive checks show a matching (median) mean but modeled standard deviations that are much too low. There are no divergences or tree-depth issues.

So I tried a beta-binomial, with shape parameters a_k = \phi_k \mu_k and b_k = \phi_k (1 - \mu_k), using the same parameterization of the mean probability \mu_k as in the binomial, where k indexes whatever combination of sex, race/ethnicity, age, and interactions is in that version of the model. As suggested in another thread, I parameterize the dispersion for each category as \phi_k = \frac{1}{1 - \theta_k} with \theta_k \sim \textrm{beta}(\alpha, \beta).
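For concreteness, the likelihood piece of one of these models looks roughly like the Stan sketch below; the data names, the single category index cat, and the priors are placeholders standing in for the full set of intercepts and interactions.

data {
  int<lower=1> N;                        // cells of the aggregated census data
  int<lower=1> K;                        // demographic categories
  array[N] int<lower=1, upper=K> cat;    // category index k for each cell
  array[N] int<lower=0> trials;          // people in the cell
  array[N] int<lower=0> grads;           // college graduates in the cell
  real<lower=0> prior_alpha;             // hyperparameters of the dispersion prior
  real<lower=0> prior_beta;
}
parameters {
  vector[K] eta;                         // stand-in for the full linear predictor
  vector<lower=0, upper=1>[K] theta;     // per-category dispersion
}
transformed parameters {
  vector[K] mu = inv_logit(eta);
  vector[K] phi = inv(1 - theta);        // phi_k = 1 / (1 - theta_k)
}
model {
  eta ~ normal(0, 2);
  theta ~ beta(prior_alpha, prior_beta);
  // a_k = phi_k * mu_k, b_k = phi_k * (1 - mu_k)
  grads ~ beta_binomial(trials, phi[cat] .* mu[cat], phi[cat] .* (1 - mu[cat]));
}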

When the \theta prior is reasonably wide, these fits are much better in the LOO sense, but now my model is over-dispersed and the (median) modeled mean is somewhat higher than the mean of the data. Here again there are no divergences or other problematic diagnostics, except sometimes a parameter or two with low effective sample size.

I can use \alpha and \beta to force the beta-binomial to be very binomial-like, but I get bad LOO results before I get the right dispersion or mean.

All of which makes me think this isn’t quite the right model! But I’m not sure what to try next or how to re-parameterize. Any ideas would be most welcome!

Andrew Gelman has a great example wherein he goes through augmenting a binomial likelihood in stages.

The writeup is great by itself, but you might want to look closely at the section Fitting the New Model to Data, where he added an independent error term to each observation. You might try that to eliminate some of the dispersion issues you see (and also look at his conclusions regarding this “hack” and his subsequent ideas for improving it).
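To make the idea concrete, here is a minimal Stan sketch of that kind of observation-level error term added to a binomial-logit likelihood; the data names, the single category index cat, and the priors are placeholders rather than your actual predictors.

data {
  int<lower=1> N;                        // observations
  int<lower=1> K;                        // categories
  array[N] int<lower=1, upper=K> cat;
  array[N] int<lower=0> trials;
  array[N] int<lower=0> grads;
}
parameters {
  vector[K] eta;                         // stand-in for the usual linear predictor
  vector[N] eps;                         // independent logit-scale error per observation
  real<lower=0> sigma;                   // how much extra dispersion is needed
}
model {
  eta ~ normal(0, 2);
  sigma ~ normal(0, 1);
  eps ~ normal(0, sigma);                // the “hack”: one error term per observation
  grads ~ binomial_logit(trials, eta[cat] + eps);
}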

This is all to say, your issues with binomial are shared by others!

Thanks for this suggestion!

I’m happy to share a distribution that I’m working on. Your mileage may vary (it is still a work in progress, and it’s possible that the beta-binomial handles over-dispersion better), but maybe it will be helpful.

I call it the “scaled binomial distribution” (x \sim \text{Scaled Binomial}(n,p,s)) because it can scale x \sim \text{Binomial}(n,p) such that it looks like x \sim \text{Binomial}(n \times s,p) for any positive s. For example,

  • \text{Scaled Binomial}(20, 0.25, 4) looks like \text{Binomial}(80, 0.25)
  • \text{Scaled Binomial}(20, 0.25, 0.25) looks like \text{Binomial}(5, 0.25)
  • \text{Scaled Binomial}(20, 0.25, 1) looks like \text{Binomial}(20, 0.25)

The relevant functions are below and in the attached files. Best of luck!

functions{
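    // Returns the full pmf over counts 0..n. multi_mat (lower-triangular ones)
    // turns the per-count terms into cumulative sums, i.e. unnormalized
    // log-probabilities, which softmax() then normalizes.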
    vector scaled_binomial_prob(int n, real p, real s){
        real log_odds = log(p) - log(1 - p);
        real n_real = n;
        vector[n] n_seq;
        vector[n+1] raw_scaler1;
        vector[n+1] raw_scaler2;
        matrix[n+1, n+1] multi_mat = rep_matrix(0, n+1, n+1);
        vector[n+1] probs;

        for(i in 1:n){
            n_seq[i] = i;
        }

        raw_scaler1[1] = 0;
        raw_scaler1[2:(n+1)] = lgamma(s * (n_seq - 1) + 1) - lgamma(s * n_seq + 1);

        raw_scaler2[1] = 0;
        raw_scaler2[2:(n+1)] = lgamma(s * (n_real - n_seq + 1) + 1) - lgamma(s * (n_real - n_seq) + 1);

        for(n1 in 1:(n+1)){
            for(n2 in 1:n1){
                multi_mat[n1, n2] = 1;
            }
        }

        probs = softmax(multi_mat * (s * log_odds + raw_scaler1 + raw_scaler2));

        return(probs);
    }

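    // Log-pmf for a single count x in 0..n; same construction as
    // scaled_binomial_prob, returning log(probs[x + 1]).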
    real scaled_binomial_lpmf(int x, int n, real p, real s){
        real log_odds = log(p) - log(1 - p);
        real n_real = n;
        vector[n] n_seq;
        vector[n+1] raw_scaler1;
        vector[n+1] raw_scaler2;
        matrix[n+1, n+1] multi_mat = rep_matrix(0, n+1, n+1);
        vector[n+1] probs;
        real lp;

        for(i in 1:n){
            n_seq[i] = i;
        }
        raw_scaler1[1] = 0;
        raw_scaler1[2:(n+1)] = lgamma(s * (n_seq - 1) + 1) - lgamma(s * n_seq + 1);

        raw_scaler2[1] = 0;
        raw_scaler2[2:(n+1)] = lgamma(s * (n_real - n_seq + 1) + 1) - lgamma(s * (n_real - n_seq) + 1);

        for(n1 in 1:(n+1)){
            for(n2 in 1:n1){
                multi_mat[n1, n2] = 1;
            }
        }

        probs = softmax(multi_mat * (s * log_odds + raw_scaler1 + raw_scaler2));
        lp = log(probs[x+1]);

        return(lp);
    }

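    // Vectorized overload: a single n and s shared across observations, with a
    // separate success probability p[i] for each count x[i].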
    real scaled_binomial_lpmf(array[] int x, int n, vector p, real s){
        int nobs = num_elements(x);
        vector[nobs] log_odds = log(p) - log(1 - p);
        real n_real = n;
        vector[n] n_seq;
        vector[n+1] raw_scaler1;
        vector[n+1] raw_scaler2;
        matrix[n+1, n+1] multi_mat = rep_matrix(0, n+1, n+1);
        vector[nobs] lp;

        for(i in 1:n){
            n_seq[i] = i;
        }
        raw_scaler1[1] = 0;
        raw_scaler1[2:(n+1)] = lgamma(s * (n_seq - 1) + 1) - lgamma(s * n_seq + 1);

        raw_scaler2[1] = 0;
        raw_scaler2[2:(n+1)] = lgamma(s * (n_real - n_seq + 1) + 1) - lgamma(s * (n_real - n_seq) + 1);

        for(n1 in 1:(n+1)){
            for(n2 in 1:n1){
                multi_mat[n1, n2] = 1;
            }
        }

        for(i in 1:nobs){
            int y = x[i]+1;
            vector[n+1] probs = softmax(multi_mat * (s * log_odds[i] + raw_scaler1 + raw_scaler2));
            lp[i] = log(probs[y]);
        }

        return(sum(lp));
    }
}
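If it helps, here is a minimal sketch of how these could be dropped into a model; the data names and priors are just illustrative, and the vectorized overload assumes a single n shared across observations.

// functions block with the scaled_binomial functions from above goes here
data {
  int<lower=1> N;                        // observations
  int<lower=1> n;                        // trials per observation (shared)
  array[N] int<lower=0, upper=n> y;      // counts
}
parameters {
  vector<lower=0, upper=1>[N] p;         // per-observation success probability (placeholder)
  real<lower=0> s;                       // scaling factor: effective trials are roughly n * s
}
model {
  p ~ beta(2, 2);
  s ~ lognormal(0, 1);
  y ~ scaled_binomial(n, p, s);          // uses the vectorized scaled_binomial_lpmf overload
}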

scaled binomial preds.stan (3.5 KB)
Scaled Binomial.R (6.5 KB)
scaled binomial.stan (3.6 KB)


That’s interesting! You did that because you wanted to work with relatively small n, so just scaling n itself wouldn’t work because you’d end up with non-integer values? I often have the large-n case, so I do something similar just by using integer division and accepting the small errors that come from that. Is there an underlying way of thinking about it, as there is for the beta-binomial (multiple Bernoulli trials, but with p drawn from a beta distribution rather than fixed)?
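(By integer division I mean something like the sketch below, with a fixed divisor d chosen ahead of time rather than an estimated scaling parameter; the names and the throwaway intercept-only model are just illustrative.)

data {
  int<lower=1> N;
  array[N] int<lower=0> trials;
  array[N] int<lower=0> successes;
  int<lower=1> d;                        // fixed deflation factor for the effective sample size
}
transformed data {
  array[N] int trials_scaled;
  array[N] int successes_scaled;
  for (i in 1:N) {
    trials_scaled[i] = trials[i] %/% d;         // integer division keeps the counts
    successes_scaled[i] = successes[i] %/% d;   // integers, at the cost of small rounding errors
  }
}
parameters {
  real alpha;                            // placeholder intercept
}
model {
  alpha ~ normal(0, 2);
  successes_scaled ~ binomial_logit(trials_scaled, alpha);
}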

It might also be interesting to graph scaled_binomial along with beta_binomial having the same mean and variance.

Thanks. I might try this!

Adam

Yes, you got it. I developed it while working with mixture models of discrete outcomes where n \le 20. The binomial was conceptually appealing but produced very fuzzy clusters, so this allowed the scaling factor to be estimated as a continuous parameter rather than by directly multiplying/dividing n, including in cases where the resulting product would be non-integer.

I agree that it would be interesting to compare the two. I’ve never worked with beta-binomial, but I’ll add it to my list of things to compare.

Okay, that Gelman example was very helpful! I had tried adding error to each observation, but using a separate parameter for each row of data, and that didn’t help and was a lot of parameters. Using the normal approximation to the binomial and a single extra source of error, added on top of the binomial error for the given number of trials, fits about as well as the beta-binomial but is much better in the LOO sense, and is much faster to fit as well. After I do a few other things, I think I will return to this and explore whether having some structure in those errors (one per state? one per predictor?) improves the model. The (median) mean, standard deviation, min, and max still mismatch the data, though by no more than the beta-binomial did.
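For concreteness, the likelihood I ended up with looks roughly like the sketch below, with placeholder names and priors again and eta standing in for the full set of intercepts and interactions.

data {
  int<lower=1> N;                        // cells
  int<lower=1> K;                        // demographic categories
  array[N] int<lower=1, upper=K> cat;
  array[N] int<lower=1> trials;          // people in the cell
  array[N] int<lower=0> grads;           // college graduates in the cell
}
transformed data {
  vector[N] n_trials = to_vector(trials);
  vector[N] prop = to_vector(grads) ./ n_trials;   // observed graduate proportion
}
parameters {
  vector[K] eta;                         // stand-in for the full linear predictor
  real<lower=0> sigma;                   // the single extra source of error, on the proportion scale
}
model {
  vector[N] mu = inv_logit(eta[cat]);
  eta ~ normal(0, 2);
  sigma ~ normal(0, 0.1);
  // normal approximation to the binomial, plus one shared extra variance term
  prop ~ normal(mu, sqrt(mu .* (1 - mu) ./ n_trials + square(sigma)));
}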
