Beta-binomial with 2 data sources

TL;DR is it correct that when modeling data with a beta-binomial, data rows with lower counts will have a smaller effect/influence on the parameter distributions than rows with larger counts?

I’m trying to model voter preference in US house elections using 2 sources of data. One is the election results in each congressional district. I have a total number of votes, the number of those votes cast for a particular candidate and some demographic info about the district, e.g., %of people with college degrees and %of people over 45. I model this with a beta binomial and get reasonable answers. RHats all look good and loo seems pretty happy. Each district has about 300,000 voters and roughly half vote for each candidate.

I also have data from the CCES survey, which gives me data about a few individual voters in each district. I’m trying to add this to the model by counting up subgroups (e.g., people under 45 and with college degrees) and then setting the predictors to 0% or 100% as appropriate. The CCES survey has data for approx 100 people per district.

I fit each data-set individually and then together and compare the values and 90% confidence intervals of all 3 fits.

My intuition is that adding the CCES data might shift the model parameters slightly or narrow uncertainties but shouldn’t shift things much because it involves so many fewer people. But that’s not quite what I’m seeing. In some cases the parameters seem dominated by the CCES data. So I’m trying to figure out if my intuition is wrong or if that indicates a bug in the model code (included below) somehow.

The actual model uses a few more predictors but works as described above. It’s a bit complicated because of the 2 data-sets. And because it’s also modeling the total number of people voting out of all the people eligible to vote. But the main question remains.

Any thoughts or pointers for where to look for a deeper understanding would be much appreciated!

data {
int<lower = 1> N; // number of districts
  int<lower = 1> M; // number of cces rows
  int<lower = 1> K; // number of predictors
  matrix[N, K] Xe;
  matrix[M, K] Xc;
  int<lower=-1, upper=1> IncE[N];
  int<lower=-1, upper=1> IncC[M];
  int<lower = 0> VAPe[N];
  int<lower = 0> VAPc[M];
  int<lower = 0> TVotesE[N];
  int<lower = 0> DVotesE[N];
  int<lower = 0> TVotesC[M];
  int<lower = 0> DVotesC[M];
transformed data {
int<lower=0> G = M + N;
  matrix[G, K] X = append_row (Xe, Xc);
  int<lower=-1, upper=1> Inc[G] = append_array(IncE, IncC);
  int<lower=0> VAP[G] = append_array(VAPe, VAPc);
  int<lower=0> TVotes[G] = append_array (TVotesE, TVotesC);
  int<lower=0> DVotes[G] = append_array (DVotesE, DVotesC);vector<lower=0>[K] sigma;
  matrix[G, K] X_centered;
  for (k in 1:K) {
    real col_mean = mean(X[,k]);
    X_centered[,k] = X[,k] - col_mean;
    sigma[k] = sd(Xe[,k]);
  matrix[G, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(X_centered)[, 1:K] * sqrt(G - 1);
  R_ast = qr_R(X_centered)[1:K,]/sqrt(G - 1);
  R_ast_inverse = inverse(R_ast);
parameters {
real alphaD;
  vector[K] thetaV;
  real alphaV;
  vector[K] thetaD;
  real incBetaD;
  real <lower=1e-5, upper=(1-1e-5)> dispD;  
  real <lower=1e-5, upper=(1-1e-5)> dispV;
transformed parameters {
real<lower=0> phiV = dispV/(1-dispV);
  real<lower=0> phiD = dispD/(1-dispD);
  vector<lower=0, upper=1> [G] pDVoteP = inv_logit (alphaD + Q_ast * thetaD + to_vector(Inc) * incBetaD);
  vector<lower=0, upper=1> [G] pVotedP = inv_logit (alphaV + Q_ast * thetaV);
  vector [K] betaV;
  vector [K] betaD;
  betaV = R_ast_inverse * thetaV;
  betaD = R_ast_inverse * thetaD;
model {
alphaD ~ cauchy(0, 10);
  alphaV ~ cauchy(0, 10);
  betaV ~ cauchy(0, 2.5);
  betaD ~ cauchy(0, 2.5);
  incBetaD ~ cauchy(0, 2.5);
  phiD ~ cauchy(0,2);
  phiV ~ cauchy(0,2);
  TVotes ~ beta_binomial(VAP, pVotedP * phiV, (1 - pVotedP) * phiV);
  DVotes ~ beta_binomial(TVotes, pDVoteP * phiD, (1 - pDVoteP) * phiD);
generated quantities {
vector[G] log_lik;
  for (g in 1:G) {
    log_lik[g] =  beta_binomial_lpmf(DVotes[g] | TVotes[g], pDVoteP[g] * phiD, (1 - pDVoteP[g]) * phiD) ;
vector<lower = 0>[G] eTVotes;
  vector<lower = 0>[G] eDVotes;
  for (g in 1:G) {
    eTVotes[g] = pVotedP[g] * VAP[g];
    eDVotes[g] = pDVoteP[g] * TVotes[g];
  real avgPVoted = inv_logit (alphaV);
  real avgPDVote = inv_logit (alphaD);
  vector[K] deltaV;
  vector[K] deltaD;
  for (k in 1:K) {
    deltaV [k] = inv_logit (alphaV + sigma [k] * betaV [k]) - avgPVoted;
    deltaD [k] = inv_logit (alphaD + sigma [k] * betaD [k]) - avgPDVote;
  real deltaIncD = inv_logit(alphaD + incBetaD) - avgPDVote;

Hey there!

I just quickly gazed over your model (I’d need more time to understand all the details). But regarding to:

I’d say: yes. Generally data rows with high counts will give you much more certainty in your estimates and this certainty will “dominate” the smaller counts. You could perhaps also say that the counts act as a weight. You could think about this sequentially as updating your prior (although this is not what Stan is really doing): When you see a large dataset first, which sets your prior, another, smaller, dataset will not change your prior much and vice versa. You’d probably need to adjust your priors to reflect your prior belief that both data sources are more “equally informative” than their counts suggest. But that’s up to your domain expertise.

Maybe others have ideas about how to achieve this? (If you don’t get replies we can start tagging people to ask for help.)


Ps: You can use qr_thin_Q and qr_thin_R, which do the thinning for you and are also a bit faster. :)

Thanks! It’s good to know that my intuition is okay here. I don’t necessarily want to make the sources more “equal,” unless I have a principled way to do it.

I think the actual numbers reflect reality: the election result in each district (300k people) should have more “weight” than a survey of 150 people in the district. My hope had been that the survey data would usefully improve some uncertainties in the election result data: e.g., voter preference by sex is hard to glean from election results since every district is pretty close to 50/50 whereas it’s easy to separate in the survey data.
But what I actually see, is that the result of combining the data do not seem consistent with what you and I expect from the difference in the counts. The combined results look more like the survey alone than like the election-results alone. And I am wondering how I might diagnose where my intuition or model has gone wrong.
I could imagine a few things:

  1. My intuition is wrong because, though the data in the survey is sparser, it is also much more specific (the predictors take on a much wider range of values).
  2. My intuition is wrong because of the interaction of a few predictors and all of the de-centering and de-correlating.
  3. I have a bug in the model!

But I’m not sure how best to figure out which it is. Any ideas would be much appreciated!


  • List item

P.S. I’ve mostly eliminated possibility 2, above. I’m re-running with just one predictor and it remains true that when fitting the combined data-sets, the smaller (in terms of total count of voters) still somehow dominates the result.

I’ll note, as much out of total confusion as anything else, that the source with more counts is ~400 rows with counts in the 100k level (300k voters out of 500k voting eligible people, or 150k voters for a particular candidate out of 300k voters) whereas the source with fewer counts comes in as 5000 rows with ~30 voters (and ~15 voting for the given candidate), So it’s smaller in terms of total counts but has more rows. Which shouldn’t matter, right?

Hm, to be quite honest I have no clue what’s going on here. I’m inclined to say it’s

but it also strongly goes against my intuition.

Let’s try our luck and see if @betanalpha has time to share some insights. :)


Yeah. I think this is it as well. The intercept acts as we both think it ought to, much more strongly influenced by the higher count data. But the various predictors don’t, even one at a time. Though the ones where the lower count data cover a wider range of values show this most of all.
I changed the model so that the data sources have separate intercepts–which doesn’t change the issue in the original post. Different intercepts make more sense, particularly for the turnout bit, since that’s computed somewhat differently in the two cases.

I won’t be able to offer any particularly detailed insight here.

As always it’s easier to understand general questions like these by abstracting the problem into as simple a simulation as possible. In particular by working with simulations one could remove any problems with representative sampling entirely and focus purely on the aggregation behaviors. Simpler problems will also make it much easier for others participate without having to invest too much time.

In general aggregation can be tricky if not implemented in a statistically-consistent way. For example using an empirical mean, mean_y ~ normal(mu, sigma), instead of the individual data, y_n ~ normal(mu, sigma), will result in a likelihood function that’s artificially wide even though the empirical mean is a sufficient statistic. Although his may appear a trivial mistake similar mistakes are not uncommon in real models, especially as the sophistication grows. For example when only aggregate data are available isn’t not always possible to analytically integrate the generative model for individual data into a generative model for the aggregate data, and approximate aggregate models can introduce inconsistent behavior when looking at different aggregations.

Another possible issue is that even a large amount of aggregate data might not inform demographic behavior particularly well. In this case even small amounts of individual data can dominate the posterior for demographic parameters and the like. The interplay between the two will depend on just how much demographic variation there is within each aggregate cell, especially relative to the heterogeneity between the cells themselves.

Again simulation studies are a great way to isolate and investigate these behaviors to see if they might be at play in the real model.

Thanks! I think I have a clearer sense of what’s happening which makes it possible to attack this some with simulation, I think. It’s taken me a while to understand it well enough to even know which aspects are important for simulating.

I think I’m okay with your first point since I’m modeling the aggregated data with a beta-binomial using the actual counts. This is actually what led to the question in the first place. I thought this should make the likelihood quite narrow and thus that the rows with much smaller counts would not contribute much.

I think what you address beginning with “Another possible issue…” is what I see. There is much less demographic variation among the high-count aggregates since each congressional district is relatively heterogenous. But the aggregates from the survey, though accounting for many fewer people, are demographically homogenous.

I think I can try to simulate some data and then sample it in a similar way (large heterogenous aggregates and small homogenous aggregates) and see what happens when I fit it. I just needed a clearer sense of what might be happening before I even knew what aspect to simplify down to.

Thanks all!