Specifying and hypothesis testing with a sort of hierarchical beta-binomial model for ants

I’m relatively new to Stan: I’ve done a few basic tutorials and have used Stan-interfacing packages in R, such as brms, etc. A while ago, I asked a question about a beta-binomial-based model for ant preferences on stackexchange and (perhaps understandably) didn’t get any answers. I tried my best to do as much as possible with what I could find online, and after that, came here. I’ve refined my questions, so here goes.

General goal: I want to know whether ants of a certain species have individual preferences in terms of which types of food they bring back to the nest, or if they just get want the colony needs.

Set-up: Suppose there are N ant colonies of a certain species, where colony i (where i \in \big\{1,2,..., N\big\} ) has N_i worker ants. Ant j_i (where j_i \in \big\{1,2,..., N_i\big\} ) goes on N_{ji} trips, where on each trip it decides to bring back either Apples or Bananas (a binary choice) to the colony.

Given a record of individual ants’ trips (across different colonies), I want to see how likely it is that ants of a certain species have specialized preferences (i.e., some ants get Apples a lot, while others go for Bananas mostly) or whether they all just get whatever their colony wants in general.

My approach: I’m assuming that the number of times Apples are brought back by an ant j_i is sampled from Binomial(N_{ji}, p_{ji}), where p_{ji} is ant j_i's ‘preference’ toward Apples as opposed to Bananas. I’m assuming (for one of my hypotheses) that p_{ji} \sim Beta(\mu_i, t_i), (parameterized with \mu and \nu) where \mu_i is the colony’s ‘preference’ for Apples, and t_i is the total number of trips taken by ants in colony i.

In order to make a model that can use data from multiple colonies simultaneously, I connect the colonies’ preferences (the \mu_i\mathrm{s}) via a hyper-parameter distribution Beta(\mu_{species}, \mathrm{var}) (parameterized with \mu and variance) and where \mu_{species} is the overall species preference for Apples and \mathrm{var} is the variance.

Alternative hypothesis: The alternative hypothesis—i.e., if individual ants don’t have preferences—would be that every time an ant goes on a trip, it basically flips a coin weighted according to the colony’s preference \mu_i. I.e., the number of times ant j_i brings back Apples is sampled from Binom(N_{ji},\mu_i) instead.

Additional (optional) goal: If possible, I would like to be able to assign a particular value to a species to indicate how “specialized” its ants are, something like the “Polarization Index” mentioned in this article by Rungie, Brown, Laurent & Rudrapatna (2005). Essentially, since the highest variance of beta distribution is bounded, the Polarization Index is “a standardized form of the variance: the variance divided by its upper bound”, where 0 is when all the decision makers (i.e., ants) are making decisions at random, and 1 is when they all are completely constant.


  1. Does my approach make sense? I’m particularly worried about how my model incorporates data from multiple colonies—I want to test hypotheses about the species, and I’m not sure how my model can do this.

  2. Is my Stan model (below) written correctly? I’ve included my best attempt at a Stan model of the generative process I’ve just described, as well as sample data that might go in. Any advice on making it better would be appreciated!

  3. How do I hypothesis test with this model? I’m pretty much in the dark here. My instinct would be to do a Bayes factor with the probability that the current model is true given the data vs. the alternative hypothesis model is true given the data. Is that kosher? How do I do that?

  4. (Optional) Can I get a ‘specialization’ metric from the data? The ideal scenario for me would be to be able to take in ant trip data and output a metric of how “specialized” an ant species’ workers are. Preferably, this would be something like the “Polarization Index” I mentioned earlier, but I’m not sure if that fully captures the intuition I’m going for and I’m not sure how I would aggregate data from multiple colonies into a single value.

Any help on anything would be appreciated—I’m at the very limits of my understanding for this project!

Below are what I’ve written for my model and some toy data to show how the data would be structured:


Here’s some toy data in the ‘expected’ format:

[3,  // number_of_colonies
 [2, 1, 3], // ants_per_colony
 6, // number_of_ants (technically, redundant)
 [90, 17, 4, 1, 100, 33],  // trips_per_ant
 [80, 0, 2, 1, 25, 33]  // apples_per_ant

The ants that made 90 and 17 trips belong to the first colony, the one that made 4 trips to the second, and the 1, 100, and 33-trip ants belong to the last colony (as indicated by ants_per_colony).

Here’s my Stan code:

data {
  int<lower=1> number_of_colonies; // # total number of ant colonies
  int<lower=1> ants_per_colony[number_of_colonies]; // # total number of ants per individual colony
  int<lower=1> number_of_ants; // # total number of ants
  int<lower=1> trips_per_ant[number_of_ants]; // # number of trips per ant, in order of colony
  int<lower=0> apples_per_ant[number_of_ants]; // # total number of apples returned per ant, in order of colony

// # This is just a way of assigning a colony to each ant via ants_per_colony and trips_per_colony:
// # Essentially, it just gets the start and stop indices in the ant list of each colony
transformed data {
  int<lower=1> ant_index_start[number_of_colonies];
  int<lower=1> ant_index_end[number_of_colonies];
  vector<lower=1>[number_of_colonies] trips_per_colony;
  if (number_of_colonies == 1) {
    ant_index_start[1] = 1;
    ant_index_end[1] = ants_per_colony[1];
  } else {
    int c;
    c = 1;

    for (x in 1:number_of_colonies) {
      ant_index_start[x] = c;
      c = c + ants_per_colony[x];
      ant_index_end[x] = c - 1;
  for (x in 1:number_of_colonies) {
    trips_per_colony[x] = sum(trips_per_ant[ant_index_start[x]:ant_index_end[x]]);

parameters {
  real<lower=0,upper=1> hyper_mean;
  real<lower=0> hyper_variance;

  vector<lower=0,upper=1>[number_of_colonies] colony_mean;
  vector<lower=0,upper=1>[number_of_ants] ant_preferences;

// # Re-parametrizing parameters from things I care about to alphas and betas
transformed parameters {
  vector<lower=0>[number_of_colonies] colony_alpha;
  vector<lower=0>[number_of_colonies] colony_beta;
  real<lower=0> colony_hyper_alpha;
  real<lower=0> colony_hyper_beta;
  colony_alpha = colony_mean .* trips_per_colony;
  colony_beta = (1 - colony_mean) .* trips_per_colony;
  colony_hyper_alpha = hyper_mean * (hyper_mean * (1 - hyper_mean) / hyper_variance - 1);
  colony_hyper_beta  = (1 - hyper_mean) * (hyper_mean * (1 - hyper_mean) / hyper_variance - 1);

model {
  // Colonies' mean preferences for apples are connected via hyper-parameters
  colony_mean ~ beta(colony_hyper_alpha, colony_hyper_beta);
  for (i in 1:number_of_colonies) {
    for (j in ant_index_start[i]:ant_index_end[i]) {
	  ant_preferences[j] ~ beta(colony_alpha[i], colony_beta[i]);
  apples_per_ant ~ binomial(trips_per_ant, ant_preferences);

your question really is a lot, so I’ve only parsed it on a surface level, sorry if I missed something important. The model you propose ends up being similar (but almost certainly not mathematically equivalent) to a generalized linear model (GLM). GLMs are a good starting point for your modelling as you can use rstanarm or brms which will do a lot of work for you. Since you are currently not confident in your modelling and Stan skills, I would advise against developing your own model, before trying GLM. You should however test the GLM (e.g. via prior or posterior predictive checks) and if it seems to be a bad fit, try to investigate why. Only once you understand what is the problem with a GLM, work on your own model.

Your task seems to be answerable with a GLM of the form (brms formula, Bernoulli family):

is_apples ~ 1 + (1 + ant || colony) + (1 || species)

(hope I did not mess up - I am NOT testing this in actual code). Here, I assume that colony,ant and species are factors. The fitted coefficients can be interpreted as increasing/decreasing the log odds of picking apples which is the same as multiplying the odds by exponentiated coefficients. The exponentiated coefficients then correspond to odds ratios. If specific ants have preferences you would expect that the posterior for some of the ant-specific coefficients would get meaningfully far from zero and sd(ant) (hyperaparameter - sd for the coefficients) would be large. Conversely, if there is little per-ant preference you would expect almost all ant coefficients to be close to zero and sd(ant) to be low. sd(ant) would be something related to the specialization metric you are interested in.

Note that you can’t really do hypothesis tests with Bayesian model (at least not in the classical sense). What you can do instead is to set a boundary, e.g. odds ratio of 2 is a notable difference (is it? depends on you questions). You can then compute the posterior probability of statements as “at least 50% of ants have preference that is notably different from the colony’s preference”.

Alternatively, you could also fit a model that does not include the ant term and compare which is a better fit to the data via loo or similar. But this will answer a different question