Beta-Binomial model with rethinking package and dispersion for each interaction


Hey all,

I am very new to Stan (and modelling in general). I have been following McElreath’s classes + reading his book to get an introduction, and I’ve tried applying a beta-binomial distribution to real life samples. The samples are very overdispersed and I would like to model the overdispersion by group-at least to see on which occasion the overdispersion is larger.

Using the tadpoles example (m12.1), I applied it to my data with this code:

m12.1 <- ulam(
    alist(
        P ~ dbetabinom( N , pbar , theta ),
        logit(pbar) <- a[TC_id],
        a[TC_id] ~ dnorm( 0 , 1.5 ),
        theta <- phi[TC_id] + 2.0,
        phi[TC_id] ~ dexp(1)
    ), data=dat , chains=4
    )

where:

  • P are counts.
  • N is the total sample size.
  • TC_id indicates an interaction between two categorical variables (1 <= TC_id <= 8). For the reason mentioned above, ideally, I wanted to get theta results for each TC_id. The following code however does not work:
22:      P ~ beta_binomial( N , pbar*theta , (1-pbar)*theta );
Ill-typed arguments supplied to infix operator *. Available signatures: [...] Instead supplied arguments of incompatible type: vector, vector.

I apologise if the code is poorly written, I have just started approaching these topics. The model works fine if phi is not followed by [TC_id]. I have also tried using something like vector[8]:, but I am still getting an error.

Thank you in advance for your help!


The error happens because that is not the correct code for beta_binomial, the multiplication (character *) is not defined when pbar is a vector and theta is a vector. If you look here, \text{BetaBinomial} has three parameters. According to the book, endnote 183, the author doesn’t use \alpha and \beta, he uses \bar p and \theta to parametrize dbetabinom. You can see that if you execute stancode(m12.1):

model{
     vector[12] pbar;
    phi ~ exponential( 1 );
    a ~ normal( 0 , 1.5 );
    for ( i in 1:12 ) {
        pbar[i] = a[gid[i]];
        pbar[i] = inv_logit(pbar[i]);
    }
    // Different parametrization, it will be always like that
    A ~ beta_binomial( N , pbar*theta , (1-pbar)*theta );
}

If you want to take into account some variation on theta, you have to code the model in Stan. For an “easy” example, look at this code:

library(rethinking)
set.seed(24)
pbar <- 0.04
theta <- 7.5
n_times <- 200
t1 <- rbetabinom(n_times, 50, pbar, theta)
t2 <- rbetabinom(n_times, 75, pbar + 0.1, theta + 10)
t3 <- rbetabinom(n_times, 100, pbar + 0.2, theta + 20)
t4<- rbetabinom(n_times, 125, pbar + 0.3, theta + 30)
A <- c(t1, t2, t3, t4)
TC_id <- c(replicate(n_times,1), replicate(n_times,2), replicate(n_times,3), replicate(n_times,4))
N <- c(replicate(n_times,50), replicate(n_times,75), replicate(n_times,100), replicate(n_times,125))
data = list(N=N, A=A, Tc_id=TC_id, J=n_times*4)

The goal is to recover pbar and theta. The code for the model in Stan (save it in a file called beta_binomial_example.stan):

data{
    int<lower=0> J;
    array[J] int N;
    array[J] int A;
    array[J] int Tc_id;
}
parameters{
     vector[4] a;
     // real<lower=0> phi;
     array[4] real<lower=0> theta;
}
model{
    vector[J] pbar;
    a ~ normal(0, 1.5);
    theta ~ normal(50, 30);
    for ( i in 1:J ) {
        pbar[i] = a[Tc_id[i]];
        pbar[i] = inv_logit(pbar[i]);
    }
    
    vector[J] theta_l;
    vector[J] pbar_l;
    vector[J] beta_c;
    for ( i in 1:J ) {
        theta_l[i] = theta[Tc_id[i]];
        pbar_l[i] = pbar[i]*theta_l[i];
        beta_c[i] = (1 - pbar[i]) * theta_l[i];
    }
    target += beta_binomial_lupmf(A | N, pbar_l, beta_c);
}

Run the model:

fit1 <- stan(
  file = "beta_binomial_example.stan",  # Stan program
  data = data,    # named list of data
  chains = 4,             # number of Markov chains
  warmup = 1000,          # number of warmup iterations per chain
  iter = 2000,            # total number of iterations per chain
  cores = 4,              # number of cores (could use one per chain
)

And see the results:

precis(fit1, depth = 2)
post2 <- extract.samples(fit1)
colMeans(logistic(post2$a))
# [1] 0.0343659 0.1394121 0.2458895 0.3401646
colMeans(post2$theta)
# [1]  8.623946 19.007324 30.898828 43.191547

Since you said that you were new to Stan, maybe all of this is relatively hard, I suggest to see the output of each line in R. For the Stan code, it is just a modification of m12.1 without using phi and putting a prior for theta. For the last line, look here and here.

1 Like

Dear @rosgori thank you so much! This is a very thorough response which was really helpful. I confess I am still trying to understand every line (especially at the end), but this is very kind of you.

I did notice that assigning your prior to theta increases the precision as well:

  • with the exponential(1) prior a value is 8.1
  • with theta ~ normal(50, 30) the value is 24.

Again, thank you!