Modeling heteroskedasticity using the auxiliary parameter phi of the beta_binomial distribution

I am working with high-resolution social network data that recorded interactions every second.

I want to look at what predicts the association rate on the following day during the breeding season.

My response variable is the male-female association rate.

My predictors are:
1) status - male status (territorial or bachelor)
2) male_degree - the number of male-male interactions (from the previous day)
3) non_focal - female gregariousness with non-focal males (from the previous day)
4) day - day of the breeding season
5) an interaction term between predictors 1 & 3
6) a multi-membership random effect to account for dyadic dependency
Model: bf(interactions | trials(sampling_period) ~ status + day + male_degree + non_focal + status:male_degree + (1 | mm(male_id, female_id)),
phi ~ male_degree + non_focal)

My network is very sparse and predictors 2 & 3 contain a lot of 0's. Because of this I used a beta_binomial distribution and modeled the dispersion with male_degree & non_focal (i.e., phi ~ male_degree + non_focal) to allow for heteroskedasticity.

My question is how to interpret the phi output for both the male_degree & non_focal predictors.

[image: posterior summary table of the fitted model, including the phi coefficients]

@paul.buerkner
@avehtari

As described in this vignette, Define Custom Response Distributions with brms, for the beta binomial the binomial probability is not predicted directly but rather assumed to follow a beta distribution with parameters mu and phi. Phi is the dispersion parameter. You can see how a change in phi changes the beta distribution when holding mu constant, using code like the below:

n <- 1e4  # number of random draws used in each illustration below
mu <- 0.5
phi <- 0.1
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
d <- rbeta(n, shape1, shape2)
hist(d, breaks=100, xlim=c(0,1))

mu <- 0.5
phi <- 1
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
d <- rbeta(n, shape1, shape2)
hist(d, breaks=100, xlim=c(0,1))

mu <- 0.5
phi <- 10
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
d <- rbeta(n, shape1, shape2)
hist(d, breaks=100, xlim=c(0,1))

mu <- 0.5
phi <- 100
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
d <- rbeta(n, shape1, shape2)
hist(d, breaks=100, xlim=c(0,1))

As phi becomes larger, the dispersion decreases and the distribution becomes more concentrated.
In brms, when you model phi in a distributional model, the log link is used for phi. Thus, for your results, it appears that for every unit increase in non_focal there is a -9.5 (-18.0 to -1.2) change in phi on the log scale (note that the credible interval for this estimate is very wide, and since this is on the log scale that effect size is quite large… maybe a lack of data?). For every unit increase in male_degree there is a 0.7 (0.2 to 1.1) change in phi on the log scale. Your phi_Intercept is the value of phi on the log scale when non_focal and male_degree are zero. For any combination of values of non_focal and male_degree you can get the estimated phi. For example, say non_focal = 2 and male_degree = 1; then you can get the estimate of phi:

s1 <- as_draws_df(model)  # posterior draws from the fitted model
# phi uses a log link, so sum the relevant coefficients and exponentiate
phi <- exp(s1$b_phi_Intercept + 2*s1$b_phi_non_focal + 1*s1$b_phi_male_degree)
mean(phi)
quantile(phi, c(0.025, 0.975))

You could make a plot to visualize how phi changes with different values of the predictors if you wanted, which would inform you how the dispersion of the binomial probability changes with different values of the predictors.
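For example, here is a rough sketch of one way to do that (assuming your fitted brmsfit object is called model, and showing phi as a function of non_focal with male_degree held at 0):

s1 <- as_draws_df(model)                      # posterior draws
non_focal_grid <- seq(0, 5, length.out = 50)  # adjust the range to match your data
phi_mean <- sapply(non_focal_grid, function(nf)
  mean(exp(s1$b_phi_Intercept + nf * s1$b_phi_non_focal + 0 * s1$b_phi_male_degree)))
plot(non_focal_grid, phi_mean, type = "l",
     xlab = "non_focal", ylab = "estimated phi (male_degree = 0)")

brms can also plot predicted distributional parameters directly; something like conditional_effects(model, effects = "non_focal", dpar = "phi") should give a similar picture.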

@jd_c

Ok, so let me see if I understand this correctly.

So what this result is saying is that most of the dispersion is coming from the non_focal parameter, so it puts a bigger phi on non_focal than on male_degree in order to concentrate your posterior distribution?

So, for example, if you put in a parameter that returned a phi value of around 0, this would indicate that the data for this distribution isn't dispersed, so there is no need to further concentrate your data?

Is this correct @jd_c?

No, not quite. A larger phi means less dispersion. A phi closer to zero (some small fraction) means greater dispersion. On the log scale (the output of brms), a smaller phi corresponds to more negative values and a larger phi to more positive values (for example, a really small phi might be -5 and a really large one 5, since it is on the log scale). In your example, non_focal contributes to making phi smaller and thus contributes to more dispersion.
If you are modeling phi, and the parameter for some predictor is around zero, then this simply means that the predictor isn’t contributing much to predicting the dispersion parameter. It doesn’t mean that there is no dispersion.
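To make the log scale concrete, you can exponentiate a few values:

exp(c(-5, 0, 5))  # about 0.0067, 1, and 148 on the natural scale
# exp(-5) is a tiny phi (a lot of dispersion); exp(5) is a large phi (very little dispersion)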

I wrote a simulation with notes, as this might help you understand what is going on. You can easily run the code for yourself. When you simulate data, it makes it explicit how the data generation process works. Notice what phi is actually doing and how it relates to the binomial part of the model.

library(brms)


#Let's program 2 simulations: 1) phi very small (close to zero) -- when phi is really small, there is a large amount of dispersion;
#and 2) phi very large -- when phi is really large, there is not much dispersion

#To simulate from beta binomial, you provide a model for mu and phi and then generate p from the beta distribution. Then, you take
#the p that you generated and use it in a binomial distribution to generate the outcome.

#simulation 1, very small phi
#small value for phi, binomial vs beta binomial model
n <- 500	#number of observations
x <- rnorm(n, 0, 1)	#predictor x, simply a centered and scaled predictor
mu <- plogis(-1 + 0.5*x)	#model for mu, on the logit scale as in brms
phi <- exp(-5 + 0.25*x)		#model for phi, on the log scale as in brms
shape1 <- mu * phi	#shape 1 for the beta distribution in R
shape2 <- (1 - mu) * phi	#shape 2 for the beta distribution in R
p <- rbeta(n, shape1, shape2)	#simulate p from beta distribution using shape 1 and shape 2 parameters that we got from our mu and phi models
tr <- rpois(n, lambda=100)	#simulate some number of trials for the binomial distribution
y <- rbinom(n, tr, p)	#simulate our outcome y from a binomial distribution with our number of trials and our p from the beta distribution
d1 <- cbind.data.frame(y, tr, x)	#bind them all in a dataframe
hist(y/tr, breaks=100)	#view our outcome as a proportion of outcome y divided by trials. Note this is very dispersed! Most values are near zero or one

#run a binomial and a beta binomial model in brms
m.b <- brm(y | trials(tr) ~ 1 + x, family=binomial, data=d1, cores=4)
m.bb <- brm(bf(y | trials(tr) ~ 1 + x, phi ~ 1 + x), family=beta_binomial, data=d1, cores=4)
m.b
m.bb

#similar values for Intercept and x, which is the mean p on the logit scale
#this can be visualized in posterior predictive checks below, where the mean is similar for both the binomial and the beta binomial models
pp_check(m.b, type='stat')

pp_check(m.bb, type='stat')

#However, since the binomial does not have the dispersion parameter, it cannot model the extreme overdispersion,
#so while the mean is correct, the distribution of p (and thus the y outcome) isn't modeled correctly. 
#You can see this in posterior predictive checks with histograms or density plots. The binomial is really bad, but the beta binomial is great.
pp_check(m.b, type='hist')

pp_check(m.bb, type='hist')

pp_check(m.b)

pp_check(m.bb)



#simulation 2, large phi
#now let's look at similar simulation but instead have a large value for phi; binomial vs beta binomial model
n <- 500
x <- rnorm(n, 0, 1)
mu <- plogis(-1 + 0.5*x)
phi <- exp(5 + 0.25*x)	#this time we make phi large by making the intercept for phi on the log scale, large and positive
shape1 <- mu * phi
shape2 <- (1 - mu) * phi
p <- rbeta(n, shape1, shape2)
tr <- rpois(n, lambda=100)
y <- rbinom(n, tr, p)
d2 <- cbind.data.frame(y, tr, x)
hist(y/tr, breaks=100)	#note this is much less dispersed than the example with small phi

m.b2 <- brm(y | trials(tr) ~ 1 + x, family=binomial, data=d2, cores=4)
m.bb2 <- brm(bf(y | trials(tr) ~ 1 + x, phi ~ 1 + x), family=beta_binomial, data=d2, cores=4)
m.b2
m.bb2

#Again, we have almost the same estimates of Intercept and x for both models, which is the model for the mean
#As expected posterior predictive checks for the mean look similar for both models
pp_check(m.b2, type='stat')

pp_check(m.bb2, type='stat')

#Now let's look at the distribution of the outcome y and compare the two models. Notice that unlike in the first 
#example when phi was really small, now the binomial is actually not bad looking. It looks pretty similar to the beta-binomial.
#Why is that? It is because phi is really large, and as phi gets larger and larger, the beta binomial approaches the binomial distribution.
pp_check(m.b2, type='hist')

pp_check(m.bb2, type='hist')

pp_check(m.b2)

pp_check(m.bb2)
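One more quick illustration of that point (my own addition, separate from the two simulations above): hold mu fixed, increase phi, and watch the spread of y/tr shrink toward that of a plain binomial.

mu <- 0.3; tr_n <- 100; n <- 1e4   # tr_n is a fixed number of trials just for this illustration
for (phi in c(1, 10, 100, 1000)) {
  p <- rbeta(n, mu * phi, (1 - mu) * phi)  # beta draws for the binomial probability
  y <- rbinom(n, tr_n, p)                  # beta-binomial draws
  cat("phi =", phi, " sd(y/tr) =", round(sd(y / tr_n), 3), "\n")
}
cat("plain binomial sd(y/tr) =", round(sd(rbinom(n, tr_n, mu) / tr_n), 3), "\n")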

@jd_c

Ok, so non_focal contributes to making phi smaller (more dispersion) and male_degree contributes to making phi larger (less dispersion).

So what the output is telling me is that non_focal is the only predictor I need to model dispersion for, but I don't need to for male_degree.

For example, if I were only modeling dispersion for male_degree, then there would be no need, since phi is large and the binomial distribution should be fine, because as phi gets larger it approaches the binomial distribution.

Thank you so much @jd_c, the simulation was very insightful.

Yes.

No, that is not correct. male_degree in your model appears to be associated with the dispersion parameter, so why would you remove it? As I mentioned in my first response, non_focal, while strongly negative, also has a very wide credible interval, so it is very uncertain what the value is. Maybe there is a lack of data? In any case, the coefficients for the predictors of phi aren't telling you whether or not to use a beta binomial, or even whether you want to include those predictors in phi (maybe you want to know how they influence phi, because that is part of some generative modeling assumptions).
If you are wondering whether or not you need the dispersion parameter at all, then that information comes from the data generation process. Think about the data generation process and whether or not those predictors should be included. Use posterior predictive checks. You can run a binomial model and a beta binomial and compare posterior predictive checks, as I did in the simulation. You can also run a beta binomial model without the predictors for phi (i.e. just an intercept for phi) and compare. You can also compare models using tools like LOO.
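As a rough sketch using the simulated models above (swap in your own fits), you might compare a plain binomial, a beta binomial with an intercept-only phi, and the full model:

m.bb0 <- brm(y | trials(tr) ~ 1 + x, family = beta_binomial, data = d1, cores = 4)  # constant phi
loo_compare(loo(m.b), loo(m.bb0), loo(m.bb))

The model with the highest expected log predictive density is listed first; differences that are small relative to their standard errors mean the models predict about equally well.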

You’re welcome! You might want to run it yourself if you haven’t, and you can play around with changing phi. That might help you understand it better.

@jd_c
One last thing, sorry.

I ran the simulation and noticed that the pp_checks from the simulation looked very different from those for my model.

Should I be worried? See the plots below.

@jd_c

These plots are posterior predictive checks of your actual model and data?
If so, they don’t look that bad to me. They seem to do a decent job of capturing the mean and distribution of the outcome.
Remember that in the simulation, the true data generation process is programmed into the model (in the case of the beta binomial), so the posterior predictive checks should look pretty great. In the case of your actual data, you can't know the true data generation process (or have access to all of the data needed to model it), so your model won't look as perfect as in the simulated example I provided. That was just a tool to understand how the beta binomial works.

@jd_c
Yes, these are from my actual data.

Great. So what posterior predictive checks would you recommend for assessing overdispersion?

-Hunter

The histograms and density plots you show are fine.
If you want to see the effect of including a dispersion parameter vs not, then you can simply compare these plots to the same type of posterior predictive check plots from running just a binomial model on the same data.
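For example, a sketch (fit_binomial and fit_betabinomial are placeholders for however you named your two fits); the "sd" statistic is a direct way to see whether each model reproduces the spread of the observed outcome:

pp_check(fit_binomial, type = "stat", stat = "sd")     # a binomial will understate the spread if the data are overdispersed
pp_check(fit_betabinomial, type = "stat", stat = "sd")
pp_check(fit_betabinomial, ndraws = 100)               # density overlay as another view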

@jd_c
Last question, and again I really appreciate your help.

What exactly am I looking for in the histogram plots, specifically in mine, that indicates there is no overdispersion?

When running your simulation it's clear that the beta-binomial distribution is better because the histograms of the generated data in the pp_check look like the observed data, but I don't see a clear pattern when looking at mine.

I’ll quote from Graphical posterior predictive checking — PPC-overview • bayesplot
“The idea behind posterior predictive checking is simple: if a model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”

And here are the plots from the simulation (first example with very small phi):

[images: pp_check histogram plots for the binomial (m.b) and beta-binomial (m.bb) models from simulation 1]
The data histogram (dark, upper left in each plot) shows that the outcome y is basically either near zero or somewhere around 100. Since the number of trials was ~100, that means that when converted to a proportion, this is either near zero or one. Extremely dispersed, if you think about a beta distribution of p. The binomial model sort of splits the difference between these two extremes and comes up with a mean value, but the generated data looks nothing like the actual observed data. The beta binomial actually captures what is happening by modeling the extremes; as you can see, it generates data that looks quite similar to the observed data (most values near zero or near 100).

Your posterior predictive checks on your data seem to cover the range of values pretty well and look reasonably similar to the observed data. Try running just a binomial model, brm(bf(interactions | trials(sampling_period) ~ status + day + male_degree + non_focal + status:male_degree + (1 | mm(male_id, female_id))), family = binomial, data = data), and then look at posterior predictive checks. Your binomial may generate data that is a lot more concentrated than the actual data if phi for most of the data is pretty small. If phi is large, then it might not look a lot different from the beta binomial (as I show in the 2nd example in the simulation, with posterior predictive checks shown below, there isn't a lot of discernible difference).
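Something like this, as a sketch (your_data and m_betabin are placeholders for your own data frame and your existing beta-binomial fit):

m_bin <- brm(interactions | trials(sampling_period) ~ status + day + male_degree + non_focal +
               status:male_degree + (1 | mm(male_id, female_id)),
             family = binomial, data = your_data, cores = 4)
pp_check(m_bin, type = "hist")
pp_check(m_betabin, type = "hist")  # your existing beta-binomial fit
pp_check(m_bin)
pp_check(m_betabin)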