Using the brms package here, I'm trying to teach myself how priors on the SDs of varying intercepts work. Let a hockey player's chance of scoring on a penalty shot be a function of a Baseline probability, his (scaled) Salary, his Position (defensemen score less), and whether it's a Home or Away game:
BaseProb <- 0.50
# True betas:
B <- c(Baseline  = qlogis(BaseProb),
       Salary    = 1.386294,  # log(4): each salary unit quadruples the odds
       PositionD = -0.2,
       Away      = -0.1)
# Data:
set.seed(4)
d <- data.frame(Player    = rep(LETTERS[1:20], each = 2),
                Salary    = rep(rnorm(20, sd = 0.8), each = 2),
                PositionD = rep(c(0, 1), each = 20),
                Away      = rep(c(0, 1), times = 20),
                Attempts  = round(runif(n = 40, min = 5, max = 20)))
# Model matrix:
M <- matrix(c(rep(1, nrow(d)),
              d$Salary,
              d$PositionD,
              d$Away),
            ncol = 4,
            dimnames = list(NULL,
                            c("(Intercept)", "Salary", "PositionD", "Away")))
# True linear predictor (drop() turns the one-column matrix into a plain vector):
d$lp <- drop(M %*% B)
# True probabilities:
d$trueP <- plogis(d$lp)
# Observed outcomes, based on the true probabilities:
d$y <- round(d$trueP * d$Attempts)
# Take a look at the data
d
Of course, this dataset can be modeled well without varying effects. But I'm trying to observe how, supposedly (Agresti 2015: 289-290), the varying-effect term can represent the effects of unmeasured covariates. Specifically, I'm trying to observe how the group-level SD parameter captures variance originating from predictors that were left out of the model and attributes it, as far as possible, to the clusters. It thereby becomes a kind of dispersion parameter for a response distribution that does not otherwise allow the SD to vary freely (here, the binomial).
In our dataset, the player-specific SD of the log-odds of success appears to be 0.92:
sd(sapply(unique(d$Player),
          function(x) with(d[d$Player == x, ],
                           weighted.mean(lp, w = Attempts))))
That is, I take the SD of the players' attempt-weighted mean log-odds of success as the "true" SD of the player-specific intercepts. Since this appears to be 0.92, we now fit Bayesian models containing only an intercept and a group-level SD for that intercept, trying many different priors and hoping that they recover the true value:
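As a standalone sanity check on where that 0.92 comes from (using nothing beyond the numbers already defined above): each player keeps the same Salary and Position across his two shots, so the between-player spread of the linear predictor is driven almost entirely by those two soon-to-be-omitted covariates, while the Home/Away term roughly averages out within a player.

```r
# Standalone sanity check: where does an SD of about 0.92 come from?
set.seed(4)                     # same seed as the data above
sal <- rnorm(20, sd = 0.8)      # one salary per player (same draw as above)
pos <- rep(c(0, 1), each = 10)  # players A-J are forwards, K-T defensemen
# Player-level part of the linear predictor (constants don't affect the SD):
lp_player <- 1.386294 * sal - 0.2 * pos
sd(lp_player)                   # should land close to the 0.92 computed above
# A sample SD scales exactly with a multiplicative coefficient:
all.equal(sd(1.386294 * sal), 1.386294 * sd(sal))
```

The small discrepancy from 0.92 is the attempt-weighted Home/Away term, which the sketch ignores.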
set.seed(2019)
VarCorr(mix.exp1 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                        prior = prior(exponential(1), class = sd), data = d))
# SD is 0.8144626
VarCorr(mix.exp05 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                         prior = prior(exponential(0.5), class = sd), data = d))
# SD is 0.8139658
VarCorr(mix.uni5 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                        prior = prior(uniform(0, 5), class = sd), data = d))
# SD is 0.8413079
VarCorr(mix.uni10 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                         prior = prior(uniform(0, 10), class = sd), data = d))
# SD is 0.8463509
VarCorr(mix.cau3 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                        prior = prior(cauchy(0, 3), class = sd), data = d))
# SD is 0.8334766
VarCorr(mix.cau6 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                        prior = prior(cauchy(0, 6), class = sd), data = d))
# SD is 0.8347414
VarCorr(mix.norm10 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                          prior = prior(normal(0, 10), class = sd), data = d))
# SD is 0.8379087
VarCorr(mix.norm1 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                         prior = prior(normal(0, 1), class = sd), data = d))
# SD is 0.8063237
VarCorr(mix.norm05 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"),
                          prior = prior(normal(0, 0.5), class = sd), data = d))
# SD is 0.7360365
Every single prior undershoots the true SD. The one that comes closest is the uniform prior between 0 and 10. Seriously, 0 and 10?! The true SD is less than 1, so why is such a small value so hard to reach?
Even stranger to me is that the equivalent frequentist model undershoots even more:
library(lme4)
VarCorr(mixed100 <- glmer(y/Attempts ~ (1|Player), weights = Attempts,
                          family = binomial, nAGQ = 100, data = d))
# SD is 0.74564
I thought a frequentist random effect simply took the cluster-specific MLEs that would arise if the cluster were treated as a fixed effect, computed an overall mean and SD from those, and then shrank the cluster-specific estimates toward the overall mean in inverse proportion to cluster size. On that logic, the estimate of the SD itself should undergo no shrinkage at all. Yet this SD has been shrunk severely.
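To make that mental model concrete, here is a toy sketch of the procedure I had in mind. To be clear, this is my naive picture, not what glmer actually does, and naive_ranef_sd is a made-up name:

```r
# Naive idea: treat each cluster as a fixed effect, take the MLE logits,
# and report their SD, with no shrinkage applied to the SD itself.
naive_ranef_sd <- function(y, n, cluster) {
  p_hat <- tapply(y, cluster, sum) / tapply(n, cluster, sum)  # cluster-wise MLEs
  p_hat <- pmin(pmax(p_hat, 1e-6), 1 - 1e-6)                  # guard against 0/1 proportions
  sd(qlogis(p_hat))                                           # SD of the fixed-effect-style logits
}
# Three toy clusters with 3/10, 9/12, and 5/10 successes:
naive_ranef_sd(y = c(3, 9, 5), n = c(10, 12, 10), cluster = c("A", "B", "C"))
# roughly 0.976
```

Under this logic only the cluster-specific intercepts would be shrunk, never the SD estimate itself.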
Where am I going wrong?
References:
Agresti, Alan. 2015. Foundations of Linear and Generalized Linear Models. Hoboken, New Jersey: John Wiley & Sons.