Best Prior for the Varying-Intercept SD in a Logistic Regression Modeling Penalty Shot Success Rate

Using the brms package here, I'm trying to teach myself how priors on varying-intercept SDs behave. Let a hockey player's chance of scoring on a penalty shot be a function of a Baseline Probability, his (scaled) Salary, his Position (defensemen score less often), and whether it's a Home or Away game:

BaseProb <- 0.50
# True Betas:
B <- c(Baseline = qlogis(BaseProb), 
        Salary = 1.386294, 
        PositionD = -0.2, 
        Away = -0.1)
#Data: 
set.seed(4)
d <- data.frame(Player = rep(LETTERS[1:20], each = 2),
                Salary = rep(rnorm(20, sd = 0.8), each = 2),
                PositionD = rep(c(0, 1), each = 20),
                Away = rep(c(0, 1), times = 20),
                Attempts = round(runif(n = 40, min = 5, max = 20)))
# Model matrix:
M <- matrix(c(rep(1, nrow(d)),
              d$Salary,
              d$PositionD,
              d$Away),
            ncol = 4,
            dimnames = list(NULL, c("(Intercept)", "Salary", "PositionD", "Away")))
# True Linear Predictor:
d$lp <- M %*% B
# True probabilities
d$trueP <- plogis(d$lp)
# Observed outcomes, based on the True Probabilities
d$y <- round(d$trueP * d$Attempts)
# Take a look at the data
d

Of course, this dataset can be modeled well without varying effects. But I'm trying to observe how, supposedly (Agresti 2015: 289-290), a varying-effect term can represent the effects of unmeasured covariates. Specifically, I want to see how the group-level SD parameter absorbs variance originating from predictors that were left out of the model and attributes it, to the extent possible, to the clusters, thereby acting as a kind of dispersion parameter for a response distribution that does not otherwise allow the variance to vary freely (here, the binomial).

In our dataset, the player-specific SD of the log-odds of success appears to be 0.92:

sd(sapply(unique(d$Player), function(x){with(d[d$Player == x,], weighted.mean(lp, w = Attempts))}))

I.e. the SD of the player-specific intercepts is the SD of the players' attempt-weighted mean log-odds of success (their weighted mean linear predictors). Since this appears to be 0.92, we now fit Bayesian models with only an intercept and a group-level SD for that intercept, trying many different priors and hoping that they recover the true value:

set.seed(2019)
VarCorr(mix.exp1 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(exponential(1), class = sd), data = d))
# SD is 0.8144626
VarCorr(mix.exp05 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(exponential(0.5), class = sd), data = d))
# SD is 0.8139658
VarCorr(mix.uni5 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(uniform(0,5), class = sd), data = d))
# SD is 0.8413079
VarCorr(mix.uni10 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(uniform(0,10), class = sd), data = d))
# SD is 0.8463509
VarCorr(mix.cau3 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(cauchy(0,3), class = sd), data = d))
# SD is 0.8334766
VarCorr(mix.cau6 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(cauchy(0,6), class = sd), data = d))
# SD is 0.8347414
VarCorr(mix.norm10 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(normal(0,10), class = sd), data = d))
# SD is 0.8379087
VarCorr(mix.norm <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(normal(0,1), class = sd), data = d))
# SD is 0.8063237
VarCorr(mix.norm05 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(normal(0,0.5), class = sd), data = d)) 
# SD is 0.7360365

Every single prior undershoots the true SD. The one coming closest is the Uniform prior between 0 and 10. Seriously, 0 and 10?!?! The true SD is less than 1. Why is it so hard to reach that low number?

Even stranger to me is that the equivalent Frequentist model undershoots even more:

require(lme4)
VarCorr(mixed100 <- glmer(y/Attempts ~ (1|Player), weights = Attempts, family = binomial, nAGQ = 100, data = d))
# SD is 0.74564

I thought a frequentist random effect simply looked at what the cluster-specific MLEs would be if the cluster were a fixed effect, calculated an overall mean and SD from those, and then shrank the cluster-specific estimates toward the overall mean, with smaller clusters shrunk more. On that view, the estimate of the SD itself should undergo no shrinkage whatsoever. Yet this SD has clearly been shrunk, and severely.
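
To make that mental model concrete, here is roughly the naive calculation I had in mind (just a sketch of my intuition, not a claim about what glmer() actually computes):

# Treat each player as a fixed effect: pool his rows and take the empirical logit
p_hat <- with(d, tapply(y, Player, sum) / tapply(Attempts, Player, sum))
naive_logit <- qlogis(p_hat)
mean(naive_logit)  # naive "overall mean" of the player intercepts
sd(naive_logit)    # naive SD; I expected glmer's sd(Intercept) to land near this value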

Where am I going wrong?


References:
Agresti, Alan. 2015. Foundations of Linear and Generalized Linear Models. Hoboken, New Jersey: John Wiley & Sons.

Hi,
here are my two cents, but I am no expert on this. First, the Bayesian case:

The first thing I find weird is:

Why would you just round the expected counts (trueP * Attempts)? I think you wanted to simulate the number of successes with rbinom()? This might in itself explain the lower SD estimates, as the formula you use reduces the overall variability in your data.

and also:

The SD is a parameter with a whole posterior distribution, not a single number - what is the number you are reporting? Is it the posterior mean?

Second, if I understand things correctly, your posterior mode would be something close to the observed value of 0.92 if your prior were flat and improper (i.e. effectively no prior at all). Compared to that flat prior, every prior you have used favors smaller values over larger ones (even the uniform priors, since they put zero prior mass on values above their upper boundary). Therefore it makes sense that the posterior mode (and possibly also the mean and median) is shrunk towards zero.
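
As a rough illustration (just a quick back-of-the-envelope check, using the priors from your models):

# Ratio of prior density at sd = 1.5 versus sd = 0.5; a flat prior would give exactly 1
dexp(1.5, rate = 1) / dexp(0.5, rate = 1)    # ~0.37: exponential(1) clearly favors smaller SDs
dnorm(1.5, 0, 1)    / dnorm(0.5, 0, 1)       # ~0.37: same for the (half-)normal(0, 1)
dcauchy(1.5, 0, 3)  / dcauchy(0.5, 0, 3)     # ~0.82: cauchy(0, 3) is nearly flat over this range
# uniform(0, 5) is exactly flat inside its support, but puts zero mass above 5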

I however did not really grasp what your code for computing the SD is doing - it would IMHO be better to directly simulate data from the model you are fitting, where the SD would be an explicit parameter.

I think lme4 also does some shrinkage/regularization, but I’ve never really understood how it works, so I am not sure this is the case here.

Finally, a bit of personal feedback that you may or may not take from me: to me your question felt a bit angrily written, as in "Why the hell does your software not work!!". It's likely this was not intentional on your part, but it came across that way and made me think twice before answering.

Many thanks for the feedback, Martin. My apologies to everyone if the post had a bad vibe. Any frustration that might have been in evidence was solely due to my own failure to understand what was going on. No implication was intended that any of the software was not working. Below I respond to your points in order.

You are right that normally we would use rbinom() to simulate the success counts from the binomial distribution, given the true probability and the number of attempts. This time, though, I wanted full control of the data-generating process in order to ensure that no observation consisted of 0% or 100% successes. Thus, in a sense, the "true" success probabilities in this dataset are really the sample proportions, i.e. y/Attempts. However, I think this is a minor detail with regard to the problem.

Yes, sir. It is what brms reports by default when you call VarCorr() on the fit. It turns out that this is indeed the posterior mean.

Yes, this makes sense. That is why I am especially confused that lme4, supposedly frequentist, returns values that are more shrunken than those of Bayesian models with weak priors.

The simulation should perhaps indeed have been made simpler. I added complexity in the hope that a "realistic" example from sports would elicit more responses. Anyway, the Player column represents different players taking penalty shots. The code simulates their success rates as a function of their salary, their position, and whether it's a home or an away game. The number of trials per player and setting was generated randomly, with lower and upper bounds of 5 and 20. Therefore, as you correctly point out, no explicit SD component is used in the data-generating process: the success counts come directly from the fixed-effect probabilities, rounded to whole numbers.

But when those fixed effects are not included in the model, the player-specific random intercepts will supposedly capture as much of that unexplained variation as possible. This is what I tried to model because, if true (see the reference in my first post), it is a feature of random effects that one should be aware of when modeling real data. So there is no explicit "true" random-effect SD specified at the outset. The "true" SD is simply what results when we take each player's attempt-weighted average success rate, convert it to the log-odds scale, and calculate the SD of those player-specific values. This is the SD that both the frequentist model and the Bayesian models (under the priors specified above) underestimate.

This would indeed seem to be the case. It is a shocking revelation to someone brought up in the belief that the only assumption a frequentist multilevel model will ever make is that the random intercepts come from a normal distribution.

I am not so sure. I generally prefer not to trust my intuition much in questions of statistics :-)

Then it's also possible that the mean is not a great summary for your purpose, as the posterior for the SD might be noticeably skewed - maybe check out the median instead. Or, even better, use something like ppc_hist, which would give you a figure letting you compare the observed value with the posterior distribution.
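
Something along these lines, for example (a rough sketch against one of your earlier fits, e.g. mix.exp1 - I have not run it on your objects):

# Look at the whole posterior of the group-level SD, not just its mean
sd_draws <- posterior_samples(mix.exp1, pars = "sd_Player__Intercept")[, 1]
median(sd_draws)                      # posterior median
quantile(sd_draws, c(0.025, 0.975))   # 95% interval
hist(sd_draws, breaks = 40, main = "Posterior of sd(Intercept)")
abline(v = 0.92, lty = "dashed")      # the hand-computed value from your first post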

I would strongly suggest you try to simulate simple data - everything, including the prior and the observation noise. I think it is quite likely that the discrepancy will go away if you do it carefully.
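
For instance, something along these lines (a minimal sketch of what I mean; the numbers are arbitrary placeholders):

library(brms)

set.seed(1)
n_players <- 20
sd_true   <- 0.9   # the group-level SD is now an explicit, known parameter

sim <- data.frame(Player   = rep(LETTERS[1:n_players], each = 2),
                  Attempts = round(runif(2 * n_players, 5, 20)))
# Player intercepts drawn from the model's own assumption: Normal(0, sd_true)
sim$theta <- rep(rnorm(n_players, 0, sd_true), each = 2)
# Binomial observation noise on top
sim$y <- rbinom(nrow(sim), size = sim$Attempts, prob = plogis(sim$theta))

fit <- brm(y | trials(Attempts) ~ (1 | Player),
           family = binomial(link = "logit"),
           prior = prior(exponential(1), class = sd),
           data = sim)
VarCorr(fit)   # compare the estimated sd(Intercept) with sd_true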


Thanks Martin.

In the latest development I set the prior to uniform(0, 1000) and increased the number of posterior draws. Nothing changed:

mix.uni1000 <- brm(y | trials(Attempts) ~ (1|Player), family = binomial(link = "logit"), prior = prior(uniform(0,1000), class = sd), data = d, warmup = 1000, iter = 9000)
plot(density((posterior_samples(mix.uni1000)[,2])))
abline(v = mean(posterior_samples(mix.uni1000)[,2]), lty = "dashed")
text(0.9, 0.06, labels = paste0("Mean: ",as.character(mean(posterior_samples(mix.uni1000)[,2]))))

I’m starting to doubt my own calculations of the ‘true’ value, because surely a prior like this, especially with this many draws, should guarantee convergence on the correct value.

Wrong.

library(tidyverse)
library(brms)

set.seed(4)

df <- tibble(player = rep(LETTERS[1:20], each = 2),
             salary = rep(rnorm(20, sd = 0.8), each = 2),
             position_d = rep(0:1, each = 20),
             away = rep(0:1, times = 20), 
             attempts = round(runif(40, 5, 20))) %>%
  mutate(lin_pred = 1.386294*salary - 0.2*position_d - 0.1*away,
         prob = plogis(lin_pred),
         y = round(attempts*prob),
         y2 = map2_dbl(.x = attempts, .y = prob, ~rbinom(1, .x, .y)))

With improper flat priors:

 Family: binomial 
  Links: mu = logit 
Formula: y | trials(attempts) ~ (1 | player) 
   Data: df (Number of observations: 40) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~player (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.83      0.20     0.51     1.28       1266 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     0.24      0.21    -0.18     0.67       1230 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

Compared to y2:

 Family: binomial 
  Links: mu = logit 
Formula: y2 | trials(attempts) ~ (1 | player) 
   Data: df (Number of observations: 40) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~player (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.93      0.21     0.59     1.41       1262 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     0.28      0.23    -0.18     0.74       1303 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

y2 with a reasonable \text{Exponential}(1) prior on \sigma_\text{player}:

 Family: binomial 
  Links: mu = logit 
Formula: y2 | trials(attempts) ~ (1 | player) 
   Data: df (Number of observations: 40) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~player (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.89      0.19     0.58     1.31       1158 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     0.29      0.22    -0.14     0.75       1208 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

… and y2 with the essentially flat \text{Cauchy}(0,3) prior on \sigma_\text{player}:

 Family: binomial 
  Links: mu = logit 
Formula: y2 | trials(attempts) ~ (1 | player) 
   Data: df (Number of observations: 40) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~player (Number of levels: 20) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.93      0.21     0.58     1.40        912 1.00

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     0.28      0.24    -0.18     0.76       1220 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

That being said, there is still a lot that seems weird. I’m not quite sure what you are trying to achieve. You might want to read up on fake data simulation and prior predictive checks. Basically, what @martinmodrak said.

Thanks for the input, Max.

Ultimately, I’m trying to simulate (and thereby understand) the exact way in which random effects work and how their shrinkage is done. As stated in the 1st post, the random effect supposedly functions as a kind of dispersion parameter in a logistic model, i.e. it captures the between-cluster variance that is due to unobserved predictors. That is why the simulated dataset generates the outcome vector on the basis of ‘true’ predictors, then leaves those ‘true’ predictors out of the regression model so that the random effect captures their effects instead.
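
Concretely, here is the comparison I have in mind (just a sketch with brms defaults; I have not tuned priors or iterations):

# With the "true" predictors included, there should be little between-player variance
# left for the varying intercept to absorb
fit_with <- brm(y | trials(Attempts) ~ Salary + PositionD + Away + (1 | Player),
                family = binomial(link = "logit"), data = d)
# With the predictors omitted, their between-player effects should get soaked up by sd(Intercept)
fit_without <- brm(y | trials(Attempts) ~ (1 | Player),
                   family = binomial(link = "logit"), data = d)
VarCorr(fit_with)
VarCorr(fit_without)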

My main reason for generating the responses directly as round(attempts*prob) is that randomization via rbinom() often produces empirical success rates of 0% or 100% for some clusters. Those translate to infinite logits, and the math of shrinking infinite estimates flies way over my head.
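
For example:

qlogis(c(0, 0.5, 1))
# -Inf  0  Inf: a cluster scoring on 0% or 100% of its attempts has an infinite empirical logit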

Your simulation code is actually very convenient because it happens to produce finite success rates even using rbinom():

sd(qlogis(tapply(df$y, df$player, sum) / tapply(df$attempts, df$player, sum)))
# True SD is 0.92 for directly generated y
sd(qlogis(tapply(df$y2, df$player, sum) / tapply(df$attempts, df$player, sum)))
# True SD is 0.99 for rbinom()-generated y

This being the case, I don't see a qualitative difference in the results due to the way the outcomes were generated. Pooling each player's attempts, the empirical logits of df$y have a "true" SD of 0.92, and every model underestimates it regardless of the prior. For y2 this quantity is 0.99, and every model underestimates it as well, regardless of the prior. This is the problem that I'm trying to solve.

Side note: I don't know how to specify flat improper priors (you left that bit out of your code chunk), so I used uniform(0, 10000) as an "improper flat" prior for the SD in my own models fit to your data. The results were virtually identical to yours, though.