Error in #times a player goes to bat (baseball binomial example)

Bob Carpenter has a great demo using the baseball batting average data from Efron & Morris’ (1975) classic paper–from which I borrow heavily below. But I remember Bob (or was it Andrew Gelman?) also saying something somewhere about how modeling batting averages like this isn’t ideal because it’s not respecting that the number of times a player comes to bat is itself a random variable. That is, if y/K is a player’s batting average (out of K number of times at bat), there’s error in “K.” If so, how would one incorporate that error in stan?

The code below doesn’t work but you get the idea. I want to sample AB as poisson(K) where AB ought never drop below the number of hits (y). My question is: is this problem an example of stan’s fundamental constraint that it can’t sample integer parameters? Or is there actually a way to modify the below to respect the error in K?

Using rstan…

data(bball1970, package = "rstanarm")
bball_data <- list(N=18, y=bball1970$Hits, K=bball1970$AB)

hier.stan_code2 <- '
data { 
  int<lower=0> N;           // items 
  int<lower=0> K[N];        // initial trials 
  int<lower=0> y[N];        // initial successes 
parameters { 
  real<lower=0, upper=1> phi;         // population chance of success 
  real<lower=1> kappa;                // population concentration 
  vector<lower=0, upper=1>[N] theta;  // chance of success
  int<lower=y> AB[N];                 // chance of times at bat
model { 
  kappa ~ pareto(1, 1.5);                        // hyperprior 
  theta ~ beta(phi * kappa, (1 - phi) * kappa);  // prior 
  y ~ binomial(AB, theta);                        // likelihood 
  AB ~ poisson(K);
hier.stan2 = stan(model_code=hier.stan_code2, data=bball_data, chains=4, iter=2000)

The above code will run if one drops all reference to AB and changes the model to “y ~ binomial(K, theta)”

I think you definitely want:

y ~ binomial(K, theta);

K and y are data here, you know them.

I think where this uncertainty in K might come in it with a posterior predictive sorta thing. If you wanted to generate outcomes for your player.

If you’re interested in generating outcomes of games, then you can’t really assume that in every game a player is going to have the same number of at bats, so you’d need to model K as well, but that’d just be a second piece to the model.

Maybe you think the number of at-bats a player gets in a game is a Poisson, in which case you’d just write:

K ~ poisson(lambda);

And then estimate the parameter lambda (and you could totally pool this, or partially pool, or not pool or whatever). lambda is a continuous parameter so its fine as a Stan model.

Maybe in the generated quantities section you’d do something like:

generated quantities {
  int yhat[N];

     for(n in 1:N) {
       int Ktmp = poisson_rng(lambda[n]);
       yhat[n] = binomial_rng(Ktmp, theta[n]); // edit: Changed K to Ktmp

So if you made any predictions off yhat it has some more uncertainty baked in (due to K). Maybe Bob/Andrew were talking about something else though. That’s my guess though!

Thanks tons, bbbales2, for your thoughtful reply. Rather than seeking predictive posteriors or something in a generated quantities block, my question is more of a Bayesian measurement error problem a la Section 13.1 in the Stan manual. The problem I’m trying to solve is how to model a binomial parameter when the number of trials contains error. In my example above, my desire is to model the number of trials as Poisson centered on K (and where K is a vector and each element of K is different for each player).

Section 13.1 contains a normal (and real-numbered) distribution example of an error-in-measurement problem. And following that example, I probably should’ve written my model block above as,

model { 
  kappa ~ pareto(1, 1.5);                         // hyperprior 
  theta ~ beta(phi * kappa, (1 - phi) * kappa);   // prior 
  y ~ binomial(AB, theta);                        // likelihood 
  K ~ poisson(AB);                                // measurement model
  AB ~ poisson(45);                               // prior on measurement

Also, there’s a footnote in section 13.2 of the manual telling me that Stan can’t handle the following constraint that I have in my original code,

 int<lower=y> AB[N];                 // chance of times at bat

But even after fixing all the above, I get the error, “parameters or transformed parameters cannot be integer or integer array”. So, I suspect I’m running into the fundamental HMC constraint about sampling discrete parameters.

In order to fit this, I likely need to re-express my model along the lines of that done in Section 13.2 of the manual. For example, I could compute the batting averages (y/K) in R prior to Stan or compute them in a transformed data block, and then I could assume that these averages are normally or beta-distributed with some sort of error. But this is all less satisfying to me than directly modeling error in the number of times at bat.

I guess you could talk about an error measurement model on K, but do you really believe that? It just seems unlikely that there are any actual measurement errors on the number of at bats.

Just to be clear, what you’re saying is right: if you want a discrete measurement error model on K, that means dealing with discrete parameters which means trouble in Stan-land (you gotta marginalize, or get lucky in some way such that the combinatorics don’t blow up on you).

But! That doesn’t seem like the way to model this. We know absolutely how many times every player went to bat, and so I really think y ~ binomial(K, theta); is the way to go.

I’d probably make AB a real as well. No reason to make it discrete here I think. The number of actual at bats (K) is an integer, but no reason the parameters couldn’t be continuous. If you wanted a greater-than-zero prior on AB, then maybe a gamma or a half-normal or something might work?

That would be me quoting Andrew :-) The point is that there’s information in the number of times a player comes to bat that is being ignored in the simpler model. In this case, at bats is positively correlated with chance of success.

Yes, because we can’t marginalize out an infinite number of options. We can often do it by just looking at some possible outcomes because the rest don’t add any probability mass that doesn’t just underflow before floating point precision.

Not if you believe umpires’s calls define the truth. So whether someone gets a walk or not depends on the umpire.

The bigger point is that you want to use both the number of bats and the sucess rate in observed number of at bats to predict ability. Perhaps by making them a simple logistic regression with an intercept and slope for number of at bats.

log odds of hit for player n = 

The more subtle baseball point is that walks and some other situations don’t count as at-bats, but have an effect on the game. So Andrew’s other point is that we don’t want to model things like batting average in isolation. We want to take plate appearances, model all the outcomes (and their correlations, etc.) and then also use the number of plate appearances as a predictor. Just using hits in number of at bats is throwing a lot of useful data away.

This is related to Andrew’s bigger point of not wanting to model who wins a soccer match, but rather the points or point differential. It’s just more information, and when you quantify it down to win-loss, you lose information.