Advice on modeling the number of goals scored in a handball match

I’m modeling handball match outcomes using an approach commonly applied in football analysis (based on this paper: https://www.tandfonline.com/doi/full/10.1080/02664760802684177). However, I’m having difficulty modeling the number of goals scored by the away team. The distribution of away‑team goals looks like this:

I tried fitting a Poisson model to the away‑team goal data using the following Stan code:

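data {
  int<lower=0> N;              // number of matches
  array[N] int<lower=0> y;     // away-team goals per match
}
parameters {
  real<lower=0> lambda;        // Poisson rate (expected goals per match)
}
model {
  // prior: gamma(shape = 60, rate = 2), i.e. mean 30 and sd of about 3.9
  lambda ~ gamma(60, 2);
  // likelihood
  y ~ poisson(lambda);
}
generated quantities {
  array[N] real log_lik;       // pointwise log-likelihood
  array[N] int y_pred;         // posterior predictive draws

  for (i in 1:N) {
    log_lik[i] = poisson_lpmf(y[i] | lambda);
    y_pred[i] = poisson_rng(lambda);
  }
}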

The posterior predictive check plot looks like this:

library(bayesplot)  # provides ppc_bars()
ppc_bars(dat$y,
         fit_poisson$draws(variables = "y_pred", format = "matrix"))

The Poisson model fails to account for the spike at 28 goals.

I tried to include several covariates, but they did not improve the fit. I also tried a negative binomial model, which likewise failed to capture the spike at 28 goals.

Should I consider a different distribution? If so, which one?

Can you think of any reason why some numbers (like 28) would be much more likely than either of their neighbors? It’s one thing for the model to fail to account for the spike at 28, but I also notice that if we just reallocated a bit of the mass on 28 over to 29, everything would look completely fine. This suggests to me that either there is something special about 28 and 29 in particular, or this is just a fluke and the Poisson is fine. When the Poisson is a bad fit, it’s usually because it doesn’t fit the tails well, not because of extra-Poisson variability in the frequency of adjacent counts.
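
One way to judge whether a single bar could be a fluke is to compare it with its sampling variability. If the scores were independent Poisson draws, the number of matches ending on exactly k goals would be binomial, with success probability equal to the Poisson pmf at k. The sketch below (Python, for illustration only) computes that expected bar height and its standard deviation. The values `lam = 28.0` and `n_matches = 300` are made up and are not taken from the data in this thread, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events, computed via the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def bar_height_summary(k, lam, n_matches):
    """Expected number of matches with exactly k goals, and its binomial SD,
    assuming all n_matches scores are independent Poisson(lam) draws."""
    p = poisson_pmf(k, lam)
    expected = n_matches * p
    sd = math.sqrt(n_matches * p * (1.0 - p))
    return expected, sd


# Hypothetical inputs for illustration; replace with the posterior mean of
# lambda and the real number of matches.
lam = 28.0
n_matches = 300

for k in (27, 28, 29):
    expected, sd = bar_height_summary(k, lam, n_matches)
    print(f"{k} goals: expected {expected:.1f} matches, SD {sd:.1f}")
```

A bar that sits within roughly two standard deviations of its expected height is hard to distinguish from noise, which is the sense in which the spike at 28 might be a fluke. This ignores posterior uncertainty in lambda, so it is only a rough guide.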

No, it’s more likely just a natural consequence of the tempo of play and tactics employed in high-level handball games.

The model’s fit for the home team’s goals looks much better overall, but it still appears to underestimate the frequencies of the two most common values (30 and 31).

Still, as you suggested, if I were to reallocate some of the counts to the neighboring values, the plot would look fine.

Do you think this is cause for concern? I’ve read some papers (https://www.statlab-unisa.it/cladag2023/wp-content/uploads/simple-file-list/IS22-4058-12087-3-DR.pdf and https://journals.sagepub.com/doi/full/10.1177/22150218251313937) arguing that the Poisson may not be a good fit for handball data because it can’t accommodate underdispersion. The authors suggest the Conway-Maxwell-Poisson, but it is not implemented in Stan (Conway-Maxwell-Poisson Distribution).
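
A quick check of the underdispersion claim on the goal counts is the variance-to-mean ratio, which is 1 for a Poisson. The sketch below (Python, for illustration only) computes that ratio and evaluates the Conway-Maxwell-Poisson log pmf in its standard form, P(y) ∝ λ^y / (y!)^ν, where ν = 1 gives the Poisson and ν > 1 gives underdispersion. The normalizing constant is an infinite sum, approximated here by truncating at `max_count`. The function names, the truncation point, and the example values `lam = 784.0` and `nu = 2.0` are assumptions chosen for illustration, not anything taken from the papers or from an existing Stan implementation.

```python
import math


def dispersion_index(counts):
    """Sample variance divided by sample mean; values below 1 suggest underdispersion."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean


def cmp_log_normalizer(lam, nu, max_count=200):
    """Log of sum_{j=0}^{max_count} lam^j / (j!)^nu, via a stable log-sum-exp.
    max_count must sit well above any plausible count (an assumption here)."""
    log_terms = [j * math.log(lam) - nu * math.lgamma(j + 1)
                 for j in range(max_count + 1)]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def cmp_log_pmf(y, lam, nu, max_count=200):
    """Log pmf of the Conway-Maxwell-Poisson with a truncated normalizer."""
    log_z = cmp_log_normalizer(lam, nu, max_count)
    return y * math.log(lam) - nu * math.lgamma(y + 1) - log_z


def cmp_mean_var(lam, nu, max_count=200):
    """Mean and variance of the truncated distribution, by direct summation."""
    log_z = cmp_log_normalizer(lam, nu, max_count)
    probs = [math.exp(j * math.log(lam) - nu * math.lgamma(j + 1) - log_z)
             for j in range(max_count + 1)]
    mean = sum(j * p for j, p in enumerate(probs))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(probs))
    return mean, var


# Hypothetical parameter values, picked so the mean lands near 28 goals.
mean, var = cmp_mean_var(784.0, 2.0)
print(f"mean {mean:.2f}, variance {var:.2f}, ratio {var / mean:.2f}")
```

The same truncated-sum idea could in principle be written as a user-defined function in Stan, though the truncation point would need care when ν is small, because the terms of the sum then decay slowly.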