Baseball analysis using latent generalised Pareto distribution

When I worked in sports analysis I would often try to fit hierarchical models with normal-distributed latent ability parameters to mixed sports datasets where there was lots of data for some players and not so much data for others.

My models would often end up thinking that players with less data are basically the same as the other players; as a result they would a) get these players’ abilities systematically wrong (players with more data tend to be better) and b) make all the ability uncertainties a bit too small.

I tried a lot of ways around this problem, such as adding effects for amount of data or putting tight priors on the hierarchical standard deviation parameters, but nothing really worked very well.

Recently I saw this case study by @avehtari and got another idea for how to solve the problem: use a generalised Pareto distribution for latent abilities rather than a normal distribution. The motivation is that professional sportspeople’s abilities are basically the tails of the general sports ability distribution.
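To make the idea concrete, here is a minimal sketch of drawing latent abilities from a generalised Pareto distribution and mapping them to batting averages. It is not the code from the linked writeup: the logit-scale parametrisation, the helper names `gpareto_quantile` and `inv_logit`, and the values of `mu`, `sigma` and `k` are all assumptions chosen for illustration.

```python
import math
import random


def gpareto_quantile(u, mu, sigma, k):
    """Inverse CDF of a generalised Pareto distribution.

    mu is the lower bound (location), sigma > 0 the scale, k the shape.
    """
    if abs(k) < 1e-12:
        # The k -> 0 limit is a shifted exponential distribution.
        return mu - sigma * math.log1p(-u)
    return mu + sigma * ((1.0 - u) ** (-k) - 1.0) / k


def inv_logit(x):
    """Map a logit-scale ability to a probability of getting a hit."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(1)

# Hypothetical parameter values, not estimates from any dataset.
mu, sigma, k = -1.7, 0.4, 0.1

# Inverse-CDF sampling: push uniform draws through the quantile function.
latent = [gpareto_quantile(rng.random(), mu, sigma, k) for _ in range(1000)]
averages = [inv_logit(a) for a in latent]

print(f"lowest possible average: {inv_logit(mu):.3f}")
print(f"smallest and largest simulated average: {min(averages):.3f}, {max(averages):.3f}")
```

Every draw sits at or above the lower bound `mu`, with a long right tail, which is the shape the "tail of a wider population" motivation suggests.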

I’m not sure if that idea really stands up to scrutiny, but I tried it anyway and it seemed to work ok! Specifically, I repeated some of the analysis in this baseball case study by @Bob_Carpenter. My code and a more detailed writeup can be found here. The graph below shows the main result: ability parameter posteriors in a normal-distribution hierarchical model vs in a generalised Pareto distribution hierarchical model fit to the same data.

The normal model seems over-regularised (according to my extremely limited baseball knowledge a batting average close to 0.4 is not unheard of and around 0.2 is pretty common) and too generous to the low-data players. Intuitively, it also seems wrong that the intervals are about the same width across the at-bat spectrum. On the other hand the generalised Pareto model seems more or less ok.

This is quite a cursory analysis but I thought I’d post in case someone else working on the same kind of problem wants to follow up.


This is a really interesting thought! Thanks for sharing! Very cool.

I’m a little bit confused about this part. How did you define “wrong” for those previous analyses? Were they wrong in the sense that once you had more data on those players to check predictions, the model predictions turned out to be systematically wrong? …Intuitively, I would think that the hierarchical model with a normal distribution would be doing what you wanted in the sense that if you don’t have much data and the player performance was poor (or way too good), then it should pool them up (or down) towards the mean. The “systematically wrong” part seems to indicate that other features are needed. For example, are those with few data neo-pros (true small data cases), or are they long-time pros who don’t get many major-league at bats (missing data mechanism)? It’s almost like a missing data problem, where the missingness mechanism needs to be modeled.
Just curious about the systematically wrong part and if you defined that based on more future data for those players who were modeled with few data.

Cool post. Thanks for sharing.


My confusion might stem from the same source as @jd_c’s question, but does that graph imply that the generalized Pareto model is consistent with someone who goes say 7/10 at the start of their career actually batting close to 1.000 for the rest of their career? Or just that any player can bat >400 for a small sample size, but generally won’t continue to do so?

If my first interpretation of the result is correct, then it seems to me like the normal lines are probably better estimates of the expected true talent of the player.


Really interesting stuff here, thanks for posting.

I think that, like you mentioned, there is under-regularization here. One thing is that your max alpha is far too high; it should be something like 0.4.


I have thought about this a fair bit in the context of performance as a business and I think the approach by @Teddy_Groves1 could be formalized by someone smarter than me. One can think of the population baseball ability as normally distributed, and everyone gets to bat once in their life. Depending on their ability and luck they are successful (whatever that means) and get to bat a second time. Repeat this process enough times and the underlying distribution of (edit:) observed ability is going to look like an extreme value/generalised Pareto distribution. So I guess what I am saying is that something like a Pareto distribution is modeling a type of missingness/selection mechanism.
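The thought experiment above can be simulated in a few lines. This is only a sketch under made-up assumptions: "successful" is taken to mean getting a hit, abilities live on the logit scale, and the population mean and standard deviation (`-1.2`, `0.3`) and the function name `simulate_selection` are hypothetical. It shows that the survivors are shifted upwards relative to the full population; it does not check how close their distribution is to a generalised Pareto.

```python
import math
import random
import statistics


def inv_logit(x):
    """Map a logit-scale ability to a probability of getting a hit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_selection(n_players=50_000, n_rounds=3, seed=0):
    """Repeatedly let players bat, keeping only those who get a hit."""
    rng = random.Random(seed)
    # Hypothetical normal population of logit-scale abilities.
    population = [rng.gauss(-1.2, 0.3) for _ in range(n_players)]
    survivors = population
    for _ in range(n_rounds):
        # Everyone still in the game bats once; ability and luck decide
        # who gets to bat again.
        survivors = [a for a in survivors if rng.random() < inv_logit(a)]
    return population, survivors


population, survivors = simulate_selection()
print(f"players remaining: {len(survivors)} of {len(population)}")
print(f"mean ability, everyone:  {statistics.fmean(population):.3f}")
print(f"mean ability, survivors: {statistics.fmean(survivors):.3f}")
```

A hard cut on ability or a different success rule would give a differently shaped selected population, so the choice of mechanism matters for what the observed distribution looks like.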


Thanks for all the great feedback and ideas!

By this I meant that the players with less data tend to be assigned abilities that are too high - you can see this in the graph by the fact there are more black dots below the orange band than above in the 0 to 100 at-bats region. I think if the model was good, then these “residuals” would be pretty much unbiased as a function of number of at-bats. As you say, the natural way to address this would be to add a feature for number of at-bats: I haven’t tried that with this dataset but I never previously got it to work with football (soccer) data.

The blog post where the data come from says that they come from “the 2006 American League position players” so I guess it is a fairly complete one-season dataset with some players excluded for being pitchers. I imagine there is some process according to which worse batters get fewer at bats (maybe some are fielding specialists? I know very little about baseball!), but the low data players could also include some who got injured, came in from other leagues etc.

I think that is right, except that as all the data come from one season the intention is to model some kind of “true average” for that season rather than a whole career. I agree that the blue model is probably not pulling the low data players in enough. I was mainly just happy that it doesn’t do so as much as the orange model, and that there is a big difference in uncertainty between the high and low data players. I think the ideal model would probably be somewhere in the middle in terms of interval widths.

Ah ok, I guess that shows how little I know about baseball!

Very interesting, thanks!


I tried out adding more regularisation to the blue model (I just added a normal prior on the logit scale alpha parameters with most mass between averages 0.1 and 0.4) and adding an at-bats effect to the orange model. Code and a bit more writing is again here and here is the graph:

The at-bats effect removed the bias I mentioned before, but I think the intervals are still too narrow, especially for the low-data players. Thanks to the extra prior, the generalised Pareto model is no longer compatible with crazy true averages for low-data players.
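As a rough illustration of how a statement like "most mass between averages 0.1 and 0.4" translates into a normal prior on the logit scale, the sketch below solves for a mean and standard deviation. It assumes that "most mass" means the central 95% interval, which is an interpretation rather than something stated above, and the numbers need not match the prior used in the linked code.

```python
import math


def logit(p):
    """Map a probability to the logit scale."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumption: "most mass" is read as the central 95% prior interval.
lower_avg, upper_avg = 0.1, 0.4
z = 1.959963984540054  # 97.5% quantile of the standard normal

lo, hi = logit(lower_avg), logit(upper_avg)
prior_mean = (lo + hi) / 2.0       # midpoint of the interval on the logit scale
prior_sd = (hi - lo) / (2.0 * z)   # half-width divided by the normal quantile

print(f"normal({prior_mean:.3f}, {prior_sd:.3f}) on the logit scale")
print(f"implied prior median batting average: {inv_logit(prior_mean):.3f}")
```

Under that reading the prior comes out at roughly normal(-1.30, 0.46) on the logit scale, with a median batting average of about 0.21.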


Very interesting! I think that a prior putting most of the mass between a player batting 100 and 400 is probably correct for a full season of at bats, but it seems to me that is then assuming every player will get a full season. I think it’s fairly likely someone who goes 0-25 actually will bat below 100 on the “season”, since their season might very well end after those 25 ABs.

But, if the interpretation is “what would happen if you just let them keep going”, the gpareto interval on the new graph looks very reasonable to me.


Nice! This does look better.

I’m wondering if what is best depends on the purpose of the analysis…? Say you have some sample of pro players with a large range of at-bats and no further information. The generalized Pareto model seems good to fit this data. However, what if you wanted to take this sample and predict the next 200 at bats, or maybe you want to predict their performance for the next 3 seasons because you are thinking about a contract? Would the generalized Pareto still be better? I’m still wondering if for those players in the sample with few at-bats, it would seem that heavy regularization would be the optimal thing - even looking at the new plot above, the chance that those players with what looks to be 20-30 at bats with a .350+ batting average would continue with this average or even slightly less over the course of the next 200 at-bats or next 3 seasons seems slim to me (intuitively).

It seems like there are a few things going on with the batting average variation between players decreasing with increasing at bats - as @WardBrian mentions, there is some selection bias as better players likely bat more times and better players are similar to each other, decreasing variation between them; but also, of course there is less variation anyway as the number of at-bats increases (increasing sample size of bats per player which decreases the variation that occurs due to a myriad of factors, a main one of which is the particular game they are batting in, more games = better true estimate of ability).

So I could see the latent ability distribution in the model being chosen based on the intent of the analysis. Does that make any sense?
I guess to convince myself that the choice of normal distribution and heavy regularization isn’t best for predicting future performance, then I would want to see the models compared for a decent sized out-of-sample future at-bats for the same players.
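One way such an out-of-sample comparison could be scored is with the log predictive density of each player's future hits, averaged over posterior draws of their ability. The sketch below assumes a binomial model for future at bats; the helper names and all the numbers (posterior draws, future hits and at bats) are made up for illustration and are not taken from any fitted model.

```python
import math


def binomial_log_pmf(k, n, p):
    """Log probability of k hits in n at bats with hit probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(x) over xs."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def heldout_lpd(ability_draws, future_hits, future_at_bats):
    """Log predictive density of one player's future hits.

    ability_draws are posterior draws of the player's hit probability.
    """
    return log_mean_exp(
        [binomial_log_pmf(future_hits, future_at_bats, p) for p in ability_draws]
    )


# Made-up posterior draws for a single low-data player under two models.
draws_model_a = [0.26, 0.27, 0.28, 0.27, 0.26]  # narrow posterior
draws_model_b = [0.15, 0.22, 0.30, 0.38, 0.25]  # wide posterior
future_hits, future_at_bats = 50, 200            # made-up future data

print("model A:", heldout_lpd(draws_model_a, future_hits, future_at_bats))
print("model B:", heldout_lpd(draws_model_b, future_hits, future_at_bats))
```

Summing this quantity over players gives a single score per model, with higher meaning better predictions for the held-out at bats.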

The argument for generalized Pareto is compelling though, and I still really like this observation.


This thread brings up a few really important but often neglected issues in hierarchical modeling (I’ll be pulling from Hierarchical Modeling quite a bit so take a look if any of the terms/concepts are unclear).

Mathematically hierarchical models are equivalent to the assumption of an infinitely large, exchangeable population of behaviors from which individual behaviors are drawn. If \theta_{k} are the individual parameters then the hierarchical density function \pi(\theta_{k} \mid \phi) models the behavior of that infinite population. When constructing inferences this population model serves as a prior model for the individual parameters, pulling them towards the bulk of the population unless the individual likelihood functions are narrow enough to inform otherwise.
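Written out, the structure described above is the usual hierarchical factorisation. The observations y_k, the number of individuals K, and the hyperprior \pi(\phi) are notation assumed here for illustration; only \theta_{k} and \pi(\theta_{k} \mid \phi) appear in the text.

```latex
\pi(\theta_{1}, \ldots, \theta_{K}, \phi \mid y_{1}, \ldots, y_{K})
\propto
\pi(\phi) \prod_{k = 1}^{K} \pi(y_{k} \mid \theta_{k}) \, \pi(\theta_{k} \mid \phi)
```

Each \theta_{k} is informed by its own likelihood \pi(y_{k} \mid \theta_{k}) and by the shared population model \pi(\theta_{k} \mid \phi); when the likelihood is wide, as for a player with few at bats, the population term dominates.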

Critically if the population model isn’t appropriate then that regularization will lead to poor inferential and predictive performance. For example while hierarchical modeling is often introduced with normal population models those models won’t be appropriate if the parameters are not one-dimensional and unconstrained. They also won’t be appropriate if the individual behaviors cluster into groups or if there are rare extreme behaviors mixed in with the typical behaviors. The horseshoe, and many other models of sparsity, are in fact just hierarchical models with heavy-tailed population models; see for example Sparsity Blues.

Now let’s consider baseball player ability. A normal population model would be appropriate if there were a bulk of average players and then just as many poor players as good players. When this isn’t an appropriate assumption then the configuration of the normal population model will contort itself in awkward ways to try to fit the data as well as possible. For example if there’s a heavy skew towards better players than worse players then the normal population model will be pulled up into that upper tail which then introduces a bias in the regularization of the average players.

Here it looks like the selection in the data makes this even worse. The better performing players will have more at bats and hence stronger likelihood functions for their batting ability. Consequently the inappropriate normal population model will tend to strongly concentrate around these players, pulling the inferences for all of the other players up with it. This behavior should be pretty clear if you plot the marginal posteriors for the individual parameters against each other (and the population location). See the hierarchical modeling chapter linked at the beginning for some examples.

In a sport like baseball for which ability is heavily selected a normal population model is unlikely to be reasonable. Something skewed would be more appropriate, or even something with power-law-like behavior, which not coincidentally the Pareto model exhibits!
All of this is to say that if one steps back and thinks about the population of individual behaviors being modeled then in hindsight the normal population model is a poor assumption while the Pareto is much more reasonable.

Building a principled prior model for the Pareto population parameters (concentrating batting average between the Mendoza and Ted Williams lines) is the next step.


Which is why @Teddy_Groves1’s idea and post is so cool:-)

Thanks for the informative summary; it makes a lot of sense.

But just to push back a little bit to learn more myself (no harm in me being wrong for the sake of understanding:-), it would seem to me that the skewed population model would describe the entirety of pro baseball batting distribution (MLB + all minor leagues players) if they were to all get a chance to bat in MLB at the highest level. But it doesn’t work that way, and I would question the notion that that skewed distribution is hard cut into distinct lines demarcating the hierarchical levels of leagues (4 minor + 1 major), because players move fluidly across those boundaries via a selection process (done by people and not composed solely of batting average) that may not select using some hard boundary on the true skewed population as the threshold. It would seem to me that this selection process ‘wants’ to create a normal population within the respective tier, that would indeed balance for similar number of ‘poor’ and ‘outstanding’ players as the former are not desirable and the latter hard to come by, and stability is partly desirable.
The plot in the post here Baseball analysis using latent generalised Pareto distribution - #7 by Teddy_Groves1 seems to show that when a proxy for the result of the selection process (the number of at-bats) is included in the model, the bias disappears, and it does look like a normal distribution around an average that is corrected for selection.

Thus I guess I am still not quite convinced that, in this particular scenario, the normal population model is worse (where worse would be defined on the outcome of predicting the next 100 at bats), provided that you model the process by which players are selected into (or demoted from) the league. Or maybe it’s all the same, and one is a more general way of looking at the other…?


Thanks @betanalpha and @jd_c for more comments and the case study link. It’s very nice to get help and feel like I’m making progress with this problem that bugged me for ages!

The patterns in the normal model now make much more sense, and I have a better idea about generally when a latent normal model is and isn’t appropriate. I think I read this passage from the hierarchical modelling case study before but it really hit home this time what a nice diagnostic method this is:

Visualizing the marginal posteriors for all of the individual parameters together like this is particular helpful for critiquing the assumption of a normal latent population model. In particular any outliers and clustering in the marginal posteriors would suggest that a more sophisticated latent model might be appropriate.

When I get time I will try:

  1. Look for patterns in the marginal ability posteriors that indicate that the normal model is struggling to describe the population.
  2. Try and get the prior in the generalised Pareto model to concentrate abilities appropriately.
  3. Look for some richer data that would allow for predicting future at bats.

I would argue these aren’t actually different – the relevant question is what population the hierarchy is modeling. One can use the hierarchy to model the selected population, which will likely require a non-normal population model. Alternatively one could use the hierarchy to model the latent, unselected population and then model the selection process between that population and the observed data. Indeed the former is essentially trying to model the marginal of the latter directly.
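The relationship between the two approaches can be sketched in symbols. The selection indicator s_k, the selection parameters \psi, and the selection probability p(s_{k} = 1 \mid \theta_{k}, \psi) are notation assumed here for illustration; they are not defined elsewhere in the thread.

```latex
\pi(\theta_{k} \mid s_{k} = 1, \phi, \psi)
=
\frac{ p(s_{k} = 1 \mid \theta_{k}, \psi) \, \pi(\theta_{k} \mid \phi) }
     { \int p(s_{k} = 1 \mid \theta', \psi) \, \pi(\theta' \mid \phi) \, \mathrm{d}\theta' }
```

The second approach models both factors in the numerator explicitly, while the first replaces the whole left-hand side with a single flexible population model such as a generalised Pareto.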

Both approaches are useful in practice. If one knows something about the selection process then they can model it directly, and if they don’t then they can try to model the selected population more heuristically.

Interestingly the precise population model often doesn’t matter much because there usually aren’t enough individuals to really be able to resolve the population shape all that well. This is a somewhat exceptional circumstance where there is data for so many individuals that one can really start saying precise things about that population and the default normal is likely to be less adequate.


Not sure if it will add anything for you in terms of the selection process, but I downloaded the last 5 years of MLB batting data from History of All Major League Baseball: National League, American League, Negro League and more, and put them in a dataset here. This has much more information, in addition to the at-bats and hits. Note that the years have quite varied number of players in the dataset, so hopefully it is comprehensive. Also, some players have zero or missing at-bats data.
MLBdata.RData (243.1 KB)

Cool discussion, and thanks again for posting. I work mainly in health related fields, not sports, but your post got me thinking about populations that are tails of distributions. Seems useful.