Determining optimal response distribution for modeling growth curves of NFL players' fantasy football points in brms

I’m trying to model age-related trajectories (i.e., growth curves) of National Football League (NFL) players’ fantasy points using a mixed model in brms. In brief, fantasy points reflect a weighted sum of performance across a number of statistical categories (e.g., passing yards, passing touchdowns, rushing yards, etc.). I’m trying to determine the best-fitting response distribution to use for fantasy points. I am currently using a hurdle gamma response distribution. Although I could model the growth curves for the underlying statistical categories that comprise fantasy points (e.g., passing yards, passing touchdowns, rushing yards, etc.), there are many statistical categories and I am more interested in the overall fantasy points variable, so I’d prefer not to model the individual statistical categories. Any help you can provide in determining the optimal response distribution for these data would be greatly appreciated.

Below is a reproducible example that shows the distribution of the data and my attempt to model player’s trajectories of fantasy points:

library("rstan")
library("brms")
library("cmdstanr")
library("fitdistrplus")
library("parallelly")
library("tidyverse")

# Load data
load(url("https://osf.io/download/6u2j9/")) # data object
load(url("https://osf.io/download/q6rjf/")) # brms model object (gamma)

# Prepare data
player_stats_seasonal_offense <- player_stats_seasonal %>% 
  filter(position_group %in% c("QB","RB","WR","TE"))

player_stats_seasonal_offense$position[which(player_stats_seasonal_offense$position == "HB")] <- "RB"

player_stats_seasonal_offense$player_idFactor <- factor(player_stats_seasonal_offense$player_id)
player_stats_seasonal_offense$positionFactor <- factor(player_stats_seasonal_offense$position)

player_stats_seasonal_offense$fantasyPoints_posOnlyNoZeros <- player_stats_seasonal_offense$fantasyPoints_posOnly <- player_stats_seasonal_offense$fantasyPoints
player_stats_seasonal_offense$fantasyPoints_posOnly[player_stats_seasonal_offense$fantasyPoints < 0] <- 0
player_stats_seasonal_offense$fantasyPoints_posOnlyNoZeros[player_stats_seasonal_offense$fantasyPoints <= 0] <- 0.01

# Histogram/density plot
ggplot2::ggplot(
  data = player_stats_seasonal_offense,
  mapping = aes(
    x = fantasyPoints)
) +
  geom_histogram(
    aes(y = after_stat(density)),
    color = "#000000",
    fill = "#0099F8"
  ) +
  geom_density(
    color = "#000000",
    fill = "#F85700",
    alpha = 0.6
  ) +
  geom_rug() +
  theme_classic()

For purposes of comparing the fit of various distributions, I recoded values of zero and negative values to be greater slightly greater than zero to ensure all values are positive, which is a requirement for some distributions. The Cullen and Frey graph appears to suggest that a beta or gamma distribution may be optimal (but my understanding is that a beta distribution requires values that range from [0,1], which is not the case for these data unless we compute a proportion).

# Determine best-fitting response distribution
fantasyPoints <- as.vector(na.omit(player_stats_seasonal_offense$fantasyPoints))
fantasyPoints[which(fantasyPoints <= 0)] <- 0.01

> fitdistrplus::descdist(
+     fantasyPoints,
+     discrete = FALSE,
+     boot = 1000)
summary statistics
------
min:  4.440892e-16   max:  429.1 
median:  29.6 
mean:  57.86977 
estimated sd:  69.26842 
estimated skewness:  1.716004 
estimated kurtosis:  5.916063

When comparing the empirical and theoretical distributions, the gamma distribution appears to fit better than normal, lognormal, Weibull, and exponential distributions (though the gamma distribution shows some misfit in the Q-Q plot).

# Compare empirical and theoretical distributions
fit.norm <- fitdistrplus::fitdist(fantasyPoints, "norm")
fit.gamma <- fitdistrplus::fitdist(fantasyPoints, "gamma")
fit.lnorm <- fitdistrplus::fitdist(fantasyPoints, "lnorm")
fit.exp <- fitdistrplus::fitdist(fantasyPoints, "exp")
fit.beta <- fitdistrplus::fitdist(fantasyPoints, "beta")

plot(fit.norm)
plot(fit.gamma)
plot(fit.lnorm)
plot(fit.exp)

Normal:

Gamma:

Log normal:

Weibull:

Exponential:

Thus, I used a gamma distribution in the brms model–and because of the excess zeros (negative values were recoded to zero for this model), I used a hurdle gamma. The model includes a main effect of the player’s position and estimates nonlinear trajectories (using a GAM smoother) by position and player. Warning–the model takes a long time to run (though it does converge):

#bayesianMixedModelFormula <- brms::bf(
#  fantasyPoints_posOnly ~ positionFactor + s(ageCentered20, by = positionFactor) + years_of_experience + s(ageCentered20, player_idFactor, bs = "re") + (1 | player_idFactor)
#)
#
#bayesianMixedModelFit <- brms::brm(
#  formula = bayesianMixedModelFormula,
#  data = player_stats_seasonal_offense,
#  family = hurdle_gamma(),
#  cores = 4,
#  save_pars = save_pars(latent = FALSE, all = FALSE),
#  threads = threading(parallelly::availableCores()),
#  backend = "cmdstanr",
#  seed = 52242,
#  silent = 0
#)

> bayesianMixedModelFit$formula
fantasyPoints_posOnly ~ positionFactor + s(ageCentered20, by = positionFactor) + years_of_experience + s(ageCentered20, player_idFactor, bs = "re") + (1 | player_idFactor) 
> bayesianMixedModelFit$family # Hurdle Gamma

Family: hurdle_gamma 
Link function: log

However, there is some misfit between the model’s predicted values and the observed values based on a posterior predictive check:

set.seed(123)

pp_check(bayesianMixedModelFit) + 
  ggplot2::xlim(0, 600)

So, I’m wondering whether there may be a better response distribution to use than gamma for these data. Thanks in advance!

It does. But you can scale it. If you have x \in (a, b), then (x - a) / (b - a) \in (0, 1), which means you can write (x - a) / (b - a) \sim \text{beta}(\alpha, \beta), and if the lower and upper bounds a, b are constant, you don’t need a Jacobian adjustment.

Did you try inverse gamma? It’s not on your list and can produce curves like the one you show. But…

Have you tested the sensitivity to how you’re pushing things above zero? I find it’s usually easier to stick to a coherent generative model. Rather than trying to fit a curve to a histogram that’s discrete, you can try a discrete distribution. You could use a Poisson or negative binomial GLM with a log link.

Thanks for the suggestion to try the inverse gamma distribution. I haven’t tried inverse gamma. I don’t see it in the list of available brms families. How would I implement this in brms? Would I need to use a custom family? Any guidance for how to try this would be greatly appreciated.

The Cullen-Frey graphs and summary statistics look very similar across 1) the raw data, 2) no negative values, and 3) no zero or negative values (see below):

Raw Data:

> fitdistrplus::descdist(
+     fantasyPointsVector,
+     boot = 1000)
summary statistics
------
min:  -12.28   max:  429.1 
median:  29.6 
mean:  57.85498 
estimated sd:  69.28116 
estimated skewness:  1.715222 
estimated kurtosis:  5.913723

No Negative Values (i.e., scores below zero were converted to zero):

> fitdistrplus::descdist(
+     fantasyPointsVector_posOnly,
+     boot = 1000)
summary statistics
------
min:  0   max:  429.1 
median:  29.6 
mean:  57.86936 
estimated sd:  69.26876 
estimated skewness:  1.715984 
estimated kurtosis:  5.916001

No Zero or Negative Values (i.e., scores at or below zero were converted to 0.01):

> fitdistrplus::descdist(
+     fantasyPointsVector_posOnlyNoZeros,
+     boot = 1000)
summary statistics
------
min:  4.440892e-16   max:  429.1 
median:  29.6 
mean:  57.86977 
estimated sd:  69.26842 
estimated skewness:  1.716004 
estimated kurtosis:  5.916063

The values (fantasy points) are not integers—I could round them to be integers, but the scores are fairly continuous (with precision to two decimal places). So, unless I round them to be integers, my understanding is that I wouldn’t be able to use Poisson or Negative Binomial distributions.

Thanks again—any suggestions you have would be greatly appreciated!

Sorry—I don’t use brms, so I have no idea!

I don’t now what a Cullen-Frey graph is, but similar behavior is encouraging if this is what you care about.

That’s convenient!

This is a fun project. My primary comment would be that distributions exist to capture errors, not to fit the empirical distribution of the response. For obvious reasons, the two are often related, but it is not a given. Regardless of the shape of outcome, the errors could still essentially be additive and therefore broadly normal or student-T in shape, particularly when you are combining multiple statistics. So I would not make fitting the response distribution your goal.

Sports distributions also typically make more sense as a rate than as an aggregated count statistic. I recognize that total points is your focus, but it really is two problems and perhaps even two models: (1) how good is the player at the metric(s) in question?; (2) how many chances are they being given to perform at it? You then are simply multiplying the two in some way. A rate statistic will usually give you something much easier to work with.

When I am looking for an appropriate error distribution, including for sports data, I (a) limit to options that make empirical sense, which you are already trying to do; (b) use LOO or its analogues to determine the best estimated fit to appropriate out of sample data, which is what really matters; and (c) use posterior-predictive checks and other graphical overlays to confirm that your combination of (a) and (b) has not led you completely astray (typically it won’t).

Unfortunately there frequently is no ideal distribution for distribution of sports outcomes (or anything else, I guess). Sports outcomes are designed, as I like to say, for maximum entertainment, not maximum entropy, although our traditional tools often work quite well anyway. So I would develop the best performing tool you can find, and then come back in a year and see if you can improve it, and you will virtually always find the answer is “yes.” The goal is to keep finding new ways to be less wrong, more often.

You seem to have a good start on this. Good luck.

2 Likes

To make this concrete, when we write y_n = \alpha + \beta \cdot x_n + \epsilon_n and \epsilon_n \sim \text{normal}(0, \sigma), we are saying that the expected value of y in the model is given by the regression \alpha + \beta \cdot x_n and the residual error, (\alpha + \beta \cdot x_n) - y_n, has a \text{normal}(0, \sigma) distribution.

You can also check that you get the right response distribution with something like posterior predictive checks that will consider both expectations and variance in the estimates.

Excellent points, bachlaw and Bob. What you both say makes good sense. I will continue to try various distributions and evaluate LOO and posterior predictive checks to compare them empirically. Much appreciated!