I’m trying to model age-related trajectories (i.e., growth curves) of National Football League (NFL) players’ fantasy points using a mixed model in brms. In brief, fantasy points reflect a weighted sum of performance across a number of statistical categories (e.g., passing yards, passing touchdowns, rushing yards, etc.). I’m trying to determine the best-fitting response distribution to use for fantasy points. I am currently using a hurdle gamma response distribution. Although I could model the growth curves for the underlying statistical categories that comprise fantasy points (e.g., passing yards, passing touchdowns, rushing yards, etc.), there are many statistical categories and I am more interested in the overall fantasy points variable, so I’d prefer not to model the individual statistical categories. Any help you can provide in determining the optimal response distribution for these data would be greatly appreciated.
Below is a reproducible example that shows the distribution of the data and my attempt to model player’s trajectories of fantasy points:
library("rstan")
library("brms")
library("cmdstanr")
library("fitdistrplus")
library("parallelly")
library("tidyverse")
# Load data
load(url("https://osf.io/download/6u2j9/")) # data object
load(url("https://osf.io/download/q6rjf/")) # brms model object (gamma)
# Prepare data
player_stats_seasonal_offense <- player_stats_seasonal %>%
filter(position_group %in% c("QB","RB","WR","TE"))
player_stats_seasonal_offense$position[which(player_stats_seasonal_offense$position == "HB")] <- "RB"
player_stats_seasonal_offense$player_idFactor <- factor(player_stats_seasonal_offense$player_id)
player_stats_seasonal_offense$positionFactor <- factor(player_stats_seasonal_offense$position)
player_stats_seasonal_offense$fantasyPoints_posOnlyNoZeros <- player_stats_seasonal_offense$fantasyPoints_posOnly <- player_stats_seasonal_offense$fantasyPoints
player_stats_seasonal_offense$fantasyPoints_posOnly[player_stats_seasonal_offense$fantasyPoints < 0] <- 0
player_stats_seasonal_offense$fantasyPoints_posOnlyNoZeros[player_stats_seasonal_offense$fantasyPoints <= 0] <- 0.01
# Histogram/density plot
ggplot2::ggplot(
data = player_stats_seasonal_offense,
mapping = aes(
x = fantasyPoints)
) +
geom_histogram(
aes(y = after_stat(density)),
color = "#000000",
fill = "#0099F8"
) +
geom_density(
color = "#000000",
fill = "#F85700",
alpha = 0.6
) +
geom_rug() +
theme_classic()
For purposes of comparing the fit of various distributions, I recoded values of zero and negative values to be greater slightly greater than zero to ensure all values are positive, which is a requirement for some distributions. The Cullen and Frey graph appears to suggest that a beta or gamma distribution may be optimal (but my understanding is that a beta distribution requires values that range from [0,1], which is not the case for these data unless we compute a proportion).
# Determine best-fitting response distribution
fantasyPoints <- as.vector(na.omit(player_stats_seasonal_offense$fantasyPoints))
fantasyPoints[which(fantasyPoints <= 0)] <- 0.01
> fitdistrplus::descdist(
+ fantasyPoints,
+ discrete = FALSE,
+ boot = 1000)
summary statistics
------
min: 4.440892e-16 max: 429.1
median: 29.6
mean: 57.86977
estimated sd: 69.26842
estimated skewness: 1.716004
estimated kurtosis: 5.916063
When comparing the empirical and theoretical distributions, the gamma distribution appears to fit better than normal, lognormal, Weibull, and exponential distributions (though the gamma distribution shows some misfit in the Q-Q plot).
# Compare empirical and theoretical distributions
fit.norm <- fitdistrplus::fitdist(fantasyPoints, "norm")
fit.gamma <- fitdistrplus::fitdist(fantasyPoints, "gamma")
fit.lnorm <- fitdistrplus::fitdist(fantasyPoints, "lnorm")
fit.exp <- fitdistrplus::fitdist(fantasyPoints, "exp")
fit.beta <- fitdistrplus::fitdist(fantasyPoints, "beta")
plot(fit.norm)
plot(fit.gamma)
plot(fit.lnorm)
plot(fit.exp)
Normal:
Gamma:
Log normal:
Weibull:
Exponential:
Thus, I used a gamma distribution in the brms model–and because of the excess zeros (negative values were recoded to zero for this model), I used a hurdle gamma. The model includes a main effect of the player’s position and estimates nonlinear trajectories (using a GAM smoother) by position and player. Warning–the model takes a long time to run (though it does converge):
#bayesianMixedModelFormula <- brms::bf(
# fantasyPoints_posOnly ~ positionFactor + s(ageCentered20, by = positionFactor) + years_of_experience + s(ageCentered20, player_idFactor, bs = "re") + (1 | player_idFactor)
#)
#
#bayesianMixedModelFit <- brms::brm(
# formula = bayesianMixedModelFormula,
# data = player_stats_seasonal_offense,
# family = hurdle_gamma(),
# cores = 4,
# save_pars = save_pars(latent = FALSE, all = FALSE),
# threads = threading(parallelly::availableCores()),
# backend = "cmdstanr",
# seed = 52242,
# silent = 0
#)
> bayesianMixedModelFit$formula
fantasyPoints_posOnly ~ positionFactor + s(ageCentered20, by = positionFactor) + years_of_experience + s(ageCentered20, player_idFactor, bs = "re") + (1 | player_idFactor)
> bayesianMixedModelFit$family # Hurdle Gamma
Family: hurdle_gamma
Link function: log
However, there is some misfit between the model’s predicted values and the observed values based on a posterior predictive check:
set.seed(123)
pp_check(bayesianMixedModelFit) +
ggplot2::xlim(0, 600)
So, I’m wondering whether there may be a better response distribution to use than gamma for these data. Thanks in advance!