I’ve been working on a variation of a model that’s been used to predict the outcomes of games in some sports. The basic model starts like this:

$$g_h \sim \text{Poisson}(\alpha_h \beta_a), \qquad g_a \sim \text{Poisson}(\alpha_a \beta_h)$$

where $g_h$ and $g_a$ are the goals scored in a game by the home and the away team, respectively. The coefficients $\alpha$ and $\beta$ capture a team’s skill in offense and defense, respectively, and are estimated from the outcomes of many games against different opponents. A high $\alpha$ means the team is strong on offense, whereas a low $\beta$ indicates the team is a strong defender. All good so far.
I’ve made some adjustments to this. First, I decompose the team-level coefficients into sums of the coefficients of that team’s individual players:

$$\alpha_{\text{team}} = \sum_{j \in \text{team}} \alpha_j, \qquad \beta_{\text{team}} = \sum_{j \in \text{team}} \beta_j$$
In my case, players consistently play on different teams, so there’s interest in inference on a player-by-player basis. This obviously increases the number of parameters a good bit, but no matter.
The second change I made is to model the parameters as a time series, to infer changes in player performance/skill over time. My attempt is to model each player’s pair of coefficients as a correlated random walk:

$$\begin{bmatrix} \alpha_{j,t} \\ \beta_{j,t} \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \alpha_{j,t-1} \\ \beta_{j,t-1} \end{bmatrix},\ \Sigma \right), \qquad \Sigma = \operatorname{diag}(\tau)\, L L^\top \operatorname{diag}(\tau)$$

where $L$ is the Cholesky factor of a $2 \times 2$ correlation matrix and $\tau$ is a vector of scales. This again increases the number of parameters, as each player now has two parameters for each time period.
The end result is that the model has gotten rather slow to sample with a few thousand games of data, a couple hundred players, and a dozen or so time periods. So I’m curious whether anyone could suggest strategies to optimize the code I’ve written. I know things like vectorization can help speed these models up, but I’m still relatively new to writing larger models in Stan.
My full code is as follows:
data {
int N; // num games
int J; // num players
int T; // time periods
int<lower=1,upper=T> TT[N]; // time period index of game
int X[N, 6]; // player indices per game: cols 1-3 first team, cols 4-6 second team
int G[N, 6]; // goals per player
}
transformed data {
int GT[N, 2]; // goals per team
for (n in 1:N) {
GT[n,1] = sum(G[n,1:3]);
GT[n,2] = sum(G[n,4:6]);
}
}
parameters {
matrix<lower=0>[T, J] alpha; // attacking
matrix<lower=0>[T, J] beta; // defending
// time series params
cholesky_factor_corr[2] L; // correlation
vector<lower=0>[2] tau; // scales (standard deviations)
// hyperparameters of player coefficients
real<lower=0> alpha_a;
real<lower=0> alpha_b;
real<lower=0> beta_a;
real<lower=0> beta_b;
}
transformed parameters {
// These parameters select the correct player
// coefficients for each game, so we don't have to
// do it in the model step
matrix<lower=0>[N, 2] alpha_teams;
matrix<lower=0>[N, 2] beta_teams;
for (n in 1:N) {
{
int team_a[3] = X[n, 1:3];
int team_b[3] = X[n, 4:6];
alpha_teams[n,1] = sum(alpha[TT[n], team_a]);
alpha_teams[n,2] = sum(alpha[TT[n], team_b]);
beta_teams[n, 1] = sum(beta[TT[n], team_a]);
beta_teams[n, 2] = sum(beta[TT[n], team_b]);
}
}
}
model {
// priors
for (t in 1:T) {
alpha[t] ~ gamma(alpha_a, alpha_b);
beta[t] ~ gamma(beta_a, beta_b);
}
alpha_a ~ cauchy(0, 2.5);
alpha_b ~ cauchy(0, 2.5);
beta_a ~ cauchy(0, 2.5);
beta_b ~ cauchy(0, 2.5);
tau ~ cauchy(0, 2.5);
L ~ lkj_corr_cholesky(2);
// time series
for (t in 2:T) {
for (j in 1:J) {
[alpha[t,j], beta[t,j]] ~ multi_normal_cholesky([alpha[t-1,j], beta[t-1,j]], diag_pre_multiply(tau,L));
}
}
// poisson model, the last bit excludes ties as they are not possible
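// for independent Poisson scores with rates lam_h and lam_a, P(tie) is the
// Skellam pmf at zero: exp(-(lam_h + lam_a)) * I_0(2 * sqrt(lam_h * lam_a)),
// so each game's log likelihood is renormalized by subtracting log(1 - P(tie))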
for (n in 1:N) {
  real lam_h = alpha_teams[n, 1] * beta_teams[n, 2];
  real lam_a = alpha_teams[n, 2] * beta_teams[n, 1];
  target += poisson_lpmf(GT[n, 1] | lam_h)
          + poisson_lpmf(GT[n, 2] | lam_a)
          - log1m_exp(-(lam_h + lam_a)
                      + log_modified_bessel_first_kind(0, 2 * sqrt(lam_h * lam_a)));
}
}
Any help would be appreciated! I feel like there are a lot of opportunities for vectorization here, but I’m not quite savvy enough to figure them out.
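For example, I’d guess the time-series loop could collapse to a single multi_normal_cholesky call per period, and the two Poisson terms could be vectorized across all games. Here’s a rough sketch of what I mean for the model block (the lam_h/lam_a names are mine, and I haven’t verified that this gives the same log density as the loops above):

model {
  // ... priors as before ...
  // time series: one vectorized call per period instead of one per player
  {
    matrix[2, 2] Sigma_L = diag_pre_multiply(tau, L);
    for (t in 2:T) {
      vector[2] cur[J];
      vector[2] prev[J];
      for (j in 1:J) {
        cur[j] = [alpha[t, j], beta[t, j]]';
        prev[j] = [alpha[t - 1, j], beta[t - 1, j]]';
      }
      cur ~ multi_normal_cholesky(prev, Sigma_L);
    }
  }
  // likelihood: vectorize the two Poisson terms across games;
  // the Bessel-based tie correction seems to still need a loop
  {
    vector[N] lam_h = col(alpha_teams, 1) .* col(beta_teams, 2);
    vector[N] lam_a = col(alpha_teams, 2) .* col(beta_teams, 1);
    target += poisson_lpmf(GT[1:N, 1] | lam_h)
            + poisson_lpmf(GT[1:N, 2] | lam_a);
    for (n in 1:N)
      target += -log1m_exp(-(lam_h[n] + lam_a[n])
                           + log_modified_bessel_first_kind(0, 2 * sqrt(lam_h[n] * lam_a[n])));
  }
}

Is that roughly the right direction? I’m also not sure how much of the cost is the Bessel loop versus the multi_normal calls.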