I found an awesome github repo that presents a fully functioning poisson model for estimating the outcome of football matches and other parameters of interest.
I want to know how to follow the modelled home_goals[g] and away_goals[g] so that I can build a grid of probabilities for different scores. These are fed in as ‘data’ but at the end of the model they are also estimated as home_goals[g] ~ poisson(home_expected_goals[g]).
How do i follow these parameters from my code?
The relevant lines of python (model fitting are here - src/soccerstan.py lines 79-104).
# dict of the model data
model_data = {
'n_teams': len(team_map),
'n_games': len(data),
'home_team': data['home_team_id'],
'away_team': data['away_team_id'],
'home_goals': data['home_goals'],
'away_goals': data['away_goals']
}
# sample the model
fit = stan_model.sampling(data=model_data, **kwargs)
output = fit.extract()
# odict_keys(['home_advantage', 'offense_raw', 'defense_raw',
# 'offense', 'defense', 'lp__'])
# Tidy the output a little...
reverse_map = {v: k for k, v in team_map.items()}
for param in model.team_parameters:
df = pd.DataFrame(output[param])
# rename the columns with the team names
df.columns = [reverse_map[id_ + 1] for id_ in df.columns]
# output (OrderedDict):
# keys: parameters from the model
# values: estimated values for iterations of sampling (ND-array)
output[param] = df
# stan/maher.stan
data {
int<lower=1> n_teams;
int<lower=1> n_games;
int<lower=1, upper=n_teams> home_team[n_games];
int<lower=1, upper=n_teams> away_team[n_games];
int<lower=0> home_goals[n_games];
int<lower=0> away_goals[n_games];
}
parameters {
real home_advantage;
real offense_raw[n_teams - 1];
real defense_raw[n_teams - 1];
}
transformed parameters {
// Enforce sum-to-zero constraint
real offense[n_teams];
real defense[n_teams];
for (t in 1:(n_teams-1)) {
offense[t] = offense_raw[t];
defense[t] = defense_raw[t];
}
offense[n_teams] = -sum(offense_raw);
defense[n_teams] = -sum(defense_raw);
}
model {
vector[n_games] home_expected_goals;
vector[n_games] away_expected_goals;
// Priors (uninformative)
offense ~ normal(0, 10);
defense ~ normal(0, 10);
home_advantage ~ normal(0, 100);
for (g in 1:n_games) {
home_expected_goals[g] = exp(offense[home_team[g]] + defense[away_team[g]] + home_advantage);
away_expected_goals[g] = exp(offense[away_team[g]] + defense[home_team[g]]);
home_goals[g] ~ poisson(home_expected_goals[g]);
away_goals[g] ~ poisson(away_expected_goals[g]);
}
}
I want to follow home_goals and away_goals for each match so that I have a trace of possible match outcomes. Is this as simple as following these parameters or does the model need more development?
not so - this isn’t what the sampling statement means - from the Stan Reference Manual:
7.4 Sampling Statements
Stan supports writing probability statements also in sampling notation, such as
y ~ normal(mu,sigma);
The name “sampling statement” is meant to be suggestive, not interpreted literally. Conceptually, the variable y, which may be an unknown parameter or known, modeled data, is being declared to have the distribution indicated by the right-hand side of the sampling statement.
Executing such a statement does not perform any sampling. In Stan, a sampling statement is merely a notational convenience. The above sampling statement could be expressed as a direct increment on the total log probability as
You can use the generated_quantites block to sample the posterior predictive distribution for each team, for each game, and then normalise to get a frequency table of outcomes.
Shamelessly self promoting here a little, I do this in a Stan model here, which is a model based on the Karlis and Ntzoufras paper that is in the github repo you link.
So I tried to put in my generated_quantities block modelled off of your code.
It doesn’t seem to want to work.
generated quantities{
vector[n_games] post_home_goals;
vector[n_games] post_away_goals;
for (g in 1:n_games) {
post_home_goals[g] = poisson_rng(home_expected_goals[g]);
post_away_goals[g] = poisson_rng(away_expected_goals[g]);
}
}
I was under the impression that the generated_quantities block gets called at each sampling iteration of the MCMC. Does that not mean that for each ‘run’ of the sampler I will have a unique value for the home_expected_goals and away_expected_goals for that game, and therefore, should be able to sample from my poisson_rng… but the code doesn’t seem to be working giving the following error message:
libc++abi.dylib: terminating with uncaught exception of type std::invalid_argument
[1] 67889 abort ipython src/soccerstan.py 'data/example.csv' 'maher'
Where is home_expected_goals currently defined? In the model block? Then it wouldn’t be in scope in generated quantities. Perhaps it should be declared in the transformed parameters block?
So if i define it in the transformed parameters block then it can be used in the model block and the generated quantities block?
If I simply move the vector[n_games] home_expected_goals; definition to the transformed parameters block then i get the same error in that the model won’t compile
Yes, global vars declared in the transformed parameters block are in scope in the model and generated quantities blocks. Not sure if it would solve your problem though.
I can compile the same model if I just remove the generated quantities block, or even comment out its contents and so I am assuming that it’s not a C++11 & osx bug.