Pystan Follow Model Outputs (soccer match outcomes) using Generated Quantities block

tommylees112 · October 17, 2018, 8:44am

I found an awesome GitHub repo that presents a fully functioning Poisson model for estimating the outcome of football matches and other parameters of interest.

I want to know how to follow the modelled home_goals[g] and away_goals[g] so that I can build a grid of probabilities for different scores. These are fed in as ‘data’ but at the end of the model they are also estimated as home_goals[g] ~ poisson(home_expected_goals[g]).

How do I follow these parameters from my code?

The relevant lines of Python (the model fitting) are here: src/soccerstan.py, lines 79-104.

    # dict of the model data
    model_data = {
        'n_teams': len(team_map),
        'n_games': len(data),
        'home_team': data['home_team_id'],
        'away_team': data['away_team_id'],
        'home_goals': data['home_goals'],
        'away_goals': data['away_goals']
    }
    # sample the model
    fit = stan_model.sampling(data=model_data, **kwargs)
    output = fit.extract()
    # odict_keys(['home_advantage', 'offense_raw', 'defense_raw',
    #             'offense', 'defense', 'lp__'])

    # Tidy the output a little...
    reverse_map = {v: k for k, v in team_map.items()}
    for param in model.team_parameters:
        df = pd.DataFrame(output[param])

        # rename the columns with the team names
        df.columns = [reverse_map[id_ + 1] for id_ in df.columns]
        # output (OrderedDict):
        #            keys: parameters from the model
        #          values: estimated values for iterations of sampling (ND-array)
        output[param] = df

# stan/maher.stan
data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  int<lower=1, upper=n_teams> home_team[n_games];
  int<lower=1, upper=n_teams> away_team[n_games];
  int<lower=0> home_goals[n_games];
  int<lower=0> away_goals[n_games];
}

parameters {
  real home_advantage;
  real offense_raw[n_teams - 1];
  real defense_raw[n_teams - 1];
}

transformed parameters {
  // Enforce sum-to-zero constraint
  real offense[n_teams];
  real defense[n_teams];

  for (t in 1:(n_teams-1)) {
    offense[t] = offense_raw[t];
    defense[t] = defense_raw[t];
  }

  offense[n_teams] = -sum(offense_raw);
  defense[n_teams] = -sum(defense_raw);
}

model {
  vector[n_games] home_expected_goals;
  vector[n_games] away_expected_goals;

  // Priors (uninformative)
  offense ~ normal(0, 10);
  defense ~ normal(0, 10);
  home_advantage ~ normal(0, 100);

  for (g in 1:n_games) {
    home_expected_goals[g] = exp(offense[home_team[g]] + defense[away_team[g]] + home_advantage);
    away_expected_goals[g] = exp(offense[away_team[g]] + defense[home_team[g]]);

    home_goals[g] ~ poisson(home_expected_goals[g]);
    away_goals[g] ~ poisson(away_expected_goals[g]);
  }
}

Data is here:
example.csv (97.6 KB)

I want to follow home_goals and away_goals for each match so that I have a trace of possible match outcomes. Is this as simple as following these parameters or does the model need more development?

mitzimorris · October 17, 2018, 11:30am

Not so. This isn’t what the sampling statement means. From the Stan Reference Manual:

7.4 Sampling Statements
Stan supports writing probability statements also in sampling notation, such as

y ~ normal(mu,sigma);

The name “sampling statement” is meant to be suggestive, not interpreted literally. Conceptually, the variable y, which may be an unknown parameter or known, modeled data, is being declared to have the distribution indicated by the right-hand side of the sampling statement.

Executing such a statement does not perform any sampling. In Stan, a sampling statement is merely a notational convenience. The above sampling statement could be expressed as a direct increment on the total log probability as

target += normal_lpdf(y | mu, sigma);
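To make the quoted point concrete for the Poisson model in this thread, here is a minimal Python sketch of what a statement like home_goals[g] ~ poisson(home_expected_goals[g]) contributes: it only adds a log-probability term to a running total and draws no random numbers. The helper names poisson_lpmf and log_target, and the example numbers, are illustrative assumptions and not part of Stan or PyStan.

```python
import math

def poisson_lpmf(k, lam):
    """Log of the Poisson probability mass function, log p(k | lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_target(goals, expected_goals):
    """Accumulate log probability the way repeated `target +=` statements would."""
    target = 0.0
    for k, lam in zip(goals, expected_goals):
        target += poisson_lpmf(k, lam)
    return target

# Made-up observed goals and expected goals for three games.
observed = [2, 0, 1]
expected = [1.5, 0.8, 1.2]

print(log_target(observed, expected))
# The observed data are untouched: nothing was sampled.
print(observed)
```

The observed goals are only read, never overwritten, which is why home_goals cannot be "followed" as a model output.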

tommylees112 · October 17, 2018, 2:07pm

That is really informative, thank you Mitzi. Is it possible that what I want is a generated quantities block to produce sampled match outcomes?

Thanks so much for your reply!

hhau · October 17, 2018, 2:08pm

You can use the generated quantities block to sample the posterior predictive distribution for each team, for each game, and then normalise to get a frequency table of outcomes.

Shamelessly self-promoting here a little: I do this in a Stan model here, which is based on the Karlis and Ntzoufras paper that is in the GitHub repo you link.
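The "normalise to get a frequency table" step can be sketched in Python roughly as follows. This is a minimal sketch, not the code from either repo: score_grid and max_goals are hypothetical names, and the inputs are assumed to be one game's posterior predictive draws (for example, one column of an array of simulated goals pulled out of fit.extract()).

```python
import numpy as np

def score_grid(home_draws, away_draws, max_goals=6):
    """Return a (max_goals + 1) x (max_goals + 1) table of score frequencies.

    Entry [h, a] is the share of draws with h home goals and a away goals.
    Scores above max_goals are clipped into the last row or column.
    """
    home = np.clip(np.asarray(home_draws).astype(int), 0, max_goals)
    away = np.clip(np.asarray(away_draws).astype(int), 0, max_goals)
    grid = np.zeros((max_goals + 1, max_goals + 1))
    # Count each (home, away) pair, including repeated pairs.
    np.add.at(grid, (home, away), 1)
    return grid / grid.sum()

# Four made-up draws for a single game: 1-0, 1-0, 2-1, 0-0.
grid = score_grid([1, 1, 2, 0], [0, 0, 1, 0], max_goals=3)
print(grid)
```

With real output you would call this once per game, passing that game's column of simulated home and away goals.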

tommylees112 · October 18, 2018, 8:46pm

So I tried to add a generated quantities block modelled on your code.

It doesn’t seem to work.

generated quantities{
  vector[n_games] post_home_goals;
  vector[n_games] post_away_goals;
  for (g in 1:n_games) {
    post_home_goals[g] = poisson_rng(home_expected_goals[g]);
    post_away_goals[g] = poisson_rng(away_expected_goals[g]);
  }
}

I was under the impression that the generated quantities block gets called at each sampling iteration of the MCMC. Doesn’t that mean that for each ‘run’ of the sampler I will have a unique value of home_expected_goals and away_expected_goals for each game, and should therefore be able to sample from poisson_rng? But the code doesn’t work and gives the following error message:

libc++abi.dylib: terminating with uncaught exception of type std::invalid_argument
[1]    67889 abort      ipython src/soccerstan.py 'data/example.csv' 'maher'

Thanks again, and sorry for such newbie questions.

Matthijs · October 18, 2018, 8:59pm

Where is home_expected_goals currently defined? In the model block? Then it wouldn’t be in scope in generated quantities. Perhaps it should be declared in the transformed parameters block?

tommylees112 · October 18, 2018, 9:07pm

So if I define it in the transformed parameters block, then it can be used in both the model block and the generated quantities block?

If I simply move the vector[n_games] home_expected_goals; definition to the transformed parameters block, then I get the same error: the model won’t compile.

Matthijs · October 18, 2018, 9:11pm

Yes, global vars declared in the transformed parameters block are in scope in the model and generated quantities blocks. Not sure if it would solve your problem though.

hhau · October 18, 2018, 9:32pm

That is an error message I haven’t seen before. Yikes.

Yes, once you get it working :)

ahartikainen · October 18, 2018, 9:56pm

Are you running this with IPython?

Can you run pystan.stanc on your model code in interactive mode? If everything works, can you try running it with normal Python?

Also, can you compile other models, and does this error occur with 2.17.1?

It might be some kind of C++11 + OS X bug.

tommylees112 · October 19, 2018, 7:28am

I can compile the same model if I just remove the generated quantities block, or even comment out its contents, so I am assuming that it’s not a C++11 + OS X bug.

Yep, it’s PyStan 2.17.1:

pystan                    2.17.1.0         py36hf8a1672_2    conda-forge

I get the same error running with

ipython src/soccerstan.py 'data/example.csv' 'maher'

or

python src/soccerstan.py 'data/example.csv' 'maher'

hhau · October 19, 2018, 12:05pm

This works for me (using RStan):

data {
  int<lower=1> n_teams;
  int<lower=1> n_games;
  int<lower=1, upper=n_teams> home_team[n_games];
  int<lower=1, upper=n_teams> away_team[n_games];
  int<lower=0> home_goals[n_games];
  int<lower=0> away_goals[n_games];
}

parameters {
  real home_advantage;
  real offense_raw[n_teams - 1];
  real defense_raw[n_teams - 1];
}

transformed parameters {
  // Enforce sum-to-zero constraint
  real offense[n_teams];
  real defense[n_teams];

  vector[n_games] home_expected_goals;
  vector[n_games] away_expected_goals;

  for (t in 1:(n_teams-1)) {
    offense[t] = offense_raw[t];
    defense[t] = defense_raw[t];
  }

  offense[n_teams] = -sum(offense_raw);
  defense[n_teams] = -sum(defense_raw);


  
  for (g in 1:n_games) {
    home_expected_goals[g] = exp(offense[home_team[g]] + defense[away_team[g]] + home_advantage);
    away_expected_goals[g] = exp(offense[away_team[g]] + defense[home_team[g]]);
  }

}

model {
  

  // Priors (uninformative)
  offense ~ normal(0, 10);
  defense ~ normal(0, 10);
  home_advantage ~ normal(0, 100);

  home_goals ~ poisson(home_expected_goals);
  away_goals ~ poisson(away_expected_goals);
  
}

generated quantities{
  vector[n_games] post_home_goals;
  vector[n_games] post_away_goals;
  for (g in 1:n_games) {
    post_home_goals[g] = poisson_rng(home_expected_goals[g]);
    post_away_goals[g] = poisson_rng(away_expected_goals[g]);
  }
}

The only weird thing I can think of is that you might need an empty line after the closing brace of the generated quantities block.
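For a fixture that is not in the training data, the same simulation that generated quantities performs can be emulated in Python from the posterior draws. The following is a sketch under stated assumptions: offense and defense are arrays of shape (n_draws, n_teams) and home_advantage has shape (n_draws,), as one would expect from fit.extract() before the DataFrame tidying step; team indices are 0-based here, unlike Stan's 1-based indices; the draws below are synthetic stand-ins, and simulate_fixture and outcome_probabilities are hypothetical helpers.

```python
import numpy as np

def simulate_fixture(offense, defense, home_advantage, home_idx, away_idx, seed=0):
    """Draw one simulated score per posterior draw, mirroring poisson_rng."""
    rng = np.random.default_rng(seed)
    # Same rate formulas as the Stan model, evaluated for every posterior draw.
    home_rate = np.exp(offense[:, home_idx] + defense[:, away_idx] + home_advantage)
    away_rate = np.exp(offense[:, away_idx] + defense[:, home_idx])
    return rng.poisson(home_rate), rng.poisson(away_rate)

def outcome_probabilities(home_goals, away_goals):
    """Share of simulated scores ending in a home win, a draw, or an away win."""
    return {
        "home_win": float(np.mean(home_goals > away_goals)),
        "draw": float(np.mean(home_goals == away_goals)),
        "away_win": float(np.mean(home_goals < away_goals)),
    }

# Synthetic stand-ins for posterior draws (NOT real model output).
n_draws, n_teams = 1000, 4
fake = np.random.default_rng(1)
offense = fake.normal(0.0, 0.2, size=(n_draws, n_teams))
defense = fake.normal(0.0, 0.2, size=(n_draws, n_teams))
home_advantage = fake.normal(0.3, 0.05, size=n_draws)

home_goals, away_goals = simulate_fixture(offense, defense, home_advantage, 0, 1)
probs = outcome_probabilities(home_goals, away_goals)
print(probs)
```

Because each simulated score uses a different posterior draw of the parameters, the resulting probabilities carry the parameter uncertainty as well as the Poisson noise.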
