World Cup model

Hey all,

At this link you can find a multinomial model for the 2018 soccer World Cup in Russia:

A preliminary simulation was obtained with R; further updates will be implemented in Stan, and I hope to include some Stan code.

Glad to receive feedback!
Looking forward to seeing you soon.



You might be interested in Andrew’s world-cup example, which we’ve been using for teaching purposes, or Milad’s case study on the Premier League, which has an accompanying video presentation.

The ball is round and the game lasts for 90 minutes.
After the game is before the game.
Sepp Herberger.


m_{n,\{1,2,X\}} is a softmax, I suppose. The formula above is not.

My concern is:


Can we extrapolate? Can we say that because it rained yesterday, it will rain today?
Can we say that because Germany won the last championship, it will win this one too?
Clearly our past data say so, but we know that's not how it works. I claim that if we fit a model, it does just that.

To be a softmax, it’d have to be \exp(\eta_{nj}) in the denominator. It’s probably just a typo that it’s not.


Thanks Bob!
I already knew both Andrew's World Cup model and Milad's model for the Premier League. In fact, I was largely inspired by these models and enjoyed reading them.


Actually, instead of the softmax parametrization, I used the alternative multinomial logistic parametrization here (Multinomial logistic regression - Wikipedia), modeling K-1 = 2 probabilities and the K-th (the draw, in this case) as:

1 / \left( 1 + \sum_{k=1}^{K-1} \exp(\beta_k x) \right)

However, I now realize there is a typo: I did not exponentiate the etas in the denominators, and the sum runs from 1 to K instead of 1 to K-1. Thanks!


As I explained in the comments section of Andrew's blog (Stan goes to the World Cup | Statistical Modeling, Causal Inference, and Social Science), this table only reports the estimated probabilities obtained by simulating the World Cup 10,000 times before each game is played. Thus, Germany is favored mainly because of its high FIFA ranking, rather than past historical results.

The nomenclature around all this is very inconsistent and confusing. What you're calling "multinomial logistic" is just softmax with one of the inputs pinned to 0. The 0 in the version you're using (the 1 after exp(0)) identifies the model, but comes with the disadvantage that priors become asymmetric. There's a discussion in the manual around K vs. K - 1 parameterizations of multinomial logistic regression.
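A tiny Python sketch of that identification point (the linear-predictor values here are made up for illustration, not taken from the model):

```python
import math

def softmax(etas):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

# Plain softmax is invariant to adding a constant to every linear
# predictor, so a K-predictor parameterization is not identified...
eta = [0.8, -0.3, 0.2]            # win, loss, draw (made-up values)
shifted = [e + 5.0 for e in eta]  # same probabilities as eta

# ...which is why the "multinomial logistic" version pins one predictor
# (e.g. the draw) to 0: exp(0) = 1 in the denominator identifies the
# model, at the cost of asymmetric priors on the remaining K-1 predictors.
pinned = softmax([0.8, -0.3, 0.0])
```

Shifting all predictors leaves the probabilities unchanged, which is exactly the degree of freedom that pinning one input to 0 removes.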

Yeah, I see your point and I agree: the nomenclature I used above was confusing. Anyway, thanks for the suggestion about the priors.

I fixed the typos highlighted in your previous comment.

Took a second look at the model:

  1. \eta_{n.} does not have an intercept resp. home-advantage parameter. Is there any reason for that?
     At the same time you have \mu_{att} in att_t, and the same for defense. This is a constant for all t, so
     both \eta_{n.} get it added. Is this a case of an identifiability problem?

  2. The model uses a mixture. What about using sensor fusion instead?

Mmh, I still have to think about it. Anyway, for the time being, mu_att and mu_def no longer appear in the model.
See my website for model and prediction updates for the quarterfinals starting today!

I had no idea what sensor fusion was, thanks for the suggestion!

Hi @LeoEgidi

I saw your model a while ago and was impressed, but couldn't really follow it.
I would like to try it myself now that I have a little more modeling experience, but the links seem to be broken. Could you upload the models again?

There is a version of Andrew's World Cup model updated for the 2019 FIFA Women's World Cup available here:

Github repo with models and data

One thing Bob pointed out while I was working on this is that the model is a variant of the Bradley-Terry model, used to infer team abilities: each team has an estimated ability, modeled as the expected number* of goals it will score per game. The difference between team abilities predicts who will win the match.
(*the number of goals is modeled as a continuous value, which isn't correct)
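As a rough illustration of the expected-goals idea (with made-up abilities, and a Poisson count model as a discretized stand-in for the continuous version described above):

```python
import math
import random

def poisson_draw(mu, rng):
    # Knuth's multiplication sampler; fine for the small means in soccer.
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_match(mu_home, mu_away, n_sims=10000, seed=42):
    # Monte Carlo win/draw/loss probabilities when each team's goals
    # are Poisson counts drawn from its expected-goals "ability".
    rng = random.Random(seed)
    wins = draws = 0
    for _ in range(n_sims):
        g_home = poisson_draw(mu_home, rng)
        g_away = poisson_draw(mu_away, rng)
        if g_home > g_away:
            wins += 1
        elif g_home == g_away:
            draws += 1
    return wins / n_sims, draws / n_sims, 1.0 - (wins + draws) / n_sims
```

The team with the larger ability wins more often, but the gap in abilities, not just its sign, determines how often.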


@mitzimorris Thanks!!

*number of goals is modeled as a continuous value, which isn’t correct

Is that because Stan cannot deal with discrete parameters?

It's possible to deal with discrete parameters in Stan by marginalizing them - see:

I didn’t do this because it was a first example, but it should be added - working on it.


Matsutakehoyo, Mitzi:

The issue with the soccer models is not discrete parameters; it’s discrete data. Stan has no problem with discrete data. The only difficulty is that then we can’t use a simple normal or t distribution. The simplest way to proceed with a full generative model would be to use a continuous distribution with rounding, but then the likelihood is more complicated and expensive to compute, as it will be based on the normal or t cumulative distribution function. In practice, it makes more sense to just fit the continuous model to the data, do rounding when simulating fake data or posterior predictive checks, and then check that nothing much is lost by the rounding. I did this when playing with the original World Cup model. In any case, if you do want to fit a model to the discrete data, no marginalization is necessary, as there will be no latent discrete parameters.
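For the "continuous distribution with rounding" likelihood mentioned above, here is a minimal Python sketch for the normal case (parameters are made up; the cost Andrew mentions comes from these CDF evaluations):

```python
import math

def normal_cdf(x, mu, sigma):
    # Normal CDF via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def rounded_normal_pmf(k, mu, sigma):
    # P(round(Y) = k) for continuous Y ~ Normal(mu, sigma): the density
    # mass falling in the interval [k - 0.5, k + 0.5). This is the
    # CDF-based likelihood that is more expensive than a plain
    # continuous fit.
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)

# e.g. probability of observing exactly 1 goal under Normal(1.4, 1.0)
p_one = rounded_normal_pmf(1, 1.4, 1.0)
```

The simpler alternative in the post, fitting the continuous model and rounding only when simulating fake data, avoids these CDF terms in the likelihood entirely.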


I think this model handles things in a discrete way.

One can also use the Skellam distribution to model the (discrete) goal difference. It's a bit slower than other approaches, but it worked quite nicely when I tried it. I think not much is gained in terms of predictive power, though. Here's a pretty straightforward Stan function for the lpmf:

  real skellam_lpmf(int k, real mu1, real mu2) {
    int abs_k = abs(k);
    real lp = -mu1 - mu2 + 0.5 * k * (log(mu1) - log(mu2))
              + log(modified_bessel_first_kind(abs_k, 2 * sqrt(mu1 * mu2)));
    return lp;
  }
Hey all!

Let me reply to some points:

Thanks a lot for your interest! I restored the files; there was a silly mistake. Now you can click on the hyperlink at this link:

and download the zip folder with the html files and all the R/Stan code to fit the models.

Yeah, this was my first, earlier model, for the Euro Cup 2016. Good memories! Though, in my opinion, the models for the 2018 World Cup I posted above are better written and clearer.

This is interesting. I am writing an R package to fit many alternative soccer models (Dixon & Coles, Bradley & Terry, Karlis & Ntzoufras, Baio & Blangiardo, Egidi et al., etc.); I could include your World Cup model as well, if you agree.


As @andrewgelman said, it's data, so that's fine. The predicted number of goals will probably be an expectation and hence should be continuous.

I believe @andrewgelman likes these continuous approximations, even normal ones where you get the possibility of not only real-valued goals, but negative ones.

The alternative would be to have something like a Poisson or negative binomial or other count-based model of the data rather than a continuous approximation. Then the parameters of the Poisson (or alternative) would be continuous, but it’d be the right shape for the data.

The code is open source licensed under the new BSD license, so it doesn’t require our permission.