Doubts when modelling the probability of a team beating another team

First of all I want to congratulate the entire stan team. It’s incredible the contribution you’ve done to the world of bayesian modelling. Without you it would be so much more difficult to apply bayesian modelling to our problems.

My problem is not strictly stan-related but more generally on bayesian modelling. Some context on my problem: I want to model the probability of a soccer team beating another team on any given fixture. Since probabilities are not observable, I train the model on the result. Soccer results are three-way (draws allowed), but for simplicity I’m binarizing the result (win-no_win). Coming from the ML world, I’ve set up a logistic regression model on stan:

data {
    int<lower=0> N;
    int<lower=0> num_features;
    matrix[N, num_features] X;
    int<lower=0, upper=1> y[N];
parameters {
    vector[num_features] coefficients;
transformed parameters {
    vector[N] fitted_values = inv_logit(X * coefficients);
model {
    coefficients ~ normal(0, 10);
    y ~ bernoulli(fitted_values);

I start by training the model with an intercept and just one feature: the difference in points on the previous season. The following plot shows a bunch of samples from the posterior distribution for two unseen fixtures:
Screenshot 2020-11-21 at 12.11.17

The blue line is the mean of the posterior. In order to understand where the real probability might lie, I take a look at the odds that a certain bookmaker is giving to the event, which is represented with the orange line. Any soccer fan would know barcelona is hugely favored to beat granada, whereas celta de vigo and eibar are probably closer. This is, indeed, what the bookmakers think.
From the plot, you see two things:

  • Both distributions are centered around ~0.3x-0.4x (blue line). This makes sense to me as, with just one feature, I wouldn’t expect the model to learn much.
  • The only learning happening translates into the variance of the posterior distribution. The only feature we have: difference in points, should give some indication of the superiority of barcelona and the even-ness of celta and leganes. As a result, the model is less sure about Barcelona having a 0.45P of beating Granada than about the celta-leganes game.

This is very counter-intuitive to me. One would assume that if the model doesn’t have a lot of information to discriminate between Celta and Leganes, the mean should lie somewhere around the average frequency (all good) but variance of the posterior should be huge. As the model gets more information (Barsa vs Granada), the mean should shift and the variance should reduce. Also, in general, I would expect much higher variance on both distributions since the posterior considers that the probability of either team winning being higher than 0.5 is 0.

As I add more features, this dynamic strengthens. However, the model never gives such a high chance (over 0.5) to any team of winning a game. I understand bookmakers tweak the probabilities to be as high as possible in order to make more money, but I would expect Barsa to have around at least a 0.7-0.8 chance of beating Granada.

My hypothesis behind this confusion is that, while I’m interested in the probability, I’m using the binary target. So in the Celta-Leganes game, the model is saying: with this information I’m very sure the outcome of the game is random. And in the Barcelona-Granada game, the model is saying: I’m unsure about the actual probability of Barsa beating Granada, since it could be much higher than the baseline (0.3x).

Final questions:

  • After everything described: what feedback would you give me on my model?
  • How could I make my probabilities look more like a beta distribution with more variance as there’s less information and less variance as there is more information? (I tried specifying the fitted_values coming from a beta distribution and it wouldn’t work as expected)
  • Should I be modelling the probability directly? If so, how could I do this? Should I use the bookmaker’s probability as my target?

Hi, welcome to the Stan forums!

It looks like you are using the fixtures on the probability scale, which means it is a bit awkward for your model to use this information because you are doing the regression on the log-odds scale.

You could try including the fixtures on the logodds scale as well, so something like:
vector[N] fitted_values = inv_logit(logit(fixture) * coefficient). If the fixtures are correctly calibrated and coefficient is near 1, the inv_logit and logit cancel out and you get results that are close to your original input on the probability scale.

A nice way to think about probabilistic models is sometimes called the “information principle”. The idea is that a statistical model gets better when you include more information. This does not necessarily mean to include more predictors, sometimes it can already help to make the available information more accessible for your model.
In your case, you are throwing away a lot of data by using a binary outcome. When two teams compete, the goal difference can be an indicator of how much better team A is compared to team B. A badly performing team might get a lucky 1:0 against a top team, but you will probably never observe a 6:0. You might want to take a look at Gelman’s world cup model ( if you haven’t before.

When using the goal difference as an outcome, it of course becomes a bit more difficult to include the fixtures. A way to include this prior information might be to translate them to some shift in your intercept, so if you are using a normal or t distribution or whatever, you could translate a 70% win probability into some shift of magnitude x. This is just a quick idea off the top of my head, no idea if it works though.

The next step would probably be to add some partial pooling. You can see how quickly model complexity increases, so if at one point you feel overwhelmed, it might be a good idea to stay on a simpler model for some time. When using logistic regression as a base model, you could try modelling team ability by the difference of some latent variables for each team. You could also take a look at section 1.1 on the Stan model. It’s about item response theory models, but you might borrow something from there.

Anyways, your plots suggest that your model is missing some form of complexity, so maybe adding some team-specific parameters already helps.

Good luck with your modeling!


hi and welcome to statistical modeling,

There are many flavors of models for team matchups. The general idea is that there’s a latent variable for team ability and the outcome for a matchup between two teams is based on the difference between abilities.

Andrew’s 2014 World Cup model is an example of a Bradley-Terry model - there’s an unpublished case study here:

I’m not sure why this wasn’t finished up and published - I used it when putting together the CmdStanPy example for the 2019 Women’s World Cup:


Thank you very much to both of you!

You’re very right about this! My idea was to use the three classes via ordered logit at some point but I had started with a binary target for computational speed. I can’t believe I was so silly no to think about the score difference, which is not just richer in information but also much faster to model!

I’m a big fan of Gelman’s blog, it’s such an incredible source of ideas, but I had missed this post! This is great thank you so much.

This is also great. I had thought about trying to estimate an intercept per team and subtracting it, but I hadn’t thought about generalising this idea to estimating the quality of the two teams independently.

Again, thank you so much to both of you! I’m leaving with a lot of ideas to try :D

1 Like