First of all, I want to congratulate the entire Stan team. The contribution you've made to the world of Bayesian modelling is incredible. Without you it would be much more difficult to apply Bayesian methods to our problems.
My problem is not strictly Stan-related but concerns Bayesian modelling more generally. Some context: I want to model the probability of a soccer team beating another team in any given fixture. Since probabilities are not observable, I train the model on the result. Soccer results are three-way (draws are allowed), but for simplicity I'm binarizing the outcome (win / no win). Coming from the ML world, I've set up a logistic regression model in Stan:
data {
  int<lower=0> N;                      // number of fixtures
  int<lower=0> num_features;
  matrix[N, num_features] X;           // one feature row per fixture
  int<lower=0, upper=1> y[N];          // 1 = win, 0 = draw or loss
}
parameters {
  vector[num_features] coefficients;
}
transformed parameters {
  vector[N] fitted_values = inv_logit(X * coefficients);
}
model {
  coefficients ~ normal(0, 10);        // weak prior on the logit scale
  y ~ bernoulli(fitted_values);        // equivalent: bernoulli_logit(X * coefficients)
}
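For concreteness, this is how I turn the fit into per-fixture win-probability distributions: each posterior draw of `coefficients` is pushed through `inv_logit` for a new feature row. A minimal numpy sketch, where the "posterior draws" and the feature values are made-up numbers for illustration, not real output of the fit:

```python
import numpy as np

def inv_logit(z):
    """Logistic function, matching Stan's inv_logit."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior draws of `coefficients` (intercept, points_diff),
# shaped (num_draws, num_features) -- in practice these come from the sampler.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[-0.6, 0.01], scale=[0.15, 0.005], size=(4000, 2))

# Feature rows for two unseen fixtures: [1 (intercept), points difference].
x_new = np.array([[1.0, 45.0],   # big points gap (e.g. Barcelona - Granada)
                  [1.0, 2.0]])   # small points gap (e.g. Celta - Leganes)

# One win-probability draw per posterior draw, per fixture.
p_draws = inv_logit(draws @ x_new.T)   # shape (4000, 2)
print(p_draws.mean(axis=0))            # posterior mean probabilities
```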
I start by training the model with an intercept and just one feature: the difference in points on the previous season. The following plot shows a bunch of samples from the posterior distribution for two unseen fixtures:
The blue line is the mean of the posterior. To get a sense of where the real probability might lie, I look at the odds a certain bookmaker gives for the event, represented by the orange line. Any soccer fan knows Barcelona is hugely favored to beat Granada, whereas Celta de Vigo and Leganes are probably more evenly matched. This is, indeed, what the bookmakers think.
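As an aside on how I read the bookmaker line: decimal odds imply probabilities that sum to more than 1 because of the bookmaker's margin, so I normalise them before comparing. A sketch with made-up odds (not real quotes):

```python
import numpy as np

def implied_probabilities(decimal_odds):
    """Convert three-way decimal odds to probabilities, removing the
    bookmaker's margin by simple normalisation."""
    raw = 1.0 / np.asarray(decimal_odds, dtype=float)  # sums to > 1 (the margin)
    return raw / raw.sum()

# Made-up decimal odds for win / draw / loss of the favourite.
probs = implied_probabilities([1.25, 6.0, 12.0])
print(probs, probs.sum())  # normalised probabilities sum to exactly 1
```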
From the plot, you see two things:
- Both distributions are centered in the ~0.30-0.40 range (blue line). This makes sense to me: with just one feature, I wouldn't expect the model to learn much.
- The only learning that does happen shows up in the variance of the posterior distribution. The single feature we have, the difference in points, should give some indication of Barcelona's superiority and of how evenly matched Celta and Leganes are. As a result, the model is less sure about Barcelona having a 0.45 probability of beating Granada than it is about the Celta-Leganes game.
This is very counter-intuitive to me. One would assume that when the model doesn't have much information to discriminate between Celta and Leganes, the mean should lie somewhere around the average win frequency (so far so good) but the variance of the posterior should be huge. As the model gets more information (Barça vs Granada), the mean should shift and the variance should shrink. Also, in general, I would expect much higher variance in both distributions, since the posterior assigns essentially zero mass to either team's win probability exceeding 0.5.
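A toy simulation suggests the plotted behaviour is actually what a logistic regression must do: posterior uncertainty lives on the logit scale, and the slope's uncertainty gets multiplied by the feature value before the squashing. With a made-up posterior over (intercept, slope), a fixture near x = 0 pins the probability close to inv_logit(intercept), while a large x stretches the slope's uncertainty into a wider probability posterior:

```python
import numpy as np

rng = np.random.default_rng(1)

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up posterior draws: uncertain intercept and slope on the logit scale.
a = rng.normal(-0.5, 0.2, size=20000)   # intercept draws
b = rng.normal(0.5, 0.3, size=20000)    # slope draws

p_even = inv_logit(a + b * 0.0)   # evenly matched fixture (x = 0)
p_big  = inv_logit(a + b * 3.0)   # lopsided fixture (x = 3)

# The lopsided fixture has the *wider* probability posterior, because the
# slope uncertainty is scaled by x before inv_logit is applied.
print(p_even.std(), p_big.std())
```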
As I add more features, this dynamic strengthens. However, the model never gives any team a high chance (over 0.5) of winning a game. I understand bookmakers tweak the implied probabilities upwards to build in a margin, but I would still expect Barça to have at least a 0.7-0.8 chance of beating Granada.
My hypothesis for this confusion is that, while I'm interested in the probability, I'm training on the binary outcome. So in the Celta-Leganes game the model is saying: given this information, I'm very sure the outcome of the game is close to a coin flip. And in the Barcelona-Granada game it is saying: I'm unsure about the actual probability of Barça beating Granada, since it could be much higher than the baseline (the ~0.30-0.40 range).
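That reading matches the usual split between uncertainty about the probability itself and the randomness of the outcome. A conjugate coin-flip toy makes the distinction concrete: more data shrinks the posterior over p, but it can never make a genuinely 50/50 outcome less random. (Purely illustrative numbers, unrelated to my actual data.)

```python
import numpy as np

def beta_posterior(wins, games, a0=1.0, b0=1.0):
    """Mean and std of the Beta(a0 + wins, b0 + losses) posterior over the
    win probability, starting from a uniform Beta(1, 1) prior."""
    a = a0 + wins
    b = b0 + (games - wins)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, np.sqrt(var)

# Same 50% win rate, increasing sample size: the posterior over p tightens
# around 0.5 even though each individual game remains a coin flip.
for n in (10, 100, 1000):
    print(n, beta_posterior(n // 2, n))
```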
Final questions:
- Given everything described above: what feedback would you give me on my model?
- How could I make my fitted probabilities behave more like a beta distribution, with more variance when there is less information and less variance when there is more? (I tried drawing fitted_values from a beta distribution and it didn't work as expected.)
- Should I be modelling the probability directly? If so, how? Should I use the bookmaker's implied probability as my target?