Implementing a factor graph model for the TrueSkill algorithm

I am working to implement a model from this book - http://www.mbmlbook.com - specifically, the two-player TrueSkill algorithm described in Chapter 3. My initial attempt at this was:

data {
    int<lower=0,upper=1> JWin;
}
parameters {
   real JSkill;
   real FSkill;
   real JPerf;
   real FPerf;
}
model {
   JSkill ~ normal(120, 40);
   FSkill ~ normal(100, 5);

   JPerf ~ normal(JSkill, 5);
   FPerf ~ normal(FSkill, 5);

   if (JPerf > FPerf) {
     JWin ~ bernoulli(1.0);
   } else {
     JWin ~ bernoulli(0.0);
   }
}

I also tried changing the model block to avoid the use of if statements:

model {
   JSkill ~ normal(120, 40);
   FSkill ~ normal(100, 5);
   
   JPerf ~ normal(JSkill, 5);
   FPerf ~ normal(FSkill, 5);
   
   JWin ~ bernoulli(JPerf > FPerf);
}

But neither appeared to work. The observed data did not seem to influence the posterior of the parameters. After trying a few things and changing the model block to:

model {
   real p = JPerf > FPerf;
   JSkill ~ normal(120, 40);
   FSkill ~ normal(100, 5);
   
   JPerf ~ normal(JSkill, 5);
   FPerf ~ normal(FSkill, 5);
   
   JWin ~ bernoulli(p);
}

it seems to be working as expected. I am curious as to why these model blocks behave so differently. I am also curious whether anyone has an opinion on whether Stan is an appropriate tool for implementing this type of model. The only other discussion I have found about this was from a while ago on the stan-users Google group.

2 Likes

I think what happens is that by declaring p as real, Stan tries to calculate its derivatives (which it does not do for the if statement). I am actually amazed that the statement real p = JPerf > FPerf compiles and actually does something - I would’ve guessed this is a bug :-).

Nevertheless, in Stan, you need to supply bernoulli with a probability (a real number between 0 and 1), not an outcome (an integer of 0 or 1). This often means doing some calculations manually. From the properties of the normal distribution (the difference of two independent normals is itself normal, with variance equal to the sum of the variances), you quickly get:

J_{perf} - F_{perf} \sim N(J_{skill} - F_{skill}, \sqrt{5^2 + 5^2})

so we can use normal_cdf to get the probability that the difference is > 0, as in:

real p = normal_cdf(0, JSkill - FSkill, sqrt(50));

(the code is not tested, consider it to be just a sketch).

1 Like

Hi jeffmax,

I’m also interested in implementing skill ranking models in Stan. I’ve had a look at your models. For me, the first and second models give decent results (as in, no divergences post-warmup, decent n_eff, Rhat == 1), while the third model gives much worse results (lots of divergences and much lower n_eff).

That was just looking at the diagnostics. When retrieving the samples and comparing the results, only the third model actually makes sense: it is the only one for which the joint samples of Jperf and Fperf always satisfy Jperf >= Fperf. Just as you said, changing the data input for Jwin doesn’t change the inference for the first two models, but does for the third.

I’m not quite sure what’s going on and why these models are behaving so differently. I hope someone with more Stan experience will be able to shed some light. I did find one warning in the manual regarding ‘step-like functions’ applied to parameters (https://mc-stan.org/docs/2_18/functions-reference/step-functions.html), which may be important here.

edit: it seems Martin posted while I was typing this - I’ll have a look into his suggestions…

1 Like

Hi Martin,

I had a look at your suggested parameterization but it doesn’t seem to work; while the value of p seems to be OK, the inference is impossible - I get many values where Fperf > Jperf, even though Jwin == 1 (this should be impossible).

Technically, a Bernoulli distribution with a parameter of 0.0 or 1.0 is allowed - it’s a coin-toss that always comes up tails (or heads). In this model it is used because the underlying graphical model uses a deterministic node here (i.e. if(Jperf > Fperf, Jwin = 1, else Jwin = 0)). In other words, the event Jwin == 1 is completely determined once Jperf and Fperf are known.

I get slightly better sampling results using a non-centered parameterization:

data {
  int<lower=0, upper=1> Jwin;
  real Jskill_true;   // prior mean for J's skill (120 in the original model)
  real Jskill_sd;     // prior sd for J's skill (40 in the original model)
  real Fskill_true;   // prior mean for F's skill (100 in the original model)
  real Fskill_sd;     // prior sd for F's skill (5 in the original model)
  real skill_sd;      // performance noise around each skill (5 in the original model)
}
parameters {
  real Jskill_raw;
  real Fskill_raw;
  real Jperf_raw;
  real Fperf_raw;
}
transformed parameters {
  real Jskill = Jskill_raw * Jskill_sd + Jskill_true;
  real Fskill = Fskill_raw * Fskill_sd + Fskill_true;
  real Jperf = Jperf_raw * skill_sd + Jskill;
  real Fperf = Fperf_raw * skill_sd + Fskill;
}
model {
  real p;
  p = (Jperf > Fperf);
  Jskill_raw ~ normal(0, 1);
  Fskill_raw ~ normal(0, 1);
  Jperf_raw ~ normal(0, 1);
  Fperf_raw ~ normal(0, 1);
  Jwin ~ bernoulli(p);
}

(Here I specified additional data inputs so I could experiment with different values).

Still, I’d like to know why the above model works, while

model {
  Jskill_raw ~ normal(0, 1);
  Fskill_raw ~ normal(0, 1);
  Jperf_raw ~ normal(0, 1);
  Fperf_raw ~ normal(0, 1);
  Jwin ~ bernoulli(Jperf > Fperf);
}

doesn’t. Does anyone have a suggestion?

1 Like

I forgot to say that in my model, JPerf and FPerf are rendered superfluous (they are integrated out of the distribution for p) and are not in any way connected to the observed data (JWin), so their ordering is not constrained. You could delete them from the model and get the same results for p, JSkill and FSkill.

In my model, if you actually need to reason about JPerf or FPerf, you need to reconstruct them back from p with some algebra (but the only thing you actually know is that JPerf > FPerf, so I don’t think this inference is of much value).
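That said, if you do want draws of JPerf and FPerf that are consistent with the observed outcome, one mechanical option (rejection sampling in a generated quantities block, rather than closed-form algebra) would be something like the sketch below - untested, assuming the observed outcome was JWin == 1 and that the model has JSkill and FSkill as parameters, with the performance noise of 5 from the original model:

generated quantities {
  real JPerf;
  real FPerf;
  {
    // Draw performances from their distributions given the skills, then
    // keep only draws consistent with the observed outcome JWin == 1
    // (i.e. JPerf > FPerf). This can be slow when p is small.
    real jp = normal_rng(JSkill, 5);
    real fp = normal_rng(FSkill, 5);
    while (jp <= fp) {
      jp = normal_rng(JSkill, 5);
      fp = normal_rng(FSkill, 5);
    }
    JPerf = jp;
    FPerf = fp;
  }
}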

It is important to remember that Stan works with gradients and thus requires everything to be continuous. The fact that p = (Jperf > Fperf); somehow works is either a weird quirk or a bug and I would not put much effort into investigating it.

1 Like

Thanks for the clarification! I just quickly added your code without thinking about it properly - I’ll take some time to work out the math.

I agree; I also thought before actually running the models that ‘collapsing’ part of the parameter space via (Jperf > Fperf) would not work (or at least give very strange sampling results). It seems magical that it works - but then again, to me Stan often seems magical ;-)

Thanks @martinmodrak and @MauritsM for looking at this; it is very helpful. I tried updating the model to:

data {
  int<lower=0,upper=1> JWin;
}
parameters {
  real JSkill;
  real FSkill;
}
model {
  real p = normal_cdf(0, JSkill - FSkill, sqrt(50));
  JSkill ~ normal(120, 40);
  FSkill ~ normal(100, 5);
  
  JWin ~ bernoulli(p);
}  

This does seem to be working, but in reverse. If J wins, JSkill goes to about where I would expect if J lost, and vice versa. I will continue to work with this to see if I can wrap my head around it.

One of the challenges I have is knowing when to expect a model to work. I know it cannot have any discrete parameters - I guess that by introducing an interim variable in the model block that is not technically a parameter, but still needs to be part of the derivative calculations, I have really just found a way to get past the compiler error about discrete parameters? It is odd that it does work, though. I find it challenging to integrate out discrete latent parameters, so I am curious what can be done before resorting to that. It is probably not an accident that these examples come from a book based on Infer.NET, which uses expectation propagation to solve this problem, but I am definitely interested in seeing how best to implement this in Stan.

Yes, you actually need

real p = 1 - normal_cdf(0, JSkill - FSkill, sqrt(50));

This was a small typo (I was about to comment on this when I saw your post).
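Putting it together with your model above, the corrected version would be (again untested, just a sketch):

data {
  int<lower=0,upper=1> JWin;
}
parameters {
  real JSkill;
  real FSkill;
}
model {
  // P(JWin == 1) = P(JPerf - FPerf > 0), where
  // JPerf - FPerf ~ normal(JSkill - FSkill, sqrt(50))
  real p = 1 - normal_cdf(0, JSkill - FSkill, sqrt(50));
  JSkill ~ normal(120, 40);
  FSkill ~ normal(100, 5);

  JWin ~ bernoulli(p);
}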

I agree with the rest of your post; the way the model is described in the book naturally leads to their approach with EP, but you could rephrase the story of the model so that it is more consistent with your Stan model above.

1 Like

I think the technical reason is different, but it is probably not worth investigating here.

Note that your model does not have discrete parameters per se: there are continuous parameters, and a discrete choice is then based on those parameters. However, once the continuous parameters are known, there is no remaining uncertainty in the value of JWin. The problem with your initial attempts was that you tried to have a deterministic observation model (once JPerf and FPerf are known, JWin is 100% known and no longer stochastic). Stan, however, does not work with deterministic observation models; data have to be tied to parameters via probability distributions. Stan would similarly break (although with a syntax error) if JWin were a real and you tried to have JWin = JPerf - FPerf as your observation model. Does that make sense?
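If for some reason you wanted to keep JPerf and FPerf as explicit parameters, you would need to make the link to JWin stochastic. A sketch of one way to do that (my own variation, not the book's model - it adds an extra layer of comparison noise with a hypothetical scale obs_noise, so the results will differ somewhat from the marginalized model):

data {
  int<lower=0, upper=1> JWin;
  real<lower=0> obs_noise;  // hypothetical extra comparison-noise scale
}
parameters {
  real JSkill;
  real FSkill;
  real JPerf;
  real FPerf;
}
model {
  JSkill ~ normal(120, 40);
  FSkill ~ normal(100, 5);
  JPerf ~ normal(JSkill, 5);
  FPerf ~ normal(FSkill, 5);
  // Smooth probit link: the further JPerf is above FPerf, the more likely
  // JWin == 1, but the outcome is never fully determined by the parameters.
  JWin ~ bernoulli(Phi((JPerf - FPerf) / obs_noise));
}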

1 Like

Yes, that explanation is helpful, specifically that the issue here is the deterministic component. Thanks for taking the time to respond!