Modelling the winner of board games


I have a hard time wrapping my head around a model we’d like to fit.

The objective is to estimate the probability of winning of each member of a board-game group.

The deterministic part of the model would be relatively simple: an intercept per player quantifying the averaged probability of winning, and a series of hierarchical effects: one for game identity, one for the number of players playing.

However, the response is ragged: not every player was present at each game. This leads to another aspect: one player can have a high probability of winning when another is absent, but a lower chance when another is absent. The table would look like :


Would someone have an idea about how to model this? I could create predictors for the presence or absence of a player and fit individual models, but I feel that a multivariate model with all players would be far more interesting (and a statistical challenge for me. We could also use ranks or scores). However, the ragged structure of the response blocks me, and I have no idea about how to formulate a model that could deal with such a structure.

Would anybody have an idea?? The good thing is that we have A LOT of data points for some of the configurations :P



Maybe something like an occupancy model (from ecology) with predictors and accounts for correlation between players?

have you looked at attempts to hack the Elo ratings system for multiplayer games? (everyone agrees that hack is appropriate)


Thank you for your answer, this is a great link (for other courses too :p). I found this pymc thread :

It seems that the “zero-inflation mixture” is unnecessary if you know which 0 are “absence related”, and you can make a brute force mask of the 0 proba for a player to win given he/she is absent (last post).

However, I like the mixture model approach proposed in the 4th post. It is unlikely to work in my case I think because : I do not have all combinations of player represented, and I am not patient enough to code the number of combinations…

Have a good day!

Edit : error


I have read the approach, and I find it great! I do not know how to translate this algorithmic approach into probabilistic statement, though. I guess it could be possible to approximate it in a stan program?

I could be a great idea to dive deeper inside, and the result should be simpler that the one described in this link, because there is no team, only individual results.

Have a good day!

Hi Lucas,

A couple other possibilities would be TrueSkill (discussion), TrueSkill2, the Stochastic Rank Ordered Logit model, or Plackett-Luce (overview, Stan implementation, 2012 paper on Bayesian version).

One way to avoid ragged arrays is to have data in tidy format like this:


In order to keep things vectorized you can also book-keep which row a game starts on and how many players there are, like:

data {
    // constants
    int n_rows;
    int n_players;
    int n_games;
    int n_game_kinds;

    // tidy data
    int player_won[n_rows];    // outcome
    int game_id[n_rows];       // which game is this row about
    int player_id[n_rows];     // which player is this row about

    // metadata
    int<lower=1, upper=n_game_kinds> game_kind[n_games]; // e.g. chess, checkers, ...

    int<lower=1, upper=n_rows> game_start[n_games];      // which row does game j start in?
    int<lower=2> n_players_in_game_j[n_games];           // how many rows does game j comprise?
parameters {
    vector[n_players] latent_abilities;    // whatever you're trying to model
model {
    latent_abilities ~ std_normal();   // prior of some kind
    for (j in 1..n_games) {
        int start = game_start[j];
        int end = start + n_players_in_game_j[j];
        int players_j[n_players_in_game_j] = player_id[start:end];
        int results_j[n_players_in_game_j] = player_won[start:end];
        int game_kind_j = game_kind[game_id[j]];
        vector[n_players_in_game_j] abilities_j = latent_abilities[players_j];
        results_j ~ your_likelihood_here(abilities_j, game_kind_j, ...);

That’s just a sketch, haven’t checked for syntax or off by one errors, but you can generalize from there, so instead of won in {0, 1} you might have rank in [1, 2, …] (e.g. in a horse race) or score (e.g. in a basketball game), or something more custom like # of seconds behind 1st place finisher (e.g. in a footrace).

1 Like



These are excellent suggestions! I will take a read at the links you provided, thank you very much :)


One more note: that data block is written to show what’s going on, it’s actually not very efficient or terse. You could write it like this using the segment() function and without all the intermediate variables:

    for (j in 1..n_games) {
        segment(player_won, game_start[j], n_players_in_game_j[j]) ~ your_likelihood_here(
            latent_abilities[segment(player_id, game_start[j], n_players_in_game_j[j])],