Modelling player strengths based on multiple observed rankings

A group of K players plays N games, and for each game we observe the ranking of the players who took part. The goal is to estimate individual player strengths.

The player strengths are S_k \sim N(0, 0.5). In game n, player k gives a performance P_{nk} \sim N(S_k, 1). The ranking for that game is determined by ordering the performances of its participants.

The players who participate in a given game can be any subset of the K players.

An example dataset is: there are 3 players, 3 games, and the observed rankings are 312, 12, 132.
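
For concreteness, here is a minimal Python sketch that simulates data of this shape from the generative model described above. It rests on two assumptions that are not stated explicitly: that the second argument of N(0, 0.5) is a standard deviation (as in Stan's convention), and that a ranking such as 312 lists player IDs from best to worst. The function `simulate_rankings` and its parameter names are purely illustrative.

```python
import random


def simulate_rankings(participants_per_game, num_players, strength_sd=0.5, seed=0):
    """Simulate strengths and per-game rankings (player IDs, best first)."""
    rng = random.Random(seed)
    # Assumption: 0.5 is the standard deviation of the strength prior.
    strengths = {k: rng.gauss(0.0, strength_sd) for k in range(1, num_players + 1)}
    rankings = []
    for participants in participants_per_game:
        # Each participant draws a performance centred on their strength.
        performances = {k: rng.gauss(strengths[k], 1.0) for k in participants}
        # Assumption: the ranking lists players from highest to lowest performance.
        ranking = sorted(participants, key=lambda k: performances[k], reverse=True)
        rankings.append(ranking)
    return strengths, rankings


# Same shape as the example: 3 players, 3 games, game 2 has only players 1 and 2.
strengths, rankings = simulate_rankings([[1, 2, 3], [1, 2], [1, 2, 3]], num_players=3)
```

Each simulated ranking is a permutation of that game's participants, mirroring the format of 312, 12, 132.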

How would you model this with Stan?


This question boils down to: what is the likelihood of an observed ordering of a vector of independent normal variates with potentially different means?

It turns out that this question has a somewhat tractable answer based on the multivariate normal CDF; see "Compute probability of a particular ordering of normal random variables" on Mathematics Stack Exchange.

However, Stan does not have a multivariate normal CDF function, and in general it is a hard one to compute. Some progress in this direction has been made in Stan, e.g. in the threads "Multivariate normal CDF" and "Multivariate normal cdf" (the titles are nearly identical, but these are different posts). Perhaps @spinkney or @martinmodrak has more to say about this?
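
As a sanity check for any analytic or approximate answer, the ordering probability can also be estimated by brute-force simulation. The sketch below is a plain Monte Carlo estimate in standard-library Python, not a multivariate normal CDF implementation; `ordering_probability` is a hypothetical helper name, and the unit performance variance is taken from the question.

```python
import math
import random
from statistics import NormalDist


def ordering_probability(means, num_draws=20000, seed=0):
    """Monte Carlo estimate of P(X_1 > X_2 > ... > X_m) for independent X_i ~ N(means[i], 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_draws):
        draws = [rng.gauss(mu, 1.0) for mu in means]
        # Count the draw if the performances come out in strictly decreasing order.
        if all(a > b for a, b in zip(draws, draws[1:])):
            hits += 1
    return hits / num_draws


# Two players: X_1 - X_2 ~ N(mu_1 - mu_2, 2), so the exact probability is available.
estimate_two = ordering_probability([0.5, 0.0])
exact_two = NormalDist().cdf(0.5 / math.sqrt(2.0))

# Three equally strong players: by symmetry each of the 3! orderings has probability 1/6.
estimate_three = ordering_probability([0.0, 0.0, 0.0])
```

The two-player and equal-strength cases have known answers, so they are useful for checking the estimator before applying it to unequal means.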


I’ve solved it with JAGS and dinterval.

# Specify that performances[1] > performances[2].
# `one` is observed data fixed at 1; dinterval(t, c) equals 1 exactly when t > c.
one ~ dinterval(performances[1], performances[2])

That’s a clever approach based on latent variables! The dimensionality of the auxiliary parameters is quite large (one for each player-by-game combination), so Stan might be a particularly good tool for estimation. You can use Stan’s ordered type to declare, for each game, a vector of latent performance scores that is constrained to respect the observed ordering, and then give each of these ordered vectors independent normal distributions with the appropriate mean vectors. Note that Stan’s ordered vectors are ascending, so the players in each game need to be listed from worst to best.

I haven’t actually verified analytically that this works to give the right likelihood, but intuitively it feels like it should work. If you’re confident that the dinterval solution works, then I’m pretty sure this must work too.
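
For what it’s worth, the intuition can be sketched in one line. The notation here is introduced only for this sketch: r = (r_1, ..., r_m) is the observed ranking of game n, assumed to be listed best first, and \phi is the standard normal density. The probability of the ranking given the strengths is the integral of the independent normal densities over the region where the performances respect the ordering:

```latex
\Pr(r \mid S)
  = \int \mathbf{1}\!\left[ P_{n r_1} > P_{n r_2} > \dots > P_{n r_m} \right]
    \prod_{j=1}^{m} \phi\!\left( P_{n r_j} - S_{r_j} \right) \, dP_{n r_1} \cdots dP_{n r_m}
```

Sampling the latent performances on the constrained region with independent normal densities targets exactly the integrand, so marginalising over them should recover this likelihood for S. This is only a sketch of the argument, not a full proof.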

If there’s a good way to compute the multivariate normal CDF in Stan, you can use it to marginalize out all of these latent variables, which would probably yield gains in computational performance.

Note also that there are families of distributions specifically for ranking data, most notably the exploded logit (see e.g. "A simple way to model rankings with Stan" by Bruno Nicenboim). There are also a number of published papers on Bayesian modelling of various racing sports, which typically focus on modelling rankings and player strengths.
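
To make the exploded logit concrete, here is a minimal standard-library Python sketch of its log-likelihood. Note that this is a different observation model from the normal-performance model in the original question: the best player is picked from the participants with probability proportional to exp(strength), then the next best from those remaining, and so on. The function name, the best-first ranking convention, and the strength values are assumptions made for illustration.

```python
import math


def exploded_logit_log_lik(ranking, strengths):
    """Log-probability of `ranking` (player IDs, best first) under the exploded logit."""
    log_lik = 0.0
    remaining = list(ranking)
    while len(remaining) > 1:
        # The next-best player is chosen among those still unranked with
        # probability proportional to exp(strength).
        scores = [strengths[k] for k in remaining]
        max_score = max(scores)
        # Log-sum-exp with the maximum subtracted for numerical stability.
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        log_lik += strengths[remaining[0]] - log_norm
        remaining.pop(0)
    return log_lik


strengths = {1: 0.3, 2: -0.1, 3: 0.0}  # made-up values for illustration
# Total log-likelihood of the example rankings 312, 12, 132.
total = sum(exploded_logit_log_lik(r, strengths) for r in ([3, 1, 2], [1, 2], [1, 3, 2]))
```

Because each game only involves its own participants, games with subsets of players need no special handling.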