How to reduce model runtime?

I have a hierarchical model on pairwise comparisons (tennis players). I’m running it on a small example now–there are J=50 players who played against each other in N=4411 matches. I add p=3 covariates to the model. Even on this small example it takes a long time to run (several hours), and this is really problematic when I’d like to get a version working for J=1200 players and N=25000 matches. The model is:

data {
  int<lower=0> J;         // total number of players
  int<lower=0> N;         // number of matches
  int<lower=0> p;  // the number of covariates for each player (besides the skill parameters)
  matrix[N, 3*J] X_skill;  // design matrix for the skills
  matrix[N, p] X1;  // characteristic player 1 
  matrix[N, p] X2;  // characteristic player 2  
  int<lower=2, upper=3> winner_sets[N];  // number of sets won by the winner for match n
  int<lower=2, upper=5> total_sets[N];  // number of total sets for match n

parameters {
  vector[3*J] mu; // means of the player skills
  vector<lower=0>[3*J] sigma;  // SDs of the player skills
  vector[p] beta; // coefficients on player characteristics
  vector[3*J] alpha;  // player skills on hard, clay, and grass
model {
  // priors
  mu ~ double_exponential(0, 1);
  beta ~ normal(0, 1);
  alpha ~ normal(mu, sigma); 
  // sample
  winner_sets ~ binomial_logit(total_sets, (X1 * beta) - (X2 * beta) + (X_skill * alpha));

So I am using a uniform, improper prior over sigma. (SEE EDIT BELOW)

My code in the sampling statement is vectorized, and it runs fine. It just takes too long to run. The data are:

Thank you for any help you can provide!

EDIT: Okay, I just ran the model with 4 chains and iter=1000. It finished in a somewhat reasonable 45 minutes, but there were “1982 divergent transitions after warmup”. I looked at the following traceplot:

It seems the improper prior on sigma is causing the chains to be extremely sticky. I’m going to try to use a Half-Cauchy prior instead with location 0 and scale 1. I hope my understanding is right, and I’ll report back with how goes.

1 Like

The trace plots indicate that you have identifiability issues, no?

There’s a case study that never got published on the Bradley-Terry model which you might find useful:


You clearly have identifiability issues. Maybe start by setting sigma equal to unity. I’d also write (X1 - X2) * beta, which should run faster (smaller AD expression tree => lower memory usage).

ADD: And to scale to large data sets this size you will likely need reduce_sum based within-chain parallelization for which there is a case study on


ADD: And to scale to large data sets this size you will likely need reduce_sum based within-chain parallelization for which there is a case study on

I think the model is misspecified. You don’t want to estimate a separate sigma per player per skill, but rather one sigma per skill (so only 3 values in total). Once this works, consider correlated skills, i.e. estimating a 3x3 covariance matrix. There is a “right” way to do it in Stan using the Cholesky parametrisation of multivariate normal. Feel free to ask here when you get to this point.

1 Like

Actually the idea was to estimate a different sigma per player per skill. The idea is that a given player’s performance might vary depending on the surface of the court, of which there are three (hard, clay, grass).

That said, perhaps overparameterization is still an issue. I might try learning J different sigma’s instead of 3*J, which is less flexible but may be worth the tradeoff. What do you think?

OK, I got it now. You are trying to estimate per-player variation in performance, and on each surface separately. This can only work if you have ample data on every surface for every player. If you have even a single player without much data (or worse yet, with a single game), you will fail to estimate their sigma, and you can expect the problems to propagate to the other parameters. Your idea to have a single per-player variance might help a little, but is still likely overambitious, and not in line with domain expertise (there may be players who are consistent on one surface but not others).

There are established models in soccer analytics which estimate player skills based on the assumption that they (players/skills) form a single population with some variance. This is what I was getting at above, and it is where I would start with your data. If you insist on per-player variances, you may need a hyperprior on sigmas that can keep the poorly-identified ones in a reasonable neighbourhood, as defined by those that are well identified. But I’d be surprised if it worked well.


Thanks for your thoughts. Yeah, I think the issue is going to be then that there just aren’t many grass court matches. My thought was that for somebody like Nadal (if you don’t know, he’s the best clay court player ever by far), his sigma on clay might be quite small, while on grass it might be considerably larger. That was the idea. But I understand what you’re saying.

I will try your suggestion and try to estimate a different sigma for each surface, making sigma a 3-vector. Due to fewer matches being played on grass, I’d expect it to have the largest sigma of the surfaces. Again appreciate your insights and will keep this thread going if anything interesting comes up.

Just a minor comment:

This is not what I expect. The grass sigma will measure the spread of the players’ skills on grass. Its posterior mean/median will not be related to the amount of data.

Good luck and please do report back.

1 Like

Follow-up question: I’ve implemented a model where I learn 3 different sigma’s, one for each surface, as you suggested. I’m also using variational Bayes for efficiency. Interestingly, some of the players in my data set who’ve played a TINY amount of matches are getting rated extremely highly. For example, there’s a player named Dragos Dima who has 2 wins and 0 losses on hard court in my training set, and he hasn’t played at all on the other surfaces. Nonetheless, he is still rated as having the 3rd best hard court skill, 6th best clay court skill, and 4th best grass court skill in terms of the posterior mean. This is confusing to me, as I thought a limited amount of matches would mean the prior distribution would dominate these players’ ratings, and that because of my model, they should be close to 0. (What has gone as expected is that the credible intervals for these kinds of players are extremely wide).

What do you think? Does this issue get back to what you were saying earlier about if a player has not played at all on certain surfaces, there will be issues that propagate throughout the estimation process?

In the random effects soccer models I’m used to, a player like you describe would be shrunk to the mean very strongly (this may be wrong in itself by the way, since him having played so few matches presumably means that he’s below average). I don’t know if variational Bayes plays nicely with this kind of models, but you should not be forced to use it. If your model is a reasonable representation of data and implemented correctly, hundreds of thousands of games and tens of thousands of players should be doable with HMC overnight (if not much faster). And even if it is slower, it is well worth doing it at least once to see if all the diagnostics pass.

If you wouldn’t mind linking me to one or two of these soccer papers, that’d be great!

Don’t have them at hand. Look for Szczepański & McHale, 2014 and 2015. But they are straightforward frequentist mixed models, won’t help you with your fitting issues.

Hi everyone, I want to say thanks to both @huffyhenry and @wds15 or their answers. I changed my model to say that player skills on hard, clay, and grass surfaces come from normal distributions centered at 0 with their own scale parameters. I also am using some reduce_sum parallelization and things are running much faster. Upon training and testing my model I am getting results comparable to some Bradley Terry tennis models in the statistical literature, so I think this is positive.

Now I want to ask @huffyhenry his thoughts on estimating a 3x3 covariance matrix for the player skills. I suppose the idea is that, if we let \theta_i be the player skills for player i (a 3-vector), we want to say that \theta_i \sim N_3(0, \Sigma), where \Sigma is the covariance matrix. This will allow the skills to be correlated across surfaces, which I believe is a better model.

You can see my current model here, which assumes the player skills are independent across each surface.

functions {
  real partial_sum_lpmf(int[] slice_winner_sets, int start, int end, int[] total_sets, matrix X, vector p, int c, int J) {
    return binomial_logit_lupmf(slice_winner_sets | total_sets[start:end], block(X, start, 1, end-start+1, c*J)*p);
data {
  int<lower=0> J;         // total number of players
  int<lower=0> N_hard;         // number of matches on each surface
  int<lower=0> N_clay;
  int<lower=0> N_grass;
  matrix[N_hard, J] X_skill_hard;  // design matrix for skills on hard
  matrix[N_clay, 2*J] X_skill_clay;  // design matrix for skills on clay
  matrix[N_grass, 2*J] X_skill_grass;  // design matrix for skills on grass
  int winner_y_hard[N_hard];  // number of sets won by the winner for match n
  int winner_y_clay[N_clay];  // number of sets won by the winner for match n
  int winner_y_grass[N_grass];  // number of sets won by the winner for match n
  int total_n_hard[N_hard];  // number of total sets for match n
  int total_n_clay[N_clay];  // number of total sets for match n
  int total_n_grass[N_grass];  // number of total sets for match n
  int<lower=1> grainsize;
parameters {
  real<lower=0> sigma_hard;
  real<lower=0> sigma_clay;
  real<lower=0> sigma_grass;
  vector[J] alpha_hard;  // player skills on hard
  vector[J] alpha_clay;  // player skills on clay
  vector[J] alpha_grass;  // player skills on grass
transformed parameters {
  vector[2*J] alpha_hard_clay = append_row(alpha_hard, alpha_clay);
  vector[2*J] alpha_hard_grass = append_row(alpha_hard, alpha_grass);
  vector[3*J] alpha = append_row(append_row(alpha_hard, alpha_clay), alpha_grass);
model {
  // priors
  sigma_hard ~ inv_chi_square(3); 
  sigma_clay ~ inv_chi_square(3);  
  sigma_grass ~ inv_chi_square(3);  
  alpha_hard ~ normal(0, sigma_hard);
  alpha_clay ~ normal(0, sigma_clay);
  alpha_grass ~ normal(0, sigma_grass);
  // sample
  target += reduce_sum(partial_sum_lupmf, winner_y_hard, grainsize, total_n_hard, X_skill_hard, alpha_hard, 1, J);
  target += reduce_sum(partial_sum_lupmf, winner_y_clay, grainsize, total_n_clay, X_skill_clay, alpha_hard_clay, 2, J);
  target += reduce_sum(partial_sum_lupmf, winner_y_grass, grainsize, total_n_grass, X_skill_grass, alpha_hard_grass, 2, J);

Good to hear that the model performs better. If you intend to work with uncorrelated skills any further, you should rewrite it to use the non-centered parametrisation for your alphas. See e.g. 25.7 Reparameterization | Stan User’s Guide. With this you won’t need any within-chain parallelisation, it’s overkill for 50 players and 4411 matches.

You describe the correlated effects correctly. The parametrisation should be in terms of standard deviations of skills and the Cholesky factor of their correlation matrix. Check out the second example here: Using a LKJ prior in Stan.

The inverse chi-squared priors on your sigmas will force some between-player variation in skills even if the data does not support it. Are you sure that this is what you want? Half-Cauchy priors are often recommended for variance components.

Thanks for providing those links, and I assume your last point about the inverse chi-squared priors is in relation to the “correlated skills” model. Do you mean that they’ll force non-zero within-player correlations in the player skills? The correlation between skills between player should always be independent. Or am I misunderstanding something?

That comment applies to both models. The three sigmas measure the extent of between-player variation in the three skills, whether they’re correlated or not. Your current priors put an arbitrary soft lower bound on what this variation can be on each surface. It may be a good idea if it represents genuine domain knowledge, but it is a questionable choice as a default.

1 Like

I have a followup question about model runtime. When I fit the model and only estimate the alpha’s (the player skills), the model takes 6-7 minutes to finish. I run 4 chains of 500 warmup iterations and 500 sampling iterations. When I estimate the alpha’s AND the coefficients on the covariates beta at the same time (as in my original post), the runtime is much, much longer, sometimes taking an hour. If I estimate the alpha’s on their own and then plug in their posterior means to estimate beta, runtime is 10-15 minutes for 500 warmup iteration and 250 sampling iterations. All of this is curious to me because of the following

  • There are hundreds of alpha’s, and only 3 beta’s to estimate. So shouldn’t it be more expensive to estimate the alpha’s?
  • Maybe something is wrong with a normal prior on beta? …
  • A priori, I know the estimates for beta should be small, between 0 and 1 in absolute value. They could be as small as 10^{-4}. This is why I assign a N(0,1) prior in my original post
  • Could normalizing or standardizing the covariates help?

Thanks for any thoughts.

The HMC algorithm works fastest when the posterior variance of every actual parameter is ~1. This is the case for your alphas if you use the non-centered parametrisation, but not for betas. The fix is to rescale them: define a new vector beta_raw in the parameters block, and move your betas to transformed parameters defined as beta = 0.0001*beta_raw.

I’m still suspicious of your running times though given the small dataset. Maybe post your most recent code or a link to the repository.

Edit: standardising covariates will indeed help if it results in putting betas on the same unit scale as alphas. If so, it’s the cleaner solution.