Help with Bayesian Modelling

I am currently working on my postgraduate dissertation in statistics and I am using STAN extensively for this. My dissertation involves simulating a latent position model of chess games which estimates the ratings of the respective chess players. So, the model would be an alternative to the existing Elo Rating system.

My raw data consists of all the matches played during January 2024 on Lichess among all titled, non-BOT players. The only input I am using from this data are the names of each unique player and the outcome for each match.

With regards the STAN model, I have based the priors of both latent variables (ratings and white adjustment factor) on their respective distribution from the raw data.

Currently, my STAN model is as follows:
data {
int<lower=0> P; // number of players
int<lower=0> N_games_matrix[P, P]; // number of games matrix
int<lower=0> Y_matrix[P, P]; // scores matrix
}

parameters {
vector<lower=2004.86, upper=3200>[P-1] gamma_free;
vector<lower=0.015, upper=0.075>[P] W;
}

transformed parameters {
vector[P] gamma; // latent ratings for each player
gamma[1] = 0; // constrain the first player’s gamma to 0

for (p in 2:P) {
gamma[p] = gamma_free[p-1]; // assign the rest of the gammas
}
}

model {
// Priors
gamma_free ~ normal(2579.787, 191.9756); // prior for the free latent ratings
W ~ normal(0.045, 0.5); // prior for the white player advantage

// Likelihood
for (i in 1:P) {
for (j in 1:P) {
if (i != j) { // Ensure i is not equal to j
real eta = gamma[i] - gamma[j] + W[i];

    // Debugging: Print intermediate values and target log probability
    print("i: ", i, " j: ", j, " eta: ", eta);
    print("Y_matrix[i, j]: ", Y_matrix[i, j], " N_games_matrix[i, j]: ", N_games_matrix[i, j]);
    print("target(): ", target());
    
    Y_matrix[i, j] ~ binomial(2 * N_games_matrix[i, j], inv_logit(eta));
    
    // Debugging: Print updated target log probability
    print("Updated target(): ", target());
  }
}

}
}

However, I am consistently receiving errors like this:
“Chain 2: Log probability evaluates to log(0), i.e. negative infinity.
Chain 2: Stan can’t start sampling from this initial value.
Chain 2:
Chain 2: Initialization between (-2, 2) failed after 1 attempts.”

In addition, when I included print statements to see where exactly the intialization fails, I get the following for each estimate value:
Y_matrix[i, j]: 0 N_games_matrix[i, j]: 0
target(): -inf
Updated target(): -inf
i: 948 j: 714 eta: -7.06877
Y_matrix[i, j]: 0 N_games_matrix[i, j]: 0
target(): -inf

Since N_games_matrix[i, j] and Y_matrix[i, j] are zero, I am pretty sure the binomial distribution might be causing issues. Can you offer any advice on this matter?

Any help is greatly appreciated and all the best :)

Thanks,
Patrick

1 Like

As an aside,

may be confusing as the names indicate a matrix, but these are two dimensional arrays. Modern Stan syntax would have them like so:

array[P, P] int<lower=0> variable_name;

If you instead want a matrix, you would specify it like so:

matrix<lower=0>[P, P] variable_name;

Since in the model block you’re printing out these data variables and they are all zero, I assume this isn’t the data you are intending to pass into Stan, so you’ll need to check that you are actually passing Stan the proper structure.

Once you get that sorted …

I’m not familiar with what you are trying to achieve here, but your parameter constraints seem suspect to me. For gamma_free, you are saying it is impossible that the parameter falls below exactly 2004.86 or above 3200, for example, is this what you intend? The range of your prior says otherwise.

1 Like

The input matrices, N_games_matrix and Y_matrix, are not all zero though. The N_games_matrix is the number of games played by each pair of players, when a certain player is White. The Y_matrix is the cumulative score by each pair of players, when a certain player is White. Both matrices are sparse but there are some non-zero values in both matrices.

Secondly, the range of the gamma parameter should be Normally distributed and be between 2004.86 and 3200. However, I am unsure what prior to use to achieve this.

I appreciate ay help you can offer on this matter.

Regarding the zeros, I was responding to what you said here as you haven’t shown the data and such:

Hi Scott,

I understand what you mean now.

I attach the CSV file of my raw data. I am only using the fields “White”, “Black” and :Result" with my STAN model. The Elo ratings will be averaged per player and compared against the posterior ratings from my model.

In addition, I construct the 2 inputted matrices using the attached R script.

Once the data is prepared, I compile the STAN model with the prepared data using the attached MCMC R file script.

MCMC.R (3.48 KB)

filtered_games_between_TWO_titled_players_NO_BOTS_NAMES_LOWERCASE_NOWHITESPACE_NAMES_RATINGS_AND_RESULTS_ONLY_ALL_HEADINGS.csv (101 KB)

data_prepation_Input_Matrix_alternative.R (2.89 KB)

If you are seeing all zeros as you said, check in R that your two matrices:

data_list <- list(
  P = P,                              # Total number of unique players
  N_games_matrix = Number_of_games_matrix,  # Matrix of number of games
  Y_matrix = Score_per_matrix,             # Matrix of scores
  unique_players = all_players              # Unique player names
)

do not contain all zeros, actually have the data you expect. If they are large, Maybe make the matrix smaller so you can easily review it. And when you pass them into Stan you see that data.
For example, below shows that Stan gets the same data I gave it regardless of it being an array or matrix.

test.stan:

data {
  int<lower=0> P;
  array[P,P] real X;
  matrix[P,P] Y;
}

generated quantities {
  print("X =", X);
  print("Y =", Y);
}

In R,

m$sample(data = list(P = 2, X = diag(1:2), Y = diag(3:4)), fixed_param = TRUE, iter_warmup = 0, iter_sampling = 2)

Returns

Chain 1 X =[[1,0],[0,2]] 
Chain 1 Y =[[3,0],[0,4]]

Then, from your comment here:

And that you are incrementing target on all non-diagonal values (some or all or which have zeros),

means that you would see the errors you are getting.

Hi - sorry to jump in this late; I was interested because I’ve recently been modeling some chess games myself so I hope to learn from your approach :-)

A conceptual question: since chess also has draws as outcomes, does it make sense to use a binomial distribution for the scores?

Second, I agree with @ssp3nc3r that the constraints on your parameters can potentially harm your model. For instance, the value of W is constrained to be between 0.015 and 0.075, but this is then added to the difference between two parameters that are in the 2000-3000 range - does W have any impact then? I’m definitely not an expert on this, but I believe it would make sense to drop the constraint on the parameters entirely and instead use priors that place most of their mass on the range that you find most likely while allowing for the data to “surprise” you.