Time-varying logistic regression: help with reparameterization

Hi,
I’m trying to fit a model that uses logistic regression to describe the time-varying ratings of multiple teams competing in a tournament. For a given match n, we assume the score between the teams follows a binomial logit model, i.e. the number of rounds won by a team depends on the rating difference between the two teams on the logit scale. For simplicity, I assume that team ratings vary over time according to a Brownian process. A Gaussian process would probably be better, but a Brownian process is easier to evaluate.

Unfortunately, Stan has real trouble sampling from my model. I assume this is because we are sampling the ratings directly, while the logistic regression only depends on the differences between ratings (a centered-parameterization problem). I tried adding a “mu” parameter so the mean of the ratings could drift, leaving me to sample from just a Gaussian distribution, but that doesn’t seem to work well either.

I think the most efficient way to obtain team scores would be to sample an M-by-M matrix of rating differences directly, with zeros along the diagonal. However, I’m not sure how to constrain a matrix to be antisymmetric in this way in Stan, nor what kind of prior I would use for such a matrix, even though I suspect it would be the most efficient sampling approach.
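For what it’s worth, an antisymmetric difference matrix does not need an explicit constraint: it can be built deterministically from a vector of ratings, since diff[i, j] = ratings[i] - ratings[j] is antisymmetric with a zero diagonal by construction. A minimal single-time-point sketch (parameter names here are hypothetical, not from my model):

  parameters {
    vector[M] ratings;  // one latent rating per team
  }
  transformed parameters {
    // diff[i, j] = ratings[i] - ratings[j]: antisymmetric, zero diagonal,
    // and automatically consistent (diff[i, j] = diff[i, k] + diff[k, j])
    matrix[M, M] diff = rep_matrix(ratings, M) - rep_matrix(ratings', M);
  }

Note that sampling all M(M - 1)/2 off-diagonal entries freely would over-parameterize the problem, because valid rating differences must satisfy diff[i, j] = diff[i, k] + diff[k, j]; the M ratings (or M - 1, after fixing one as a reference) are the actual degrees of freedom, so the prior question reduces to a prior on the ratings themselves.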

Here’s my current code. If you have any suggestions on how to avoid this centered-parameterization issue, please let me know.

  data {
    int<lower=0> N;          // number of matches
    int<lower=0> M;          // number of teams
    int roundscores[N];      // rounds won by the blue team in match n
    int totalrounds[N];      // total rounds played in match n
    int bluteamnumber[N];    // blue team index per match
    int redteamnumber[N];    // red team index per match
    real times[N];           // match times, assumed increasing
  }
  parameters {
    matrix[M, N] team_scores;
    real<lower=0> sd_change;
    real<lower=0> sd_teams;
    real<lower=-6, upper=6> mu[N];
  }
  transformed parameters {
    vector[N] score_diff;
    matrix[M, N] team_scores_star;
    for (n in 1:N) {
      team_scores_star[, n] = team_scores[, n] + mu[n];
      score_diff[n] = team_scores_star[bluteamnumber[n], n]
                      - team_scores_star[redteamnumber[n], n];
    }
  }
  model {
    sd_change ~ normal(0, 0.1);
    sd_teams ~ normal(0, 2);
    for (m in 1:M) {
      team_scores[m] ~ normal(0, sd_teams);
    }
    for (n in 2:N) {
      // Brownian motion: the sd of an increment scales with sqrt of elapsed time
      team_scores_star[, n] ~ normal(team_scores_star[, n - 1],
                                     sqrt(times[n] - times[n - 1]) * sd_change);
    }
    roundscores ~ binomial_logit(totalrounds, score_diff);
  }

Ok, I’ve made some progress. I found a post on a similar problem here: Using Stan for week-by-week updating of estimated soccer team abilites « Statistical Modeling, Causal Inference, and Social Science
It was extremely helpful: I now get time-varying scores as expected, and most of the diagnostics look fine, except that max_treedepth is frequently saturated. I’m not sure how to solve that, but my results are still greatly improved. Here is my new Stan code:

  data {
    int<lower=0> N;          // number of matches
    int<lower=0> M;          // number of teams
    int roundscores[N];      // rounds won by the blue team in match n
    int totalrounds[N];      // total rounds played in match n
    int bluteamnumber[N];    // blue team index per match
    int redteamnumber[N];    // red team index per match
    real times[N];           // match times, assumed increasing
  }
  parameters {
    real<lower=0> sd_change;
    vector[M] init_scores;
    matrix[M, N - 1] tau;    // standardized increments (non-centered parameterization)
  }
  transformed parameters {
    matrix[M, N] team_scores;
    vector[N] score_diff;
    team_scores[, 1] = init_scores;
    for (n in 2:N) {
      // Brownian motion: the sd of an increment scales with sqrt of elapsed time
      team_scores[, n] = team_scores[, n - 1]
                         + tau[, n - 1] * sd_change * sqrt(times[n] - times[n - 1]);
    }
    for (n in 1:N) {
      score_diff[n] = team_scores[bluteamnumber[n], n]
                      - team_scores[redteamnumber[n], n];
    }
  }
  model {
    init_scores ~ normal(0, 1);
    sd_change ~ normal(0, 0.1);
    to_vector(tau) ~ std_normal();
    roundscores ~ binomial_logit(totalrounds, score_diff);
  }
  generated quantities {
    // Center the scores at each time, since only rating differences are identified
    matrix[M, N] team_scores_star;
    for (n in 1:N) {
      team_scores_star[, n] = team_scores[, n] - mean(team_scores[, n]);
    }
  }

Hi,
sorry for not getting back to you earlier. The model is a bit too complex for me to debug quickly. I’ll note that most of the advice at Divergent transitions - a primer applies here. In particular, it would help to simplify the model until the issue disappears.

My first guess would be that since init_scores and tau enter the likelihood only through score_diff, you have a location non-identifiability: adding the same constant to all elements of team_scores gives exactly the same likelihood, because the constant cancels out of every difference. The solution would then be either to enforce a sum-to-zero constraint (see e.g. Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain) or to take one team as a reference and fix its scores to 0 (or another constant), so that the tau matrix has one fewer row. I am being a bit vague as I don’t understand the model well, but I hope the general idea is clear.
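To make the hard sum-to-zero option concrete, one possible sketch (reusing the identifiers from your model; this is just one way to impose the constraint) is to declare only M - 1 free initial scores and define the last one as minus their sum:

  parameters {
    vector[M - 1] init_scores_free;
  }
  transformed parameters {
    // Hard sum-to-zero constraint: the M-th score is determined by the others,
    // so adding a constant to all scores is no longer possible at time 1
    vector[M] init_scores = append_row(init_scores_free, -sum(init_scores_free));
  }

If the random walk reintroduces a drifting common mean at later times, the same trick (or a soft constraint such as sum(team_scores[, n]) ~ normal(0, 0.001 * M) in the model block) could be applied at each time step as well.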

Best of luck with your model!