Multinomial Logit: probability of choice of a soccer action

Hello,

I’d like to build a Stan model (in Python) to get the probability of an action choice in a soccer game.
Specifically, I have 3 kinds of actions: shots, passes, and dribblings. Each soccer player “can choose” one of these options. Really, they are not chosen; each has a probability of being chosen (the three probabilities sum to 1).

Then, my dependent variable y_i is one of these 3 possible actions, i.e. it is a categorical variable that contains these options (1: shot, 2: pass, 3: dribbling).

The independent variables in my data are the soccer player involved in the action (player_i), the area of the playing field where the action takes place (zone_i), and some other variables.

So I think a multinomial logit model is a good option to get those probabilities, but I have never built a model of this type (other Stan models, yes).

Could someone help me express the “multinomial logit” model and tell me how I should format the data that will be the input to the model?

Thanks!!

Sounds like this is similar to your previous model but now you have a three-state outcome instead of a binary one.

The main change would be categorical_logit instead of bernoulli_logit.
Here’s an example of how that might look.

data {
    int<lower=0> Nobs; // number of observations
    int Nplayers; // number of players
    int Nzones; // number of zones
    
    int<lower=1,upper=Nplayers> player_id[Nobs];
    int<lower=1,upper=Nzones> zone_id[Nobs];

    int<lower=1,upper=3> action[Nobs];
}
parameters {
    vector[3] beta_player[Nplayers]; // per-player effects on (shot, pass, dribbling) log odds
    vector[3] beta_zone[Nzones];     // per-zone effects on (shot, pass, dribbling) log odds
}
model {
    for (i in 1:Nplayers){
        beta_player[i] ~ normal(0,1);
    }
    for (i in 1:Nzones) {
        beta_zone[i] ~ normal(0,1);
    }
    for (i in 1:Nobs) {
        action[i] ~ categorical_logit(beta_player[player_id[i]] + beta_zone[zone_id[i]]);
    }
}
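
On the data-input side of your question: PyStan’s sampling method takes a plain Python dictionary whose keys match the names in the data block. A minimal sketch (mine, with toy arrays standing in for your dataset; Stan expects 1-based integer codes):

import numpy as np

# toy stand-ins for the real data (1-based codes, as Stan expects)
player_id = np.array([1, 2, 2, 1])
zone_id   = np.array([3, 1, 2, 3])
action    = np.array([2, 1, 3, 2])  # 1: shot, 2: pass, 3: dribbling

datos = {
    "Nobs": len(action),
    "Nplayers": int(player_id.max()),
    "Nzones": int(zone_id.max()),
    "player_id": player_id,
    "zone_id": zone_id,
    "action": action,
}
# fit = model.sampling(data=datos, iter=2000, chains=4)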

The above model has some nonidentifiability because only differences between beta values matter for prediction. Because of this, the betas will probably have very vague posterior distributions no matter how much data you have. Despite that, the posterior predictive distribution should be sharp.
It’s possible to fix the nonidentifiability, but I left it in to keep this starting-point model simple.
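
To see that invariance numerically, here’s a quick sketch (mine, not part of the model): shifting all three betas by the same constant gives exactly the same probabilities.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtracting the max avoids overflow
    return e / e.sum()

beta = np.array([0.3, -1.2, 0.5])
print(softmax(beta))         # probabilities for (shot, pass, dribbling)
print(softmax(beta + 10.0))  # identical, so the data cannot pin down the shift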


Thank you very much!

First, thank you for helping me with the construction of the model. I think that with this I can do something more sophisticated.

Now, there is something that is not clear to me: when running the model, will I get one beta coefficient associated with each independent variable for each of the outcome categories? I mean, will I get one beta_{player_i} associated with shots, another beta_{player_i} associated with passes, and another associated with dribblings? Or am I wrong?

Again, thank you very much for your help.

Yes, that’s correct. A player’s beta[1] is associated with shots, beta[2] is associated with passes, and beta[3] is associated with dribblings.


Uhh, no clue. Maybe @ahartikainen knows what the error means.
EDIT: @Sebastian_Mena removed the post so I guess it’s fixed?

Something to do with a compiler.

If there are still some errors, please let me know, we can fix them.

Thanks to both, I solved the problem.

@nhuurre or @ahartikainen, do you know how to make the model run faster? It takes about 6 days to run.

The model is as follows:

    choices_model = """

    data {
        int<lower=0> Nobs; // number of observations 513769
        int Nplayers; // number of players 515
        int Nzones; // number of zones 12
        int Nloc; // number of localias 2
        int Nres; // number of results 3
        int Ntimes; // number of times 7
        
        int<lower=1,upper=Nplayers> player_id[Nobs];
        int<lower=1,upper=Nzones> zone_id[Nobs];
        int<lower=1,upper=Nloc> loc_id[Nobs];
        int<lower=1,upper=Nres> res_id[Nobs];
        int<lower=1,upper=Ntimes> time_id[Nobs];

        int<lower=1,upper=3> action[Nobs];
    }
    parameters {
        vector[3] beta_player[Nplayers];
        vector[3] beta_zone[Nzones];
        vector[3] beta_loc[Nloc];
        vector[3] beta_res[Nres];
        vector[3] beta_time[Ntimes];
    }
    model {
        for (i in 1:Nplayers){
            beta_player[i] ~ normal(0,1);
        }
        for (i in 1:Nzones) {
            beta_zone[i] ~ normal(0,1);
        }
        for (i in 1:Nloc){
            beta_loc[i] ~ normal(0,1);
        }
        for (i in 1:Nres){
            beta_res[i] ~ normal(0,1);
        }
        for (i in 1:Ntimes){
            beta_time[i] ~ normal(0,1);
        }
        for (i in 1:Nobs) {
            action[i] ~ categorical_logit(beta_player[player_id[i]] + beta_zone[zone_id[i]] 
                                                + beta_loc[loc_id[i]] + beta_res[res_id[i]] + beta_time[time_id[i]]);
        }
    }
    """

And I run the sampling like this:

choice_fit = choices_reg.sampling(data=datos,
                          iter=500, chains=3,
                          warmup=200, n_jobs=-1,
                          seed=42)

Maybe I could lower the number of chains or iterations; do you think that would make a significant difference?

I would appreciate it if you could help me.

Yikes. I see you have a lot of observations (Nobs = 500k).
The numbers of localias and times are much, much lower, so that loop repeats the same calculation many times. I tried to remove most of that redundancy; here’s the result:

transformed data {
    int Nall = Nzones*Nloc*Nres*Ntimes; // 12*2*3*7 = 504
    int Kz = Nloc*Nres*Ntimes;
    int Kl = Nres*Ntimes;
    int Kr = Ntimes;
    int p_index[Nobs];
    for (n in 1:Nobs) {
        p_index[n] = Kz*(zone_id[n]-1) + Kl*(loc_id[n]-1) + Kr*(res_id[n]-1) + time_id[n];
    }
}
parameters {
    matrix[3,Nplayers] beta_player;
    matrix[3,Nzones] beta_zone;
    matrix[3,Nloc] beta_loc;
    matrix[3,Nres] beta_res;
    matrix[3,Ntimes] beta_time;
}
model {
    matrix[3,Nall] beta_all;
    for (z in 1:Nzones) {
        vector[3] bz = beta_zone[:,z];
        for (l in 1:Nloc) {
            vector[3] bl = bz + beta_loc[:,l];
            for (r in 1:Nres) {
vector[3] br = bl + beta_res[:,r];
                for (t in 1:Ntimes) {
                    beta_all[:,(z-1)*Kz+(l-1)*Kl+(r-1)*Kr+t] = br + beta_time[:,t];
                }
            }
        }
    }
    to_vector(beta_player) ~ std_normal();
    to_vector(beta_zone) ~ std_normal();
    to_vector(beta_loc) ~ std_normal();
    to_vector(beta_res) ~ std_normal();
    to_vector(beta_time) ~ std_normal();
    for (i in 1:Nobs) {
        action[i] ~ categorical_logit(beta_player[:,player_id[i]] + beta_all[:,p_index[i]]);
    }
}
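
As a sanity check on that flat index (my own sketch, not from the original post), the formula maps each (zone, loc, res, time) combination to a distinct index in 1..Nall:

Nzones, Nloc, Nres, Ntimes = 12, 2, 3, 7
Kz, Kl, Kr = Nloc*Nres*Ntimes, Nres*Ntimes, Ntimes

indices = set()
for z in range(1, Nzones + 1):
    for l in range(1, Nloc + 1):
        for r in range(1, Nres + 1):
            for t in range(1, Ntimes + 1):
                indices.add(Kz*(z-1) + Kl*(l-1) + Kr*(r-1) + t)

# every combination gets a unique slot in 1..504, no collisions
assert indices == set(range(1, Nzones*Nloc*Nres*Ntimes + 1))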

I took another stab at this.

You have 515 players and 504 predictor states. That makes a total of 515 × 504 = 259,560 possible unique (player, p_index) pairs, but you have almost twice that many observations, so there must be some duplicate predictor conditions. That allows you to compress the dataset to about half its size by combining duplicate conditions and recording the actions as counts for each predictor combination, so that the data looks like

data {
    int Nobs;
    int<lower=1,upper=Nplayers> player_id[Nobs];
    int<lower=1,upper=Nzones*Nloc*Nres*Ntimes> pred_index[Nobs];
    int<lower=0> actions[Nobs,3];
}

where actions[i,1] is the observed number of shots, actions[i,2] is the number of passes and actions[i,3] is the number of dribblings.
Counts of repeated categorical draws follow the multinomial distribution.

model {
    ...
    for (i in 1:Nobs) {
        actions[i] ~ multinomial_logit(beta_player[:,player_id[i]] + beta_all[:,pred_index[i]]);
    }
}

(PyStan doesn’t have multinomial_logit. You can either use multinomial(softmax(...)) instead or install CmdStanPy.)

However, I think there’s a bigger problem with the model.

The beta parameter to categorical/multinomial has three components, with the components controlling the probability of a shot, a pass or a dribbling, respectively.

The idea is that larger beta[1] increases the probability of a shot and larger beta[2] increases the probability of a pass. But it’s not quite that simple because the three probabilities must add up to one. Increasing one beta necessarily decreases the other two probabilities. And changing all three betas by the same amount leaves the probabilities unchanged. Although it is convenient to parametrize the distribution by three numbers there are in fact only two degrees of freedom.

In theory this is not a problem; the model will simply integrate the extra parameter over its prior.
Unfortunately, in practice (1) it complicates the interpretation of beta, and (2) the model has one redundant parameter for every player, zone, localia, etc., so a full third of all parameters are useless, and that’s quite a lot of wasted work.

One way to remove the extra parameters is this:

parameters {
    matrix[2,Nplayers] beta_player;
    matrix[2,Nzones] beta_zone;
    matrix[2,Nloc] beta_loc;
    matrix[2,Nres] beta_res;
    matrix[2,Ntimes] beta_time;
}
model {
    matrix[2,Nall] beta_all;
    for (z in 1:Nzones) {
        vector[2] bz = beta_zone[:,z];
        ...
    }
    ...
    for (i in 1:Nobs) {
        vector[2] beta = beta_player[:,player_id[i]] + beta_all[:,pred_index[i]];
        actions[i] ~ multinomial_logit(append_row(0.0, beta));
    }
}

Here append_row builds the three-component vector [0.0, beta[1], beta[2]].
The shot probability is fixed as the reference value: beta[1] > 0 means a pass is more likely than a shot; beta[1] < 0 means a pass is less likely than a shot. beta[2] works the same way for dribblings.
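
As a concrete illustration (my own sketch with made-up numbers), here is how the two free components map onto the three probabilities; note Python is 0-indexed where Stan is 1-indexed:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

beta = np.array([0.7, -0.4])            # hypothetical (pass, dribbling) coefficients
probs = softmax(np.append(0.0, beta))   # like append_row(0.0, beta) in Stan
# probs = [shot, pass, dribbling] ≈ [0.27, 0.55, 0.18]:
# the pass coefficient is positive, so a pass is more likely than a shot;
# the dribbling coefficient is negative, so a dribbling is less likely than a shot.
print(probs)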

Fixing one value like that makes thinking about priors a bit more complicated, but a standard normal on the logit scale is quite noninformative anyway, so maybe you’re not too concerned about priors.


Thank you so much for your interest and help!

I understood most of what you explained to me; however, I have doubts regarding the modifications to the model. This is the current model that I have:

I would like to know if it has any errors before trying it.


data {
    int<lower=0> Nobs; // number of observations 513769
    int Nplayers; // number of players 515
    int Nzones; // number of zones 12
    int Nloc; // number of localias 2
    int Nres; // number of results 3
    int Ntimes; // number of times 7
        
    int<lower=1,upper=Nplayers> player_id[Nobs];
    int<lower=1,upper=Nzones> zone_id[Nobs];
    int<lower=1,upper=Nloc> loc_id[Nobs];
    int<lower=1,upper=Nres> res_id[Nobs];
    int<lower=1,upper=Ntimes> time_id[Nobs];
    
    int<lower=1,upper=Nall> pred_index[Nobs];

    int<lower=0> action[Nobs,3];
}
transformed data {
    int Nall = Nzones*Nloc*Nres*Ntimes; // 12*2*3*7 = 504
    int Kz = Nloc*Nres*Ntimes; // 2*3*7 = 42
    int Kl = Nres*Ntimes; // 3*7 = 21
    int Kr = Ntimes; // 7
    int pred_index[Nobs];
    for (n in 1:Nobs) {
        pred_index[n] = Kz*(zone_id[n]-1) + Kl*(loc_id[n]-1) + Kr*(res_id[n]-1) + time_id[n];
    }
}
parameters {
    matrix[2,Nplayers] beta_player;
    matrix[2,Nzones] beta_zone;
    matrix[2,Nloc] beta_loc;
    matrix[2,Nres] beta_res;
    matrix[2,Ntimes] beta_time;
}
model {
    matrix[2,Nall] beta_all;
    for (z in 1:Nzones) {
        vector[2] bz = beta_zone[:,z];
        for (l in 1:Nloc) {
            vector[2] bl = bz + beta_loc[:,l];
            for (r in 1:Nres) {
vector[2] br = bl + beta_res[:,r];
                for (t in 1:Ntimes) {
                    beta_all[:,(z-1)*Kz+(l-1)*Kl+(r-1)*Kr+t] = br + beta_time[:,t];
                }
            }
        }
    }
    to_vector(beta_player) ~ std_normal();
    to_vector(beta_zone) ~ std_normal();
    to_vector(beta_loc) ~ std_normal();
    to_vector(beta_res) ~ std_normal();
    to_vector(beta_time) ~ std_normal();
    
    for (i in 1:Nobs) {
        vector[2] beta = beta_player[:,player_id[i]] + beta_all[:,pred_index[i]];
        action[i] ~ multinomial_logit(append_row(0.0, beta));
    }
}

Thanks for your time. I will be attentive to your answer.

That model does not compile because you cannot define pred_index in both data and transformed data.
The data and transformed data blocks should look like this:

data {
    int<lower=0> Npairs;
    int Nplayers; // number of players 515
    int Nzones; // number of zones 12
    int Nloc; // number of localias 2
    int Nres; // number of results 3
    int Ntimes; // number of times 7

    int<lower=1,upper=Nplayers> player_id[Npairs];
    // pred_index replaces (zone_id, loc_id, res_id, time_id)
    int<lower=1,upper=Nzones*Nloc*Nres*Ntimes> pred_index[Npairs];

    int<lower=0> actions[Npairs,3];
}
transformed data {
    int Nall = Nzones*Nloc*Nres*Ntimes; // 12*2*3*7 = 504
    int Kz = Nloc*Nres*Ntimes; // 2*3*7 = 42
    int Kl = Nres*Ntimes; // 3*7 = 21
    int Kr = Ntimes; // 7
}

Also, it might just be an outdated comment, but Nobs is not 513769; or rather, switching to multinomial doesn’t help if the data format remains the same.
You need to preprocess the data outside of Stan, for example like this (Python code where action is int[Nobs] as in your previous model):

import collections
unique_pairs = collections.defaultdict(lambda : [0,0,0])

Kz = Nloc*Nres*Ntimes
Kl = Nres*Ntimes
Kr = Ntimes
for n in range(Nobs):
    pidx = Kz*(zone_id[n]-1) + Kl*(loc_id[n]-1) + Kr*(res_id[n]-1) + time_id[n]
    counts = unique_pairs[(player_id[n], pidx)]
    # action[n] is 1, 2, or 3; list `counts` is zero-indexed in python
    counts[action[n]-1] += 1

Npairs = len(unique_pairs)
player_id = []
pred_index = []
actions = []
for (player, predictor), counts in unique_pairs.items():
    player_id.append(player)
    pred_index.append(predictor)
    actions.append(counts)

Thanks! I already fixed the model according to your corrections.
However, the second part of your post is not clear to me. I’m going to show you the structure of my data, to see if you can explain that part to me again, knowing this:

The dataset I’m using contains 513769 obs, where each row shows an event that occurred in a soccer match, for example a shot (coded 1), a pass (coded 2), or a dribbling (coded 3), together with all the information associated with that event.

The dataset has six columns (all of them already categorized with consecutive integers from 1 onwards).

  1. "event_id": The first column indicates what type of event it is (shot, pass or dribbling)
  2. "new_id": The id of the player associated with the event. (from 1 to 515)
  3. "cat_zone": The area of the field where the event occurs. (from 1 to 12)
  4. "timeFrame": The time segment. (from 1 to 7)
  5. "localia": If the team associated with the event is local or not. (binary)
  6. "cat_res": The current result associated with the team (winning # 1, losing # 2 or drawing # 3)

So, given the above, could you explain to me again how to modify my data to use it as input to this model?
Because for my previous model I used this same structure.

I appreciate your time and interest in helping me again @nhuurre

Right, that’s what I thought when writing the Python code above.

The difference between categorical and multinomial is that

model {
    // three events
    1 ~ categorical_logit(beta);
    1 ~ categorical_logit(beta);
    2 ~ categorical_logit(beta);
}

is the same as

model {
    // two 1s, one 2, no 3s
    {2,1,0} ~ multinomial_logit(beta);
}

The predictor beta depends only on columns 2-6. Structure your dataset so that each row corresponds to a unique beta, and instead of event_id you have three columns that count the number of shots, passes, and dribblings for that beta. The number of rows to loop over in this format is at most 515 × 12 × 7 × 2 × 3 = 259,560, which is about half as many as in the per-event format.
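
If your dataset is in a pandas DataFrame with the columns you listed, the compression could look something like this (a sketch of mine, assuming those exact column names; the toy frame stands in for the real data):

import pandas as pd

# toy stand-in for the real 513769-row dataset
df = pd.DataFrame({
    "event_id":  [2, 2, 1, 3],
    "new_id":    [7, 7, 7, 9],
    "cat_zone":  [4, 4, 4, 1],
    "timeFrame": [2, 2, 2, 5],
    "localia":   [1, 1, 1, 2],
    "cat_res":   [3, 3, 3, 1],
})

grouped = (
    df.groupby(["new_id", "cat_zone", "timeFrame", "localia", "cat_res"])
      ["event_id"].value_counts()
      .unstack(fill_value=0)                     # one column per event type
      .reindex(columns=[1, 2, 3], fill_value=0)  # counts of shots, passes, dribblings
      .reset_index()
)
Npairs = len(grouped)  # one row per unique predictor combination (here 2)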


Perfect, now I understand you. Only one doubt remains:

  1. Is Npairs the number of unique combinations of (new_id, cat_zone, timeFrame, localia, cat_res) in my data? (100,659 combinations)
  2. Are Nplayers, Nzones, Nloc, Nres, Ntimes the total number of players, zones, localias, results and times?
  3. Is player_id[Npairs] a vector of length Npairs whose values are the new_id of each of the Npairs combinations?
  4. What is the input for pred_index?

Thanks!

  1. Yes, it’s the number of rows, i.e. unique beta values. “Pairs” wasn’t the best name for it; I was thinking of it as a pair of (player_id, other predictors).
  2. Yes.
  3. Yes.
  4. It’s calculated from the other predictors (all except new_id):
pred_index[n] = ( Kz*(zone_id[n]-1) + Kl*(loc_id[n]-1)
                       + Kr*(res_id[n]-1) + time_id[n] );
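
In numpy form that is one vectorized expression (my sketch, assuming 1-based integer arrays taken from your columns; the toy arrays are hypothetical):

import numpy as np

Nloc, Nres, Ntimes = 2, 3, 7
Kz, Kl, Kr = Nloc*Nres*Ntimes, Nres*Ntimes, Ntimes  # 42, 21, 7

# toy 1-based code arrays standing in for your columns
cat_zone  = np.array([4, 1])
localia   = np.array([1, 2])
cat_res   = np.array([3, 1])
timeFrame = np.array([2, 5])

# computed for all rows at once
pred_index = (Kz*(cat_zone - 1) + Kl*(localia - 1)
              + Kr*(cat_res - 1) + timeFrame)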

Perfect, thank you very much @nhuurre!!

Now, when I run the model it throws an error about the last for loop in the model, because Nobs isn’t defined:

model {
    matrix[2,Nall] beta_all;
    for (z in 1:Nzones) {
        vector[2] bz = beta_zone[:,z];
        for (l in 1:Nloc) {
            vector[2] bl = bz + beta_loc[:,l];
            for (r in 1:Nres) {
vector[2] br = bl + beta_res[:,r];
                for (t in 1:Ntimes) {
                    beta_all[:,(z-1)*Kz+(l-1)*Kl+(r-1)*Kr+t] = br + beta_time[:,t];
                }
            }
        }
    }
    to_vector(beta_player) ~ std_normal();
    to_vector(beta_zone) ~ std_normal();
    to_vector(beta_loc) ~ std_normal();
    to_vector(beta_res) ~ std_normal();
    to_vector(beta_time) ~ std_normal();
    
    for (i in 1:Nobs) {
        vector[2] beta = beta_player[:,player_id[i]] + beta_all[:,pred_index[i]];
actions[i] ~ multinomial_logit(append_row(0.0, beta));
    }
}

Should it be Npairs, or do I have to define Nobs as the length of my database?
Because it isn’t defined in the data block of the model:

data {
    int<lower=0> Npairs;
    int Nplayers; // number of players 515
    int Nzones; // number of zones 12
    int Ntimes; // number of times 7
    int Nres; // number of results 3
    int Nloc; // number of localias 2
    
    int<lower=1,upper=Nplayers> player_id[Npairs];
    // pred_index replaces (zone_id, loc_id, res_id, time_id)
    int<lower=1,upper=Nzones*Nloc*Nres*Ntimes> pred_index[Npairs];

    int<lower=0> actions[Npairs,3];
}

Yes, it should be Npairs: for (i in 1:Npairs) { actions[i] ~ ... } so the loop visits each row once.


Do you know about this error?

ValueError: Failed to parse Stan model ‘choices_reg_6b98fcea64a7106541ade673b6d5207e’. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Probability function must end in _lpdf or _lpmf. Found distribution family = multinomial_logit with no corresponding probability function multinomial_logit_lpdf, multinomial_logit_lpmf, or multinomial_logit_log

You’re using PyStan, right? I think multinomial_logit was added recently and PyStan hasn’t caught up yet. You could install CmdStanPy, which should have the most recent Stan.
Alternatively, you can write it with softmax (it’s the same function):

actions[i] ~ multinomial(softmax(append_row(0.0, beta)));

Thanks!

I run this:

choice_fit = choices_reg.sampling(data=datos,
                          iter=500, chains=3,
                          warmup=200, n_jobs=-1,
                          seed=42)

And I get this error about the range of the index. What is wrong?

RuntimeError: Exception: : accessing element out of range. index 100660 out of range; expecting index to be between 1 and 100659; index position = 1; pred_index (in ‘unknown file name’ at line 52)

Line 52 of my model is this:

50    for (i in 1:Nobs) {
51       vector[2] beta = beta_player[:,player_id[i]] + beta_all[:,pred_index[i]];
52       actions[i] ~  multinomial(softmax(append_row(0.0, beta)));
    }
}

Nobs = 513,769
Npairs = 100,659

I know how much you've helped me, but I feel like we're nearing the end! :)