Doesn't work: Passes Soccer Model (never ends)


I’m trying to create a model to estimate the probability that a pass in a soccer game is accurate. My data records who the passer is (player_id), which zone the passer is in (cat_zone_i), which zone the receiver is in (cat_zone_f), which time period the pass occurs in (timeFrame), the current score state of the passer’s team (cat_res), and whether the passer’s team is home or away (localia).

I think the best option is to use a bernoulli_logit likelihood.
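As a quick sanity check on that choice: bernoulli_logit maps a real-valued linear predictor to a probability through the inverse logit. A minimal NumPy sketch (the coefficient values here are purely illustrative, not from the fitted model):

```python
import numpy as np

def inv_logit(eta):
    # inverse logit (sigmoid): maps any real linear predictor into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

# illustrative linear predictor: intercept + a player effect + a zone effect
eta = 0.5 + 0.3 + (-0.2)
p_accurate = inv_logit(eta)  # probability that the pass is accurate
```

This is exactly what `pase ~ bernoulli_logit(eta)` does internally, while staying numerically stable for extreme values of `eta`.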

My model is here:

passes_model = """
data {
    int<lower=0> N; // number of observations (328530 with only players with >10 passes; 328657 with all passes)
    int players; // number of players (488)
    int zones_i; // number of origin field zones (8)
    int zones_f; // number of destination field zones (8)
    int time; // number of time frames (7)
    int res; // types of results (winning, losing, tying)
    int loc; // home/away (localia)
    int<lower=1,upper=players> player_id[N];
    int<lower=1,upper=zones_i> cat_zone_i[N];
    int<lower=1,upper=zones_f> cat_zone_f[N];
    int<lower=1,upper=time> time_frame[N];
    int<lower=1,upper=res> cat_res[N];
    int<lower=1,upper=loc> localia[N];
    int<lower=0,upper=1> pase[N]; // dependent variable: pass accurate (1) or not (0)
}
parameters {
    real alpha; // intercept
    vector[players] beta_player; // coefficient for each player
    vector[zones_i] beta_zones_i; // coefficient for each origin zone
    vector[zones_f] beta_zones_f; // coefficient for each destination zone
    vector[time] beta_time; // coefficient for each time frame
    vector[res] beta_res; // coefficient for each result
    vector[loc] beta_loc; // coefficient for each type of localia
}
model {
    // priors
    alpha ~ normal(0, 1);
    beta_player ~ normal(0, 1);
    beta_zones_i ~ normal(0, 1);
    beta_zones_f ~ normal(0, 1);
    beta_time ~ normal(0, 1);
    beta_res ~ normal(0, 1);
    beta_loc ~ normal(0, 1);
    // likelihood
    pase ~ bernoulli_logit(alpha + beta_player[player_id] + beta_zones_i[cat_zone_i] +
        beta_zones_f[cat_zone_f] + beta_time[time_frame] + beta_res[cat_res] + beta_loc[localia]);
}
"""

Then I compiled it:

passes_reg = pystan.model.StanModel(model_code=passes_model)

The data dictionary is created like this:

N = len(df_passes_ps.new_id)
players = len(df_passes_ps.new_id.unique())
zones_i = len(df_passes_ps.cat_zone_i.unique())
zones_f = len(df_passes_ps.cat_zone_f.unique())
time = len(df_passes_ps.timeFrame.unique())
res = len(df_passes_ps.cat_res.unique())
loc = len(df_passes_ps.localia.unique())

player_id = df_passes_ps.new_id
cat_zone_i = df_passes_ps.cat_zone_i
cat_zone_f = df_passes_ps.cat_zone_f
time_frame = df_passes_ps.timeFrame
cat_res = df_passes_ps.cat_res
localia = df_passes_ps.localia

pase = df_passes_ps.accurate

datos = {'N': N, 'players': players, 'zones_i': zones_i, 'zones_f': zones_f, 'time': time, 'res': res, 'loc': loc, 
        'player_id': player_id, 'cat_zone_i': cat_zone_i, 'cat_zone_f': cat_zone_f, 'time_frame': time_frame, 'cat_res': cat_res, 'localia': localia, 
        'pase': pase}
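Before sampling, it’s worth verifying that every index column is 1-based and within the declared bounds, and that the outcome is strictly 0/1 — Stan rejects out-of-range indices at runtime, and 0-based indices from pandas are a common cause of an immediate failure or a confusing stall. A sketch of such a check (assumes the `datos` dictionary built above):

```python
import numpy as np

def check_stan_data(datos):
    """Sanity-check index bounds and the binary outcome before calling Stan."""
    # maps each index array to the name of its declared size in the data block
    checks = {
        'player_id': 'players', 'cat_zone_i': 'zones_i', 'cat_zone_f': 'zones_f',
        'time_frame': 'time', 'cat_res': 'res', 'localia': 'loc',
    }
    for col, size in checks.items():
        v = np.asarray(datos[col])
        assert v.min() >= 1, f"{col} must be 1-based for Stan (min is {v.min()})"
        assert v.max() <= datos[size], f"{col} exceeds declared size {size}"
    y = np.asarray(datos['pase'])
    assert set(np.unique(y)) <= {0, 1}, "pase must be 0/1 for bernoulli_logit"
```

Note in particular that `new_id` must run from 1 to `players` with no gaps; if the raw player IDs are not contiguous, remap them first (e.g. with `pd.factorize` plus 1).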

But when I run this, it never ends:

passes_fit = passes_reg.sampling(data=datos,
                          iter=1000, chains=4,
                          warmup=500, n_jobs=-1)

I would appreciate it if you could tell me what my problem is.

Well you have 330k observations, so that’ll take some time even if the model is sampling efficiently.

You should start with a smaller dataset or a smaller/simpler model while you develop your approach. Maybe just use the games from one day or something.
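One way to do that is a random subsample (a sketch, assuming `df_passes_ps` is the pandas DataFrame above — remember to recompute the per-factor sizes like `players` on the subset so the declared bounds in the data block still match):

```python
import pandas as pd

def subsample_passes(df, n=5000, seed=42):
    """Draw a small random subset of passes for fast model development."""
    return df.sample(n=min(n, len(df)), random_state=seed).reset_index(drop=True)

# e.g.: df_small = subsample_passes(df_passes_ps), then rebuild `datos`
# from df_small exactly as before, and re-run the sampling call on that.
```

With ~5k rows you can iterate on the model in minutes instead of hours, and scale up only once it samples cleanly.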

It could be that there are problems with the model that you can fix, and that fixing them will make the 330k-observation model run at a suitable speed. (For example, with a free intercept plus a full coefficient vector for every categorical factor, the overall location is split across several parameters and only weakly pinned down by the priors, which can slow sampling.)

It could also be that the 330k observation model is just too expensive, in which case you’ll need to make extra assumptions to simplify your model to get it to run (so the experiments with the simpler models will also be useful here).