I’m running a Stan model in PyStan, and I’m noticing significant delays between the time the final chain completes sampling and the time `.sampling()` returns. The delay seems to be proportional to the number of iterations I run and to the number of parameters/generated quantities I have, so I’m assuming it’s data-volume related. Does anyone have suggestions for reducing the post-sampling time, or for reducing the overall memory footprint?
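For what it’s worth, a back-of-envelope calculation supports the data-volume theory. Using the approximate sizes from the model comments below, and assuming 4 chains × 1000 post-warmup draws (that draw count is my guess, not something stated above):

```python
# Rough estimate of the sampler output volume. Sizes come from the model
# comments; the draw count (4 chains x 1000 post-warmup iterations) is an
# assumption -- adjust to your actual iter/warmup/chains settings.
NG = 20_000    # games: length of log_lik and win_hat
NPT = 2_500    # player/tournaments: length of skill and raw_skill
NP = 300       # players: length of player_skill and raw_player_skill
NM = 220       # matchups: length of mu and muv
draws = 4 * 1_000

per_draw = 2 * NG + 2 * NPT + 2 * NP + 2 * NM + 3  # +3 scalar parameters
gq_only = 2 * NG                                   # log_lik and win_hat alone

total_gb = per_draw * draws * 8 / 1e9              # float64 storage
gq_gb = gq_only * draws * 8 / 1e9

print(f"total ~{total_gb:.2f} GB, generated quantities ~{gq_gb:.2f} GB")
```

Under those assumptions the two `NG`-length generated quantities account for close to 90% of the bytes PyStan has to copy back and summarize after sampling.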

For reference, here’s my model:

```stan
data {
  int<lower=0> NPT;                             // Number of player/tournaments: ~2500
  int<lower=0> NG;                              // Number of games: ~20000
  int<lower=0> NM;                              // Number of matchups: 220
  int<lower=0> NP;                              // Number of players: ~300
  int<lower=0, upper=NPT> prev_tournament[NPT]; // Previous tournament for player/tournament
  int<lower=1, upper=NP> tp[NPT];               // Player n game
  int<lower=0, upper=1> win[NG];                // Did player 1 win game
  int<lower=1, upper=NPT> pt1[NG];              // Player/tournament 1 in game
  int<lower=1, upper=NPT> pt2[NG];              // Player/tournament 2 in game
  int<lower=1, upper=NM> mup[NG];               // Matchup in game
  vector<lower=0, upper=1>[NG] non_mirror;      // Is this a mirror matchup: 0 = mirror
}
parameters {
  vector[NPT] raw_skill;                        // Skill change before player/tournament
  vector[NM] mu;                                // Matchup value
  vector<lower=0>[NM] muv;                      // Matchup skill multiplier
  real baseline_skill;                          // Average baseline player skill
  vector[NP] raw_player_skill;                  // Per-player skill offset (non-centered)
  real tournament_variance;                     // How much variance is there around a player skill per tournament
  real player_variance;
}
transformed parameters {
  vector[NP] player_skill;
  vector[NPT] skill;
  player_skill = baseline_skill + raw_player_skill * player_variance;
  skill = player_skill[tp] + raw_skill * tournament_variance;
}
model {
  baseline_skill ~ std_normal();
  raw_player_skill ~ std_normal();
  raw_skill ~ std_normal();
  mu ~ normal(0, 0.5);
  muv ~ std_normal();
  win ~ bernoulli_logit(muv[mup] .* (skill[pt1] - skill[pt2]) + non_mirror .* mu[mup]);
}
generated quantities {
  vector[NG] log_lik;
  vector[NG] win_hat;
  for (n in 1:NG) {
    log_lik[n] = bernoulli_logit_lpmf(win[n] | muv[mup[n]] * (skill[pt1[n]] - skill[pt2[n]]) + non_mirror[n] * mu[mup[n]]);
    win_hat[n] = bernoulli_logit_rng(muv[mup[n]] * (skill[pt1[n]] - skill[pt2[n]]) + non_mirror[n] * mu[mup[n]]);
  }
}
```

I’m computing `log_lik` and `win_hat` in `generated quantities` in order to use `arviz.loo` to estimate the predictive power of this model (and of several simpler ones). Leaving the generated quantities out seems to speed up the process significantly.
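If the generated quantities are the bottleneck, one workaround (a sketch, not something I’ve benchmarked against this model) is to drop them from the Stan program and rebuild `log_lik` in Python from the extracted draws, so Stan never has to store or summarize the `NG`-length vectors. Here’s a minimal version with tiny synthetic arrays standing in for `fit.extract()` output:

```python
import numpy as np

def bernoulli_logit_lpmf(y, logits):
    # Stable log Bernoulli pmf on the logit scale, matching Stan's
    # bernoulli_logit_lpmf: log sigmoid(z) = -log(1 + exp(-z)).
    sign = np.where(y == 1, 1.0, -1.0)
    return -np.logaddexp(0.0, -sign * logits)

def log_lik_matrix(skill, mu, muv, pt1, pt2, mup, non_mirror, win):
    # skill: (draws, NPT); mu, muv: (draws, NM). The index arrays must be
    # 0-based (subtract 1 from the 1-based Stan data before calling).
    logits = (muv[:, mup] * (skill[:, pt1] - skill[:, pt2])
              + non_mirror * mu[:, mup])
    return bernoulli_logit_lpmf(win, logits)

# Tiny synthetic draws standing in for fit.extract() output:
rng = np.random.default_rng(1)
n_draws, NG, NPT, NM = 8, 5, 4, 3
skill = rng.normal(size=(n_draws, NPT))
mu = rng.normal(size=(n_draws, NM))
muv = np.abs(rng.normal(size=(n_draws, NM)))
pt1 = np.array([0, 1, 2, 3, 0])
pt2 = np.array([1, 2, 3, 0, 2])
mup = np.array([0, 1, 2, 0, 1])
non_mirror = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
win = np.array([1, 0, 1, 1, 0])

log_lik = log_lik_matrix(skill, mu, muv, pt1, pt2, mup, non_mirror, win)
print(log_lik.shape)  # one row per draw, one column per game
```

`win_hat` can be rebuilt the same way by pushing the logits through `scipy.special.expit` and drawing Bernoulli samples, and the resulting `(draws, NG)` matrix can be handed to ArviZ (e.g. via `az.from_dict`) for `loo`.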

When I interrupted the process in the middle of the post-sampling step, it was stopped inside `pystan.diagnostics.check_hmc_diagnostics`. I’m going to try running without that, but separately: is there anything to be done to speed up the HMC diagnostics? Do they run on the generated quantities, and can I tell them to leave those out? I’m already using `pars` to trim the output down to only the parameters I’m interested in, rather than the `raw_` parameters.