Long processing time after sampling

I’m running a Stan model in PyStan, and I’m noticing significant delays between the time the final chain finishes sampling and the time `.sampling()` returns. The delay seems to be proportional to the number of iterations and to the number of parameters/generated quantities, so I’m assuming it’s related to data volume. Does anyone have suggestions for reducing this post-sampling time, or for reducing the overall memory footprint?

For reference, here’s my model:

data {
    int<lower=0> NPT; // Number of player/tournaments: ~2500
    int<lower=0> NG; // Number of games: ~20000
    int<lower=0> NM; // Number of matchups: 220
    int<lower=0> NP; // Number of players: ~300

    int<lower=0, upper=NPT> prev_tournament[NPT]; // Previous tournament for player/tournament

    int<lower=1, upper=NP> tp[NPT]; // Player for each player/tournament

    int<lower=0, upper=1> win[NG]; // Did player 1 win game
    int<lower=1, upper=NPT> pt1[NG]; // Player/tournament 1 in game
    int<lower=1, upper=NPT> pt2[NG]; // Player/tournament 2 in game
    int<lower=1, upper=NM> mup[NG]; // Matchup in game
    vector<lower=0, upper=1>[NG] non_mirror; // Is this a mirror matchup: 0 = mirror
}

parameters {
    vector[NPT] raw_skill; // Skill change before player/tournament
    vector[NM] mu; // Matchup value
    vector<lower=0>[NM] muv; // Matchup skill multiplier
    real baseline_skill; // Average baseline player skill
    vector[NP] raw_player_skill; // Per-player skill offset (non-centered)
    real<lower=0> tournament_variance; // How much variance is there around a player skill per tournament
    real<lower=0> player_variance; // How much variance is there between player skills
}

transformed parameters {
    vector[NP] player_skill;
    vector[NPT] skill;

    player_skill = baseline_skill + raw_player_skill * player_variance;
    skill = player_skill[tp] + raw_skill * tournament_variance;
}

model {
    baseline_skill ~ std_normal();
    raw_player_skill ~ std_normal();
    raw_skill ~ std_normal();
    mu ~ normal(0, 0.5);
    muv ~ std_normal();

    win ~ bernoulli_logit(muv[mup] .* (skill[pt1] - skill[pt2]) + non_mirror .* mu[mup]);
}

generated quantities {
    vector[NG] log_lik;
    vector[NG] win_hat;

    for (n in 1:NG) {
        log_lik[n] = bernoulli_logit_lpmf(win[n] | muv[mup[n]] * (skill[pt1[n]] - skill[pt2[n]]) + non_mirror[n] * mu[mup[n]]);
        win_hat[n] = bernoulli_logit_rng(muv[mup[n]] * (skill[pt1[n]] - skill[pt2[n]]) + non_mirror[n] * mu[mup[n]]);
    }
}
I’m computing log_lik and win_hat in generated quantities in order to use arviz.loo to estimate the predictive power of this (and several simpler) models. Leaving the generated quantities out seems to speed up the process significantly.
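If the volume of generated quantities turns out to be the bottleneck, one workaround (a sketch, not something I’ve benchmarked against this exact model) is to drop log_lik from the Stan program entirely and recompute it in Python from the extracted draws, since it’s a deterministic function of the parameters and data. The function below assumes you’ve pulled the draws into NumPy arrays (e.g. via fit.extract()) and shifted Stan’s 1-based indices to 0-based:

```python
import numpy as np

def bernoulli_logit_lpmf(y, logit_p):
    # Log Bernoulli pmf parameterized by log-odds, computed stably:
    # log p(y | eta) = y * eta - log(1 + exp(eta))
    return y * logit_p - np.logaddexp(0.0, logit_p)

def pointwise_log_lik(skill, muv, mu, win, pt1, pt2, mup, non_mirror):
    """Recompute the generated-quantities log_lik from posterior draws.

    skill: (draws, NPT); muv, mu: (draws, NM).
    win, pt1, pt2, mup, non_mirror: the data arrays, with pt1/pt2/mup
    already converted to 0-based indices.
    Returns an array of shape (draws, NG).
    """
    eta = (muv[:, mup] * (skill[:, pt1] - skill[:, pt2])
           + non_mirror[None, :] * mu[:, mup])
    return bernoulli_logit_lpmf(win[None, :], eta)
```

The resulting (draws, NG) matrix is exactly what arviz.loo needs, but it never has to travel through the fit object or the diagnostics.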

Interrupting the process in the middle of the post-sampling step shows it stopped inside pystan.diagnostics.check_hmc_diagnostics. I’m going to try running without that, but separately: is there anything that can be done to speed up the HMC diagnostics? Do they run on the generated quantities? Can I tell them to leave those out? I’m already using pars to trim down to only the parameters I’m interested in, rather than the raw_ parameters.

To follow up: disabling diagnostics definitely sped up the process significantly. So now my question is: what’s the best way to run just the subset of diagnostics I want, or to speed up the diagnostics that run by default?

Custom model diagnostics are possible using CmdStan. I don’t have much experience with or technical details about that feature specifically, but I’d be interested if you share your progress.

Looking at the PyStan source, it seems pretty feasible to make it possible to run the diagnostics on a subset of parameters. However, from a statistical standpoint, I’m not sure which things it’s reasonable to run them on or not. Is it meaningful to run them on the generated quantities (if those are in fact included in fit.sim['fnames_oi'])? Is it ok to only run the diagnostics on the same subset of parameters that pars covers?
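For intuition on what subsetting would mean: n_eff and Rhat are computed independently per flat parameter from the stored draws, so restricting them to a subset (generated quantities included) is mechanically just selecting columns. Here is a rough split-Rhat sketch for a single parameter, using the standard formula rather than PyStan’s exact implementation:

```python
import numpy as np

def split_rhat(draws):
    """Split-Rhat for one parameter; draws has shape (chains, iters).

    Each chain is split in half, then the classic between/within
    variance ratio is computed over the 2 * chains half-chains.
    """
    chains, iters = draws.shape
    half = iters // 2
    # Row-major reshape turns each chain into its first and second half.
    splits = draws[:, :2 * half].reshape(chains * 2, half)
    w = splits.var(axis=1, ddof=1).mean()       # within-chain variance
    b = half * splits.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (half - 1) / half * w + b / half  # pooled variance estimate
    return np.sqrt(var_hat / w)
```

Since this is purely per-column arithmetic on the sample array, running it on only the pars subset (or skipping generated quantities) shouldn’t change the result for the parameters it does cover.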

I’m guessing yes, if you want to prevent subtle bugs. In other cases, where it’s a question of optimizing matrix operations, it’s usually not an arbitrary choice, and the gains may be only marginal.

I think an interesting question is why the diagnostics options are limited to gradients of log_prob.

I made one PR to speed up the diagnostics, and there is now another PR that proposes limiting the automatic diagnostics (n_eff and Rhat) to a maximum of 1000 flat parameters. Larger models would need to run those manually.

The current PR also enables running the checks for only a subset of parameters (including individual elements, e.g. X[3,4]).

Oh, nice, I’ll have to install PyStan from Github and check out the speed improvements.