Inconsistency when measuring computational efficiency

I would like to compare two models’ computational efficiency by comparing their computation time (get_elapsed_time, focusing on the sample time).
I ran 50 runs; each run fit the 2 models with 4 chains and 5000 iterations per chain to ensure both models converged, then I calculated the mean and sd of the sampling times across these 50 runs for the 2 models.
However, when examining these sampling times, I found that certain runs take significantly longer to sample than the rest.

          [,1]     [,2]     [,3]     [,4]     [,5]
 [1,]  2310.47  2318.04  2317.43  2308.44  2316.98
 [2,]  2427.79  2437.33  2423.77  2446.28  2426.18
 [3,] 11961.75 11924.75 12056.68 12089.85 11985.91
 [4,]  8412.72  8427.76  8416.03  8428.85  8398.48
 [5,] 15148.02 15131.96 15211.89 15012.20 15183.94
 [6,]  8624.61  8604.96  8688.13  8565.82  8662.20
 [7,]  2182.10  2178.02  2192.84  2188.47  2208.27
 [8,]  1754.30  1759.40  1761.95  1760.14  1769.97
 [9,] 13985.97 13995.58 14081.56 13995.71 14117.02
[10,]  8479.73  8480.08  8496.28  8512.52  8463.10

The table above records the sampling times for one of the models under consideration; some rows (e.g. rows 3, 5, and 9) are significantly larger than the rest of the sampling times. There is also large variation between sampling times. I am wondering what this inconsistency in sampling time tells me? I checked that the model converged in each run.
The following are the first few rows from one of the runs for one of the models under consideration:

Inference for Stan model: QUT_2comp_5quan_overall.
4 chains, each with iter=5000; warmup=2500; thin=1; 
post-warmup draws per chain=2500, total post-warmup draws=10000.

                                mean se_mean   sd      2.5%       25%       50%       75%     97.5% n_eff Rhat
beta_eta_raw[1]                 0.00    0.01 0.99     -1.93     -0.66     -0.01      0.67      1.94 18368 1.00
beta_eta_raw[2]                -0.32    0.01 0.51     -1.16     -0.66     -0.38     -0.02      0.84  1851 1.00
beta_eta_raw[3]                -0.38    0.01 0.36     -1.13     -0.61     -0.37     -0.14      0.28  2170 1.00
beta_eta_raw[4]                 0.27    0.00 0.29     -0.23      0.07      0.24      0.43      0.93  5157 1.00
beta_eta_raw[5]                 1.63    0.00 0.37      1.01      1.37      1.60      1.86      2.45  5684 1.00
beta_eta_raw[6]                -2.14    0.01 0.55     -3.26     -2.51     -2.13     -1.77     -1.11  1872 1.00
tau                             2.02    0.01 0.53      1.12      1.63      1.97      2.35      3.13  1645 1.00

Below is my code to fit the models. Only 10 runs are shown because each run is quite long, so I divided the 50 runs into 5 parts, each fitting the model 10 times and recording the time.

data_set = list(N=N, Ncol_complete = length(hour_index), Ncol=ncol(summary_5quan), index1=index1, index2 = index2, hour_index = hour_index, index_length = (length(index1)),
                index_11=index_11, index_12=index_12, index_18=index_18, index_25=index_25, index_26=index_26, 
                kk=kk,K=56, D=44, y = five_quan_complete, K_TD =K_TD, K=ncol(K_TD), num_basis = num_basis, B_annual = B_annual, eta_bspline = eta_bspline)

compiled_model <- stan_model("QUT_2comp_5quan_overall.stan")

## Collect run times and effective sample sizes over runs = 10 fits
runs = 10
sym_2comp_n12_time = sym_2comp_neff = vector("list", length = runs)
sym_2comp_sample_time = rep(0, runs)

for(i in 1:runs){
  sym_2comp_n12 <- sampling(compiled_model, data = data_set, verbose = FALSE, iter = 5000)
  sym_2comp_n12_time[[i]] = get_elapsed_time(sym_2comp_n12)        # warmup/sample time per chain
  sym_2comp_neff[[i]] = summary(sym_2comp_n12)$summary[, 'n_eff']  # effective sample sizes
  sym_2comp_sample_time[i] = sum(sym_2comp_n12_time[[i]][, 2])     # total post-warmup sampling time
}

Thanks in advance for any insights.


Sorry, out of time to reply, maybe @maxbiostat can answer?

First, it’s generally a bad idea to simply look at time per iteration; you want to look at time per effective sample, which becomes complicated for models with many parameters and when comparing models with different parameterizations.
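As a rough sketch (assuming an rstan fit object called `fit`), time per effective sample can be computed like this; the worst-mixing parameter is often the fairest single-number summary:

```r
## Sketch: compare models by time per effective sample, not raw time.
## `fit` is a hypothetical rstan fit object.
sample_time <- sum(get_elapsed_time(fit)[, "sample"])  # total post-warmup seconds over chains
n_eff <- summary(fit)$summary[, "n_eff"]               # effective sample size per parameter
time_per_ess <- sample_time / n_eff                    # seconds per effective draw
max(time_per_ess)                                      # worst case across parameters
```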

Second, if you have some chains that take considerably longer than others this is usually[^1] a signal that something is fundamentally wrong in your model and you should look at how those chains look compared to those that sample faster to try to diagnose what’s going on. Are you keeping an eye out for divergences?

[^1]: One exception is in a distributed computing environment where a misconfigured scheduler might cause a chain to run at low priority. This shouldn’t happen on a personal computer.
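A quick way to check for divergences in each run (sketch, assuming an rstan fit object called `fit`):

```r
## Sketch: count divergent transitions per chain before trusting timings.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sapply(sp, function(x) sum(x[, "divergent__"]))  # divergences per chain
## or use the built-in check:
check_divergences(fit)
```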


I wouldn’t be quite as assertive but I agree that huge differences in run time between chains suggest different adaptation and sampling regimes, which point to different target geometries and indeed indicate potential problems with the model.

Apart from this I think @mike-lawrence’s answer is spot on as regards computational comparisons. People like @wds15 and @bbbales2 would possibly be more able to give good tips on how to compare computational efficiency.

Lastly, I’d like to give a public shoutout to @mike-lawrence, @emiruz and @nhuurre, who have been mopping up a bunch of questions on the forum lately. Cheers!


You can learn possible reasons for the variability in computation time by looking at the number of leapfrog steps per iteration, which you can obtain from `get_sampler_params()`.

- If the mean number of leapfrogs per iteration varies a lot over runs, then the adaptation has high variability, which may indicate problems in adaptation; longer adaptation can help.
- If the number of leapfrogs per iteration varies a lot over iterations, the posterior is likely to have a funnel shape or thick tails (i.e. the curvature is highly variable).
- If the number of leapfrogs per iteration doesn’t vary much after warmup, but is very high for some chains during warmup, then the sampling probably works fine, but some of the random initial values are far from the typical set or the scale of the posterior is far from unit scale.
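A sketch of extracting the leapfrog counts (assuming an rstan fit object called `fit`):

```r
## Sketch: leapfrog steps per iteration, for diagnosing timing variability.
sp_all  <- get_sampler_params(fit, inc_warmup = TRUE)   # warmup + sampling iterations
sp_post <- get_sampler_params(fit, inc_warmup = FALSE)  # post-warmup only
leapfrogs_all  <- sapply(sp_all,  function(x) x[, "n_leapfrog__"])
leapfrogs_post <- sapply(sp_post, function(x) x[, "n_leapfrog__"])
## Per-chain means: large differences across chains/runs point to adaptation
## issues; large variation across iterations suggests funnels or thick tails.
colMeans(leapfrogs_post)
```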


Thank you all for your replies and suggestions. I found a post

which also suggests comparing time per ESS across parameters. In cases with many parameters, the authors suggest using either the mean or the minimum (over parameters) for comparison. Following their suggestions, I now run a single long chain for both models (ensuring both have converged by checking Rhat, diagnostic plots, etc.) and then compute both the mean-based and minimum-based comparisons, which give me consistent results.