How to determine the combination of adapt_delta / max_treedepth / metric=dense_e vs. diag_e / term_buffer / window / warmup / iter?

EDIT: Those interested in this topic may also want to see my findings in a related post (written after two months of further exploration).

=========================================
Lately, I have been learning Stan by building my first model, which is modified from a state-space model with an AR(1) latent state process. After a long process (of trial and error, reading related discussions, …), I now have a model that seems to be running all right on simulated data at a small scale (e.g., 20 firms with 8x4 quarters). There appear to be no divergent transitions or other warnings across several runs on newly simulated data.

My questions:

  1. How can I optimize the combination of iter / warmup / adapt_delta / max_treedepth before scaling up (e.g., 200 firms with 8x4 quarters)?

    • Each run at the large scale can take many hours, so it would be best to have a principled way to choose these options that minimizes run time without introducing new problems.
  2. Does a sharp difference in run time between the fastest and slowest chains (e.g., a factor of 4 or even more than 15) in a small-scale run suggest hidden issues that are likely to worsen in large-scale runs?

    • Again, each large-scale run takes a long time, so it would be best to identify and fix any hidden issues beforehand.
  3. The estimates of sd_gamma, sd_omega, and/or beta in my model (outlined below) still tend to be biased downward (even when there are no divergent transitions). Should I expect this issue (if it really is an issue) to largely disappear with one or more of the following:

    • substantially increasing the number of sampling iterations (i.e., the iter - warmup difference)
    • scaling up to many firms (so that the estimates of the pooled parameters might become much more accurate)

(Updated on 4 May 2019)

Currently, I set n_chains = 8, iter = 3000, warmup = 1000, adapt_delta = 0.9999, max_treedepth = 13.
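For reference, here is a minimal sketch of how these options map onto an rstan call (assuming rstan; ssm_model and stan_data are placeholders, not names from my actual scripts):

library(rstan)

# Placeholders: ssm_model is the compiled SSM2.2.stan model, stan_data its data list
fit <- sampling(ssm_model, data = stan_data,
                chains = 8, iter = 3000, warmup = 1000,
                control = list(adapt_delta = 0.9999, max_treedepth = 13))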

My model is outlined below (SSM2.2.stan (3.0 KB)):

state process: y_t = \alpha + \beta y_{t-1} + \epsilon_t, with \epsilon_t \sim N(0, sd_y)
observation process: r_t = y_t + D_t, with D_t a weighted moving average of
m_t = base_t \times f(X\gamma, G\omega),
where f() is a product of two sigmoid functions (i.e., with values between 0 and 1),
X and G are covariate matrices,
\gamma \sim N(g, sd_\gamma),
\omega \sim N(w, sd_\omega),
base \sim logNormal(\mu_{base}, sd_{base}), with \mu_{base} \in [0,1]

Now sampling with the following explicit priors (rather than those in SSM2.2.stan):
sd_y ~ cauchy(0, 5);
sd_gamma ~ cauchy(0,5);
sd_omega ~ cauchy(0,5);
sd_base ~ cauchy(0, 5);
g ~ normal(0, 10);
w ~ normal(0, 10);

Simulated data are based on the following parameters:

J = 200 (firms), N=10x4 (quarters)

alpha<lower=-1,upper=1> = 0.04 
(sampling instead alpha01<lower=0,upper=1> = 0.52 
    using the conversion alpha = 2*alpha01 - 1)  
beta<lower=0,upper=1> = 0.6  
g = c(0.5, -0.8, 2.5), w = c(-3.5, 7, 2), 
d = c(0.05, 0.7, 0.25)  // simplex
sd_y = 0.08, sd_gamma = 0.05, sd_omega = 0.1, sd_base = 0.1   
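For illustration, a minimal R sketch of simulating just the latent AR(1) state process with these values (the observation layer with D_t, base, \gamma and \omega is omitted):

# Simulate the latent AR(1) state for a single firm
set.seed(1)
alpha <- 0.04; beta <- 0.6; sd_y <- 0.08
N <- 10 * 4                         # quarters
y <- numeric(N)
y[1] <- alpha / (1 - beta)          # start at the stationary mean
for (t in 2:N) {
  y[t] <- alpha + beta * y[t - 1] + rnorm(1, 0, sd_y)
}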

Sample_pairsplot.pdf (updated)
Sample_traceplot.pdf (updated)

Inference for Stan model:
8 chains, each with iter=3000; warmup=1000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=16000.

                          mean se_mean     sd       10%       50%       90% n_eff  Rhat
alpha01                  0.520   0.000  0.001     0.520     0.520     0.521 15935 1.000
beta                     0.602   0.000  0.009     0.591     0.602     0.614 13851 1.000
g[1]                     0.504   0.000  0.014     0.486     0.504     0.522 12432 1.000
g[2]                    -0.782   0.000  0.015    -0.801    -0.782    -0.763  5096 1.001
g[3]                     2.369   0.001  0.071     2.279     2.367     2.460  3672 1.002
w[1]                    -3.348   0.001  0.081    -3.451    -3.348    -3.245  4069 1.002
w[2]                     7.386   0.001  0.131     7.220     7.386     7.554  7762 1.001
w[3]                     1.831   0.001  0.053     1.763     1.831     1.898  3686 1.002
d[1]                     0.049   0.000  0.003     0.046     0.049     0.053 15065 1.000
d[2]                     0.702   0.000  0.002     0.700     0.702     0.705 19465 1.000
d[3]                     0.249   0.000  0.003     0.245     0.249     0.252 14967 1.000
sd_y                     0.079   0.000  0.001     0.078     0.079     0.080  6252 1.001
sd_gamma                 0.053   0.000  0.012     0.037     0.053     0.067  2015 1.002
sd_omega                 0.090   0.000  0.018     0.068     0.091     0.113  2523 1.002
sd_base                  0.101   0.000  0.002     0.097     0.100     0.104  5535 1.001
mu_base[1]               0.114   0.001  0.049     0.051     0.112     0.178  8958 1.000
mu_base[2]               0.357   0.001  0.083     0.250     0.356     0.465  9748 1.000
mu_base[3]               0.497   0.002  0.255     0.137     0.502     0.846 21046 1.000
mu_base[4]               0.121   0.001  0.053     0.052     0.119     0.191  8284 1.001
mu_base[5]               0.366   0.001  0.103     0.236     0.364     0.498  8418 1.000
mu_base[6]               0.201   0.001  0.059     0.125     0.200     0.277  9180 1.000
mu_base[7]               0.062   0.000  0.041     0.013     0.057     0.119 10464 1.001
mu_base[8]               0.221   0.001  0.064     0.139     0.221     0.303  8704 1.000
mu_base[9]               0.158   0.001  0.057     0.085     0.157     0.232 10909 1.000
mu_base[10]              0.152   0.001  0.052     0.085     0.152     0.220  6998 1.001
mu_base[11]              0.368   0.001  0.102     0.239     0.367     0.500  7437 1.000
...

I am not really an expert on this, but since nobody else replied, I will give it a go.

I am not aware of any principled way. I would try a combination that works both with simulated data and with small subsets of the actual data. But note that according to the “folk theorem”, you usually either have a well-behaved model, in which case the settings don’t really have a strong effect, or your model behaves badly, in which case changing the settings won’t help (there are exceptions, though).

The easiest answer concerns the number of post-warmup iterations: the number of effective samples you need depends on what you want to estimate. For central tendencies, low hundreds are more than enough; for extreme tail probabilities you might need tens of thousands or more. Use the n_eff / n_iterations ratio from the fit to a small subset of actual data to choose how many iterations you are likely to need, and then add some slack (see the sketch below).
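A rough sketch of that calculation (assuming rstan; fit_small is a hypothetical fit to a small subset, run with 8 chains and 2000 post-warmup iterations each):

library(rstan)

target_n_eff <- 400                                    # plenty for central tendencies
post_warmup_total <- 8 * 2000                          # chains * post-warmup iterations
s <- summary(fit_small)$summary                        # per-parameter summaries
eff_per_iter <- min(s[, "n_eff"]) / post_warmup_total  # efficiency of the worst-mixing parameter
draws_needed <- ceiling(target_n_eff / eff_per_iter)   # total post-warmup draws to aim for
draws_needed <- ceiling(1.5 * draws_needed)            # add some slack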

Possibly, but not necessarily. It might just mean that your variables are hard to initialize. You could try providing sensible initial values and/or reparameterizing so that the actual parameters are as close to N(0,1) as possible. This might actually be the case in the model you’ve sent, e.g.:

sd_gamma ~ normal(sd_gamma_init, 0.025); 

If sd_gamma_init is large in absolute value, this will be hard to initialize (Stan by default initializes between -2 and 2 on the unconstrained scale). The narrow sd doesn’t help either. You could explicitly initialize sd_gamma at sd_gamma_init (see the sketch below) or, better, declare sd_gamma_raw ~ normal(0, 1) and compute sd_gamma = sd_gamma_raw * 0.025 + sd_gamma_init.
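For the explicit-init route, a sketch of passing initial values from R (assuming rstan; ssm_model, stan_data and the numeric values are placeholders to be replaced by the actual model, data and sd_gamma_init):

library(rstan)

# Placeholder starting values; replace with sd_gamma_init etc. from the actual model
init_fun <- function() {
  list(sd_gamma = 0.05,
       sd_omega = 0.10,
       sd_base  = 0.10)
}

fit <- sampling(ssm_model, data = stan_data,
                chains = 8, iter = 3000, warmup = 1000,
                init = init_fun)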

The priors you have generally look weird; where did those come from?

I don’t think either of the two is likely to help. Nevertheless, it is possible that the apparent bias arises simply because the correct posterior is notably skewed. The best way to check whether there actually is bias is to use Simulation-Based Calibration (SBC); a toy sketch of the idea follows below.
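To make the idea concrete, here is a self-contained toy sketch of the SBC rank check (a stand-in normal-mean model so the loop stays cheap, not SSM2.2.stan):

library(rstan)

# Toy stand-in model: theta ~ normal(0, 1), y ~ normal(theta, 1)
toy_code <- "
data { int<lower=1> N; vector[N] y; }
parameters { real theta; }
model {
  theta ~ normal(0, 1);
  y ~ normal(theta, 1);
}
"
toy_model <- stan_model(model_code = toy_code)

n_sims <- 100                                       # number of prior draws / simulated data sets
ranks <- integer(n_sims)
for (s in 1:n_sims) {
  theta_true <- rnorm(1, 0, 1)                      # draw the parameter from its prior
  y_sim <- rnorm(20, theta_true, 1)                 # simulate data given that draw
  fit <- sampling(toy_model, data = list(N = 20, y = y_sim),
                  chains = 1, iter = 1000, warmup = 500,
                  thin = 5, refresh = 0)            # thin to reduce autocorrelation
  theta_post <- extract(fit, "theta")$theta
  ranks[s] <- sum(theta_post < theta_true)          # rank of the true value among the draws
}
hist(ranks, breaks = 20)                            # roughly uniform ranks => no systematic bias

With the real model, the same loop would draw all parameters from their priors, simulate a full panel of firms, and track one rank per parameter of interest.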

That is quite odd: do you need such high adapt_delta and max_treedepth to achieve a good fit? If so, you likely have a problem. If not, I would use less extreme values for the large datasets, as those settings often just result in the sampler taking ages to complete when it encounters problems, while the problems themselves are rarely actually mitigated.

I also believe there are likely some efficiency gains to be had by rephrasing some of the computations in matrix algebra, and possibly by using map_rect for within-chain parallelism. You may also want to check out normal_id_glm.

Hope that helps.


Thanks for taking the time to share your experience.

With some more exploration lately, I cannot agree more with your remark.

At the time of the original post, the model was already complex but still seemed identifiable when there was enough data, which was why I asked for advice on optimizing the run time. Because the model was hard to identify, the strict settings helped the sampler reach the target without running into divergent transitions or exceeding the maximum treedepth; otherwise, runs took even longer to complete, and if there were any divergent transitions, the estimates were unlikely to be reliable.

I was still learning how things work and had confused the choice of priors with setting the initial values. Thanks for pointing this out; I can now do the latter properly.

Thanks for these guidelines.

So true.