EDIT: Those interested in this topic might be interested in my findings in a related post (after two months of further exploration).
=========================================
Lately, I have been learning Stan by building my first model, which is modified from a state-space model with an AR(1) latent state process. After a long process of trial and error, reading related discussions, etc., I now have a model that seems to run all right on simulated data at a small scale (e.g., 20 firms with 8x4 quarters). There appear to be no divergent transitions or other warnings after several trials on newly simulated data.
My questions:

How can I optimize the combination of iter / warmup / adapt_delta / max_treedepth before scaling up to a large scale (e.g., 200 firms with 8x4 quarters)?
 Each run at the large scale can take many hours, so it would be best to have a principled way to choose these options that minimizes run time without introducing problems.

Does a sharp difference between the fastest and slowest chains (e.g., a factor of 4 or even over 15) in a small-scale run suggest hidden issues that are likely to be exacerbated in large-scale runs?
 Again, each large-scale run takes a long time, so it would be best to identify and fix any hidden issues beforehand.

The estimates of sd_gamma and/or sd_omega and/or beta in my model (outlined below) still tend to be biased downward (even when there are no divergent transitions). Should I expect this issue (if it really is an issue) to largely disappear with one or more of the following:
 increasing the number of sampling iterations substantially (i.e., the iter - warmup difference)
 scaling up to many firms (so that the estimates of the pooled parameters might become much more accurate)
(Updated on 4 May 2019)
Currently, I set n_chains = 8, iter = 3000, warmup = 1000, adapt_delta = 0.9999, max_treedepth = 13.
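As a quick cross-check of the sampling budget implied by these settings (a minimal Python sketch; all numbers come from the settings above):

```python
# Post-warmup draw budget implied by the settings above.
n_chains = 8
iter_total = 3000   # iterations per chain, including warmup
warmup = 1000       # warmup iterations per chain (discarded)

draws_per_chain = iter_total - warmup   # sampling iterations kept per chain
total_draws = n_chains * draws_per_chain
print(draws_per_chain, total_draws)  # 2000 16000
```

This matches the "post-warmup draws per chain=2000, total post-warmup draws=16000" line in the summary further down.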
My model is outlined below (SSM2.2.stan (3.0 KB)):
state process: y_t = \alpha + \beta y_{t-1} + \epsilon_t, with \epsilon_t ~ N(0, sd_y)
observation process: r_t = y_t + D_t, with D_t being a weighted moving average of
m_t = base_t \times f(X\gamma, G\omega),
where f() is a product of two sigmoid functions (i.e., with values between 0 and 1)
X and G are covariate matrices,
\gamma ~ N(g,sd_\gamma)
\omega ~ N(w,sd_\omega)
base ~ logNormal(\mu_{base}, sd_{base}), with \mu_{base} \in [0,1]
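For concreteness, here is a minimal pure-Python sketch of the f() term above: a product of two sigmoids, so it necessarily takes values strictly between 0 and 1. The linear-predictor values below are purely illustrative and are not taken from SSM2.2.stan or the simulated data.

```python
import math

def sigmoid(z):
    """Standard logistic function, values strictly in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def f(x_lin, g_lin):
    """Product of two sigmoids of the linear predictors X*gamma and G*omega.
    Since each factor is in (0, 1), the product is also in (0, 1)."""
    return sigmoid(x_lin) * sigmoid(g_lin)

# Illustrative linear-predictor pairs (hypothetical values).
vals = [f(xg, gw) for xg, gw in [(-1.2, 0.4), (0.0, 0.0), (3.0, -2.5)]]
```

At (0, 0) both sigmoids equal 0.5, so f is exactly 0.25 there; large positive predictors push f toward 1 and large negative ones toward 0, which is what makes f a bounded multiplier on base_t.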
Now sampling with the following explicit priors (rather than those in SSM2.2.stan):
sd_y ~ cauchy(0, 5);
sd_gamma ~ cauchy(0,5);
sd_omega ~ cauchy(0,5);
sd_base ~ cauchy(0, 5);
g ~ normal(0, 10);
w ~ normal(0, 10);
Simulated data are based on the following parameters:
J = 200 (firms), N=10x4 (quarters)
alpha<lower=-1,upper=1> = 0.04
(sampling instead alpha01<lower=0,upper=1> = 0.52
using the conversion alpha = 2*alpha01 - 1)
beta<lower=0,upper=1> = 0.6
g = c(0.5, -0.8, 2.5), w = c(-3.5, 7, 2),
d = c(0.05, 0.7, 0.25) // simplex
sd_y = 0.08, sd_gamma = 0.05, sd_omega = 0.1, sd_base = 0.1
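For reference, a minimal pure-Python sketch of simulating one firm's AR(1) state series with the alpha, beta, and sd_y values above (the observation layer, firm dimension, and hierarchical draws of gamma/omega/base are omitted; this is not the author's simulation code):

```python
import random

def simulate_state(alpha=0.04, beta=0.6, sd_y=0.08, n=40, seed=1):
    """Simulate y_t = alpha + beta * y_{t-1} + eps_t, eps_t ~ N(0, sd_y),
    starting from the stationary mean alpha / (1 - beta)."""
    rng = random.Random(seed)
    y = [alpha / (1.0 - beta)]          # stationary mean = 0.04 / 0.4 = 0.1
    for _ in range(1, n):
        y.append(alpha + beta * y[-1] + rng.gauss(0.0, sd_y))
    return y

y = simulate_state()   # one firm's latent series over N = 10x4 quarters
```

Since |beta| < 1 the process is stationary, so starting at the stationary mean avoids a burn-in transient in the simulated series.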
Sample_pairsplot.pdf (updated)
Sample_traceplot.pdf (updated)
Inference for Stan model:
8 chains, each with iter=3000; warmup=1000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=16000.
mean se_mean sd 10% 50% 90% n_eff Rhat
alpha01 0.520 0.000 0.001 0.520 0.520 0.521 15935 1.000
beta 0.602 0.000 0.009 0.591 0.602 0.614 13851 1.000
g[1] 0.504 0.000 0.014 0.486 0.504 0.522 12432 1.000
g[2] -0.782 0.000 0.015 -0.801 -0.782 -0.763 5096 1.001
g[3] 2.369 0.001 0.071 2.279 2.367 2.460 3672 1.002
w[1] -3.348 0.001 0.081 -3.451 -3.348 -3.245 4069 1.002
w[2] 7.386 0.001 0.131 7.220 7.386 7.554 7762 1.001
w[3] 1.831 0.001 0.053 1.763 1.831 1.898 3686 1.002
d[1] 0.049 0.000 0.003 0.046 0.049 0.053 15065 1.000
d[2] 0.702 0.000 0.002 0.700 0.702 0.705 19465 1.000
d[3] 0.249 0.000 0.003 0.245 0.249 0.252 14967 1.000
sd_y 0.079 0.000 0.001 0.078 0.079 0.080 6252 1.001
sd_gamma 0.053 0.000 0.012 0.037 0.053 0.067 2015 1.002
sd_omega 0.090 0.000 0.018 0.068 0.091 0.113 2523 1.002
sd_base 0.101 0.000 0.002 0.097 0.100 0.104 5535 1.001
mu_base[1] 0.114 0.001 0.049 0.051 0.112 0.178 8958 1.000
mu_base[2] 0.357 0.001 0.083 0.250 0.356 0.465 9748 1.000
mu_base[3] 0.497 0.002 0.255 0.137 0.502 0.846 21046 1.000
mu_base[4] 0.121 0.001 0.053 0.052 0.119 0.191 8284 1.001
mu_base[5] 0.366 0.001 0.103 0.236 0.364 0.498 8418 1.000
mu_base[6] 0.201 0.001 0.059 0.125 0.200 0.277 9180 1.000
mu_base[7] 0.062 0.000 0.041 0.013 0.057 0.119 10464 1.001
mu_base[8] 0.221 0.001 0.064 0.139 0.221 0.303 8704 1.000
mu_base[9] 0.158 0.001 0.057 0.085 0.157 0.232 10909 1.000
mu_base[10] 0.152 0.001 0.052 0.085 0.152 0.220 6998 1.001
mu_base[11] 0.368 0.001 0.102 0.239 0.367 0.500 7437 1.000
...