N best tips & tricks (or the go-to checklist) for new Stan model builders?

tlyim · May 2, 2019, 2:17am

Is there a one-page checklist of the captioned written somewhere?

In my recent exploration as a newbie (see this post), I find the following tricks most useful:

specify any known bounds of a parameter (thereby help Stan to pick the appropriate default prior, eg, uniform vs. normal)
explicitly specify a prior for a parameter that better reflects your knowledge about it than Stan’s default prior choice could possibly represent in general
use non-centered parametrization EDIT: (eg, to fix problematic correlated parameters following this example)
- This fix might not work out of the box. In my case, need to set very tight priors for the additional _sd, _L, _err parameters resulting from the non-centred parametrization: eg,

  pgw_sd ~ normal(0, 0.1);
  pgw_L ~ lkj_corr_cholesky(2);
  pgw_err ~ normal(0, 0.05);

EDIT: must-read checklist, for new model builders, in the beginning of this post by @martinmodrak
- clear and intuitive visualization of the typical causes of various problematic pairs plot patterns in the rest of the post
- also, the Stan best practices link mentioned in the post

EDIT: (after two months of further exploration)

These two posts should also be interesting to new Stan model builders:
- Rhat < 1 (as low as 9.94e-01). Why?
- Divergence vs hitting max_treedepth
if you have a model that runs fine in terms of no divergences but the run time is slow (perhaps due to unnecessarily long warmup and iter), my findings about choosing the combination of adapt_delta / max_treedepth / metric=dense_e vs. diag_e / term_buffer / window / warmup / iter in another post might be useful to you.

Can you add to this list based on your experience? And perhaps even order your tips according to their usefulness/importance/priority.

Ara_Winter · May 2, 2019, 2:24pm

Hmmm in order

Build the simplest model possible first. So if you have a time series start with a linear model as y ~ x then build that out to y ~ x + time + (1|time) then on to your choice of time series models.

Make some fake data and run that against your model. You should get back your parameters.

Keep all your models where you were playing around with priors and your justification for those priors. I stick these into supplemental materials.

Where ever possible cite specific threads in this user group in your R and python workflow and in your reports/articles.

andre.pfeuffer · May 4, 2019, 1:14pm

Build a non-Bayes model first and get it to run. It’s useful as crosscheck later against Stan.
Check your matrices have full rank.
Try to get to run your Stan with low iterations 50 and one chain.
Use multiple chains to detect multi-modalities
Always look at neff for low values, these indicate problematic parameters.
Do a traceplot of problematic parameters, find suspicious interactions with pairs plot.
Buy a fast computer with lots of CPU cache.
Use exponential / half-normal distribution instead of half-cauchy in case of problems with standarddev.
Suppress the output of large arrays/matrices you don’t need in RStan or use Cmdstan.
Check the problematic scaling of the mass-matrix given from Cmdstan. Adjust parameterization.
Run optimizing instead of NUTS to check if the values look reasonable, if not find out why.
Step back and think about what you model should do, and don’t limit yourself to a key algorithm. There are many ways. Learn about your data and apply different models, the output will help to understand your data in many ways.
Use a routine, always save.image (complete dump) your session, before running a model.
Keep all models and simulations runs.

tlyim · May 4, 2019, 3:10pm

Thanks for adding your useful checklist.

Do you have a pointer to any concise (or thorough) discussion on the diagnostic steps using traceplot and pairs plot that can be added here as a reference for newbies?

tlyim · May 4, 2019, 3:30pm

How to determine whether neff is unacceptable?

For example, those parameters below with neff between 2000 to 5000 (ie, sd_gamma, sd_omega, g[3], w[1], w[3]) are particularly problematic had the sampling not been taken with 8 chains and 3000 iterations or had the data points not been so high as 200x10x4 = 8000.

But now with all these, are the relatively low neff (and the model based on those parameters) considered acceptable?
(relatively low when compared to those with neff > 10,000)

=====================================================
Inference for Stan model:
8 chains, each with iter=3000; warmup=1000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=16000.

                          mean se_mean     sd       10%       50%       90% n_eff  Rhat
alpha01                  0.520   0.000  0.001     0.520     0.520     0.521 15935 1.000
beta                     0.602   0.000  0.009     0.591     0.602     0.614 13851 1.000
g[1]                     0.504   0.000  0.014     0.486     0.504     0.522 12432 1.000
g[2]                    -0.782   0.000  0.015    -0.801    -0.782    -0.763  5096 1.001
g[3]                     2.369   0.001  0.071     2.279     2.367     2.460  3672 1.002
w[1]                    -3.348   0.001  0.081    -3.451    -3.348    -3.245  4069 1.002
w[2]                     7.386   0.001  0.131     7.220     7.386     7.554  7762 1.001
w[3]                     1.831   0.001  0.053     1.763     1.831     1.898  3686 1.002
d[1]                     0.049   0.000  0.003     0.046     0.049     0.053 15065 1.000
d[2]                     0.702   0.000  0.002     0.700     0.702     0.705 19465 1.000
d[3]                     0.249   0.000  0.003     0.245     0.249     0.252 14967 1.000
sd_y                     0.079   0.000  0.001     0.078     0.079     0.080  6252 1.001
sd_gamma                 0.053   0.000  0.012     0.037     0.053     0.067  2015 1.002
sd_omega                 0.090   0.000  0.018     0.068     0.091     0.113  2523 1.002
sd_base                  0.101   0.000  0.002     0.097     0.100     0.104  5535 1.001

andre.pfeuffer · May 4, 2019, 4:05pm

You have to see the traceplots and pairs plots. I mean less than 10% of the sample size and
Rhat away from 1.0 more than 1%. Then I would think, maybe I can and should improve this.

https://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/
https://cran.r-project.org/web/packages/bayesplot/vignettes/visual-mcmc-diagnostics.html

tlyim · May 4, 2019, 4:25pm

Thanks for the useful pointers. Appreciated.

Below are the sample plots of the example given. There seems to be nothing critical in this case. Have I overlooked anything that is alarming? Would appreciate very much your comment. Thanks in advance.

Sample_pairsplot.pdf
Sample_traceplot.pdf

avehtari · May 4, 2019, 7:10pm

Instead of traceplot we recommend using rank histogram plots [1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC. The online appendix has some examples comparing traceplots and rank histogram plots. Traceplots are not good for long tailed distributions and get fuzzy with long chains.

See the paper [1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC. We recommend minimum neff of about 100 per chain, so that the needed quantities for convergence diagnostic are estimated reliably. Given that, then instead of focusing to neff, you should figure out what is the needed accuracy for the quantity of interest (for reporting usually 2 significant digits is enough) and then you can check that mcse is sufficient (mcse computation uses neff, but it’s enough to focus on mcse for the quantity of interest).

tlyim · May 5, 2019, 3:43pm

Thanks for these guidelines. Appreciated.

denise-colombano · May 6, 2019, 12:28am

What is the notation for these other distributions or where are they documented?

Thank you,
Denise C.

andre.pfeuffer · May 6, 2019, 4:10am

First note that Stan does not necessarily need to sample from conjugation priors:

The story of priors for variance parameters goes back into history:
http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

Half-student-t or half-normal are discussed here:

exponential(1.0) and half - student_t(4, 0, 1) are similar, the exponential(1.0)
is more efficient to calculate.

Topic		Replies	Views
Help with understanding my results Modeling fitting-issues , specification , performance	5	1221	February 8, 2019
Looking for a stan program critique Modeling techniques	14	1729	April 17, 2018
My first Stan model - hierarchical logistic regression Modeling cognitive-science	28	8422	June 6, 2017
Request for help understanding model behavior Modeling	17	1573	March 24, 2018
How to speed up sampling in rstan? Modeling rstan , performance , hierarchical-model	6	6136	July 24, 2020

N best tips & tricks (or the go-to checklist) for new Stan model builders?

Related topics