How to speed up sampling in rstan?

Hi stan community,

I am new to stan, and now need to fit a hierarchical model to a small simulated data via rstan, but I found sampling is very time-consuming (takes more than 12 hours even for 2000 iteration/500 warmup). I am not good at computing, so would be greatly appreciated if there are any tips!

The model has some features: (1) The responses are 2-dimensional; (2) it follows a specific special distribution (so I have to define the likelihood function by myself); (3) it has many parameters (we do not need care about the overfitting problem).

In order to speed up the sampling, I have tried my best to improve the stan model, including:

  1. Try to use vectorize operation, and avoid to use loops.
  2. Remove any not-needed parts to save time, for example, the generated quantities block
  3. Sampling the chains parallelly, so I simply use cores=4 at the stan(…) code in R studio

Besides these, I am also considering other possible strategies:
4. Adjust the adapt_delta and max_treedepth values, but I think it will influence the quality of posterior samples. Currently I use adapt_delta=0.95 and max_treedepth=15.
5. Any other parts or effective and well-designed codes in stan, for example, transformed parameters block. However, I do not know how to do in that way to speed things up.
6. Adjust priors, but I have no any good ideas.
7. Adjust initial starting points. To my knowledge, they may affect posterior samples’ quality, but not too much to the time cost of sampling.

My stan code is as below

functions{
  real two_d_st_lpdf(matrix XY, vector xi1, vector xi2, vector sigma1, vector sigma2, vector rho12, vector alpha1, vector alpha2){
    vector [cols(XY)] A;
    vector [cols(XY)] B;
    vector [cols(XY)] C1;
    vector [cols(XY)] C2;
    vector [cols(XY)] singlelogprob; 
    real lprob;
    A=sigma1 .* sigma1 .* sigma2 .* sigma2 -  rho12 .* rho12 .* sigma1 .* sigma1 .* sigma2 .* sigma2;
    B=(to_vector(XY[1,])- xi1 ).*(to_vector(XY[1,])- xi1) .* sigma2 .* sigma2-
      (to_vector(XY[1,])- xi1 ).*(to_vector(XY[2,])- xi2) .* rho12 .* sigma1 .* sigma2+
      (to_vector(XY[2,])- xi2 ).*(to_vector(XY[2,])- xi2) .* sigma1 .* sigma1-
      (to_vector(XY[1,])- xi1 ).*(to_vector(XY[2,])- xi2) .* rho12 .* sigma1 .* sigma2;
    C1=alpha1 .* ( to_vector(XY[1,])-xi1   ) ./  sigma1+alpha2 .* ( to_vector(XY[2,])-xi2   ) ./  sigma2;
    C2=(to_vector(XY[1,])-xi1).*(to_vector(XY[1,])-xi1) ./ sigma1+(to_vector(XY[2,])-xi2).*(to_vector(XY[2,])-xi2) ./ sigma2; 
    for(i in 1:cols(XY)){
      if(sigma1[i]<0){singlelogprob[i]=-1000000;   }      
      else{
          if(sigma2[i]<0){singlelogprob[i]=-1000000;}
            else{
              singlelogprob[i]=log(23)-0.5*log(pi()*23)-0.5*log(A[i])-12.5*log( 1+(B[i]/A[i])/23 )+student_t_lcdf(C1[i]*sqrt(25/(C2[i]+2))|25, 0, 1);
            }
      }
    }
    lprob = sum(singlelogprob);
    return lprob;
  }
}
data{
    int<lower=1> N;
    int<lower=1> N_Rj;
    int<lower=1> N_j;
    matrix[2, N] BH; //responses 
    int Rj[N]; //index for groups
    vector[N] Is;
    vector[N] Ic;
    vector[N] mid_year_corrected;
}
parameters{
    real ag;
    real<lower=0,upper=100> sd_ar;
    real As;
    real Ac;
    real bg;
    real<lower=0,upper=100> sd_br;
    real Bs;
    real Bc;
    real<lower=0,upper=300> ag2;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar2;
    real As2;
    real Ac2;
    real bg2;
    real<lower=0,upper=100> sd_br2;
    real Bs2;
    real Bc2;
    real<lower=0,upper=300> ag3;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar3;
    real As3;
    real Ac3;
    real bg3;
    real<lower=0,upper=100> sd_br3;
    real Bs3;
    real Bc3;
    real<lower=0,upper=300> ag4;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar4;
    real As4;
    real Ac4;
    real bg4;
    real<lower=0,upper=100> sd_br4;
    real Bs4;
    real Bc4;
    real<lower=0,upper=300> ag5;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar5;
    real<lower=0,upper=100> sd_ac5;
    real As5;
    real Ac5;
    real bg5;
    real<lower=0,upper=100> sd_br5;
    real<lower=0,upper=100> sd_bc5;
    real Bs5;
    real Bc5;
    real<lower=0,upper=300> ag6;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar6;
    real As6;
    real Ac6;
    real bg6;
    real<lower=0,upper=100> sd_br6;
    real Bs6;
    real Bc6;
    vector[N_Rj] ar;
    vector[N_Rj] br;
    vector[N_Rj] ar2;//////
    vector[N_Rj] br2;
    vector[N_Rj] ar3;//////
    vector[N_Rj] br3;
    vector[N_Rj] ar4;//////
    vector[N_Rj] br4;
    vector[N_Rj] ar5;//////
    vector[N_Rj] br5;
    vector[N_Rj] ar6;//////
    vector[N_Rj] br6;
    real <lower=-1,upper=1> rho12parameter;////////////////
}
model{
    vector[N] sigma1;
    vector[N] xi1;
    vector[N] sigma2;
    vector[N] xi2;
    vector[N] rho12;
    vector[N] alpha1;
    vector[N] alpha2;
    rho12parameter ~ uniform(-1,1);
    Bc ~ normal( 0 , 10 );
    Bs ~ normal( 0 , 10 );
    br ~ normal( 0 , sd_br );
    bg ~ normal( 0 , 10 );
    Ac ~ normal( 0 , 10 );
    As ~ normal( 0 , 10 );
    ar ~ normal( 0 , sd_ar );
    ag ~ normal( 0 , 100 );
    Bc2 ~ normal( 0 , 10 );
    Bs2 ~ normal( 0 , 10 );
    br2 ~ normal( 0 , sd_br2 );  
    bg2 ~ normal( 0 , 10 );
    Ac2 ~ normal( 0 , 10 );
    As2 ~ normal( 0 , 10 );
    ar2 ~ normal( 0 , sd_ar2 );
    ag2 ~ uniform( 0 , 300 );//
    Bc3 ~ normal( 0 , 10 );
    Bs3 ~ normal( 0 , 10 );
    br3 ~ normal( 0 , sd_br3 );  
    bg3 ~ normal( 0 , 10 );
    Ac3 ~ normal( 0 , 10 );
    As3 ~ normal( 0 , 10 );
    ar3 ~ normal( 0 , sd_ar3 );
    ag3 ~ uniform( 0 , 300 );
    Bc4 ~ normal( 0 , 10 );
    Bs4 ~ normal( 0 , 10 );
    br4 ~ normal( 0 , sd_br4 ); 
    bg4 ~ normal( 0 , 10 );
    Ac4 ~ normal( 0 , 10 );
    As4 ~ normal( 0 , 10 );
    ar4 ~ normal( 0 , sd_ar4 );
    ag4 ~ uniform( 0 , 300 );//
    Bc5 ~ normal( 0 , 10 );
    Bs5 ~ normal( 0 , 10 );
    br5 ~ normal( 0 , sd_br5 ); 
    bg5 ~ normal( 0 , 10 );
    Ac5 ~ normal( 0 , 10 );
    As5 ~ normal( 0 , 10 );
    ar5 ~ normal( 0 , sd_ar5 );
    ag5 ~ uniform( 0 , 300 );
    Bc6 ~ normal( 0 , 10 );
    Bs6 ~ normal( 0 , 10 );
    br6 ~ normal( 0 , sd_br6 ); 
    bg6 ~ normal( 0 , 10 );
    Ac6 ~ normal( 0 , 10 );
    As6 ~ normal( 0 , 10 );
    ar6 ~ normal( 0 , sd_ar6 );
    ag6 ~ uniform( 0 , 300 );
    xi1 = ag + ar[Rj] + As * Is + Ac * Ic +     (bg + br[Rj]  +Bs * Is + Bc * Ic).* mid_year_corrected ;
    xi2 = ag2 + ar2[Rj] + As2 * Is + Ac2 * Ic +     (bg2 + br2[Rj]  +Bs2 * Is + Bc2 * Ic).* mid_year_corrected ;
    sigma1 = ag3 + ar3[Rj] + As3 * Is + Ac3 * Ic +     (bg3 + br3[Rj]  +Bs3 * Is + Bc3 * Ic).* mid_year_corrected ;
    sigma2 = ag4 + ar4[Rj] + As4 * Is + Ac4 * Ic +     (bg4 + br4[Rj]  +Bs4 * Is + Bc4 * Ic).* mid_year_corrected ;
    for ( i in 1:N ) {
        rho12[i] = rho12parameter;
    }
    alpha1 = ag5 + ar5[Rj] + As5 * Is + Ac5 * Ic +     (bg5 + br5[Rj]  +Bs5 * Is + Bc5 * Ic).* mid_year_corrected ;
    alpha2 = ag6 + ar6[Rj] + As6 * Is + Ac6 * Ic +     (bg6 + br5[Rj]  +Bs6 * Is + Bc6 * Ic).* mid_year_corrected ;
    BH~ two_d_st(xi1,xi2,sigma1,sigma2,rho12,alpha1,alpha2);
}

In the function block, A, B, C1, C2 are four parts in the special likelihood function, and I have tried to avoid to use loops to calculate them, but I think I may still need to use loops in the user-defined function block.
I also used if-else because some parameters should be strictly positive to be well-defined by definition, so I set the log-likelihood to be very small (-1000000) when these parameters are negative.

Any tips and comments are welcome. Thanks in advance!

Do you have like a one term version of the model to work with? Especially since you’re working with simulated data it’ll be better to start small.

Presumably you got warnings about divergent transitions and hitting max treedepth.

For divergences, start by doing a non-centered parameterization for br/ar/br2, etc.

Check out this: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html

That might help out the treedepth issue a bunch, so just put that back to 10 or so and see what you get.

2 Likes

Hi Ben,

Many thanks for your reply and help.

Regarding to the model, it is multivariate skewed t (ST) distribution with the form proposed by https://core.ac.uk/download/pdf/82143587.pdf in the equation (16). Because here the model responses are only two dimensional, I have expressed the likelihood by every single parameter. Every single parameter of the ST distribution is also further expressed by linear predictors/parameters/covariates/etc. I do not have better ideas in stan to express this likelihood function in more effective ways.

Regarding fitted dataset, yes I did use a very small simulated data, and it only has 95 samples.

I have followed your suggestion, and used non-centered parameterization for the group random effects br/ar, etc. Now the corresponding stan blocks are changed to the one below. Hope this is what that html page means.

parameters{
    real ag;
    real<lower=0,upper=100> sd_ar;
    real As;
    real Ac;
    real bg;
    real<lower=0,upper=100> sd_br;
    real Bs;
    real Bc;
    real<lower=0,upper=300> ag2;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar2;
    real As2;
    real Ac2;
    real bg2;
    real<lower=0,upper=100> sd_br2;
    real Bs2;
    real Bc2;
    real<lower=0,upper=300> ag3;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar3;
    real As3;
    real Ac3;
    real bg3;
    real<lower=0,upper=100> sd_br3;
    real Bs3;
    real Bc3;
    real<lower=0,upper=300> ag4;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar4;
    real As4;
    real Ac4;
    real bg4;
    real<lower=0,upper=100> sd_br4;
    real Bs4;
    real Bc4;
    real<lower=0,upper=300> ag5;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar5;
    real<lower=0,upper=100> sd_ac5;
    real As5;
    real Ac5;
    real bg5;
    real<lower=0,upper=100> sd_br5;
    real<lower=0,upper=100> sd_bc5;
    real Bs5;
    real Bc5;
    real<lower=0,upper=300> ag6;///////////////////////////////////////////
    real<lower=0,upper=100> sd_ar6;
    real As6;
    real Ac6;
    real bg6;
    real<lower=0,upper=100> sd_br6;
    real Bs6;
    real Bc6;
    vector[N_Rj] ar_norm;
    vector[N_Rj] br_norm;
    vector[N_Rj] ar2_norm;//////
    vector[N_Rj] br2_norm;
    vector[N_Rj] ar3_norm;//////
    vector[N_Rj] br3_norm;
    vector[N_Rj] ar4_norm;//////
    vector[N_Rj] br4_norm;
    vector[N_Rj] ar5_norm;//////
    vector[N_Rj] br5_norm;
    vector[N_Rj] ar6_norm;//////
    vector[N_Rj] br6_norm;
    real <lower=-1,upper=1> rho12parameter;//////////////////////////////////////
}
transformed parameters{
  vector[N_Rj] ar;
  vector[N_Rj] br;
  vector[N_Rj] ar2;
  vector[N_Rj] br2;
  vector[N_Rj] ar3;
  vector[N_Rj] br3;
  vector[N_Rj] ar4;
  vector[N_Rj] br4;
  vector[N_Rj] ar5;
  vector[N_Rj] br5;
  vector[N_Rj] ar6;
  vector[N_Rj] br6;
  for (i in 1:N_Rj)
    ar[i] =  sd_ar * ar_norm[i];
  for (i in 1:N_Rj)
    br[i] =  sd_br * br_norm[i];
  for (i in 1:N_Rj)
    ar2[i] =  sd_ar2 * ar2_norm[i];
  for (i in 1:N_Rj)
    br2[i] =  sd_br2 * br2_norm[i];
  for (i in 1:N_Rj)
    ar3[i] =  sd_ar3 * ar3_norm[i];
  for (i in 1:N_Rj)
    br3[i] =  sd_br3 * br3_norm[i];
  for (i in 1:N_Rj)
    ar4[i] =  sd_ar4 * ar4_norm[i];
  for (i in 1:N_Rj)
    br4[i] =  sd_br4 * br4_norm[i];
  for (i in 1:N_Rj)
    ar5[i] =  sd_ar5 * ar5_norm[i];
  for (i in 1:N_Rj)
    br5[i] =  sd_br5 * br5_norm[i];
  for (i in 1:N_Rj)
    ar6[i] =  sd_ar6 * ar6_norm[i];
  for (i in 1:N_Rj)
    br6[i] =  sd_br6 * br6_norm[i];
}
model{
    vector[N] sigma1;
    vector[N] xi1;
    vector[N] sigma2;
    vector[N] xi2;
    vector[N] rho12;
    vector[N] alpha1;
    vector[N] alpha2;
    rho12parameter ~ uniform(-1,1);
    Bc ~ normal( 0 , 10 );
    Bs ~ normal( 0 , 10 );
    br_norm ~ normal( 0 , 1 );
    bg ~ normal( 0 , 10 );
    Ac ~ normal( 0 , 10 );
    As ~ normal( 0 , 10 );
    ar_norm ~ normal( 0 , 1);
    ag ~ normal( 0 , 100 );
    Bc2 ~ normal( 0 , 10 );
    Bs2 ~ normal( 0 , 10 );
    br2_norm ~ normal( 0 , 1 );  
    bg2 ~ normal( 0 , 10 );
    Ac2 ~ normal( 0 , 10 );
    As2 ~ normal( 0 , 10 );
    ar2_norm ~ normal( 0 , 1 );
    ag2 ~ uniform( 0 , 300 );//
    Bc3 ~ normal( 0 , 10 );
    Bs3 ~ normal( 0 , 10 );
    br3_norm ~ normal( 0 , 1 );  
    bg3 ~ normal( 0 , 10 );
    Ac3 ~ normal( 0 , 10 );
    As3 ~ normal( 0 , 10 );
    ar3_norm ~ normal( 0 , 1);
    ag3 ~ uniform( 0 , 300 );//
    Bc4 ~ normal( 0 , 10 );
    Bs4 ~ normal( 0 , 10 );
    br4_norm ~ normal( 0 , 1 );  
    bg4 ~ normal( 0 , 10 );
    Ac4 ~ normal( 0 , 10 );
    As4 ~ normal( 0 , 10 );
    ar4_norm ~ normal( 0 , 1 );
    ag4 ~ uniform( 0 , 300 );//
    Bc5 ~ normal( 0 , 10 );
    Bs5 ~ normal( 0 , 10 );
    br5_norm ~ normal( 0 , 1 );  
    bg5 ~ normal( 0 , 10 );
    Ac5 ~ normal( 0 , 10 );
    As5 ~ normal( 0 , 10 );
    ar5_norm ~ normal( 0 , 1 );
    ag5 ~ uniform( 0 , 300 );//
    Bc6 ~ normal( 0 , 10 );
    Bs6 ~ normal( 0 , 10 );
    br6_norm ~ normal( 0 , 1 );  
    bg6 ~ normal( 0 , 10 );
    Ac6 ~ normal( 0 , 10 );
    As6 ~ normal( 0 , 10 );
    ar6_norm ~ normal( 0 , 1 );
    ag6 ~ uniform( 0 , 300 );//
    xi1 = ag + ar[Rj] + As * Is + Ac * Ic +     (bg + br[Rj]  +Bs * Is + Bc * Ic).* mid_year_corrected ;
    xi2 = ag2 + ar2[Rj] + As2 * Is + Ac2 * Ic +     (bg2 + br2[Rj]  +Bs2 * Is + Bc2 * Ic).* mid_year_corrected ;
    sigma1 = ag3 + ar3[Rj] + As3 * Is + Ac3 * Ic +     (bg3 + br3[Rj]  +Bs3 * Is + Bc3 * Ic).* mid_year_corrected ;
    sigma2 = ag4 + ar4[Rj] + As4 * Is + Ac4 * Ic +     (bg4 + br4[Rj]  +Bs4 * Is + Bc4 * Ic).* mid_year_corrected ;
    for ( i in 1:N ) {
        rho12[i] = rho12parameter;
    }
    alpha1 = ag5 + ar5[Rj] + As5 * Is + Ac5 * Ic +     (bg5 + br5[Rj]  +Bs5 * Is + Bc5 * Ic).* mid_year_corrected ;
    alpha2 = ag6 + ar6[Rj] + As6 * Is + Ac6 * Ic +     (bg6 + br5[Rj]  +Bs6 * Is + Bc6 * Ic).* mid_year_corrected ;
    BMIheight~ two_d_st(xi1,xi2,sigma1,sigma2,rho12,alpha1,alpha2);
}

Not quite sure about the improvement for divergences, since unfortunately with fixed 2000 iteration whatever adapt_delta and max_treedepth values I used, ESS and Rhat are terrible (many ESS<3 and Rhat>1.1). I am now presuming if there is something logically or mathematically dangerous with my stan code, especially for the my-defined lpdf, and particularly under so many parameters.

However, If I put max_treedepth back to 10, sampling can become significantly faster. With 2000 iterations and 500 warmup, max_treedepth=15 will need roughly 10 hours, but max_treedepth=10 only take 20 mins, which is honestly quite surprising to me.

I think using non-centered parameterization may release the pressure to use high adapt_delta and max_treedepth values, hence to save sampling time with smaller but appropriate adapt_delta and max_treedepth values. However, divergence was not solved for me. As the ESS and Rhat are ‘terrible’, I guess in this case simply increasing iterations may not effective at the root.

Always greatly appreciated for any possible tips. Thanks so much.

Seems the divergence could be the bigger problem. I guess some reparameterization methods can help to get rid of that. Besides, I do not know if initial starting points can lead to divergence as well.

Also great thanks for any tips to improve the divergences.

It sounds like the sampler is making use of the super deep trees. This can happen if you have correlations in your posterior or non-identified parameters.

I think my recommendation remains trying to simplify the model (can you drop ar2/ar3/ar4 and just keep ar?), but if that’s not possible:

  1. Look at the correlations of the posterior and see if there’s anything interesting

  2. Look at the parameters that have larger Rhats/low neffs and see if there are any things you can just make constants/tighten the priors on.

1 Like

Thanks for the advice!

I have used a quite simplified model (the model block is as below), where I only remained ar/br and the flat priors of them were also shrunk.

model{
    vector[N] sigma1;
    vector[N] xi1;
    vector[N] sigma2;
    vector[N] xi2;
    vector[N] rho12;
    vector[N] alpha1;
    vector[N] alpha2;
    rho12parameter ~ uniform(-1,1);
    Bc ~ normal( 0 , 10 );
    Bs ~ normal( 0 , 10 );
    br_norm ~ normal( 0 , 1 );
    bg ~ normal( 0 , 10 );
    Ac ~ normal( 0 , 10 );
    As ~ normal( 0 , 10 );
    ar_norm ~ normal( 0 , 1);
    ag ~ normal( 0 , 100 );
    Bc2 ~ normal( 0 , 10 );
    Bs2 ~ normal( 0 , 10 );
    bg2 ~ normal( 0 , 10 );
    Ac2 ~ normal( 0 , 10 );
    As2 ~ normal( 0 , 10 );
    ag2 ~ uniform( 0 , 300 );//
    Bc3 ~ normal( 0 , 10 );
    Bs3 ~ normal( 0 , 10 );
    bg3 ~ normal( 0 , 10 );
    Ac3 ~ normal( 0 , 10 );
    As3 ~ normal( 0 , 10 );
    ag3 ~ uniform( 0 , 300 );//
    xi1 = ag + ar[Rj] + As * Is + Ac * Ic +     (bg + br[Rj]  +Bs * Is + Bc * Ic).* mid_year_corrected ;
    xi2 = ag2 + As2 * Is + Ac2 * Ic +     (bg2  +Bs2 * Is + Bc2 * Ic).* mid_year_corrected ;
    sigma1 = ag3 + As3 * Is + Ac3 * Ic +     (bg3   +Bs3 * Is + Bc3 * Ic).* mid_year_corrected ;
    sigma2 = sigma2parameter * mid_year_corrected ;
    for ( i in 1:N ) {
        rho12[i] = rho12parameter;
    }
        for ( i in 1:N ) {
        alpha1[i] = alpha1parameter;
    }
        for ( i in 1:N ) {
        alpha2[i] = alpha2parameter;
    }
    BMIheight~ two_d_st(xi1,xi2,sigma1,sigma2,rho12,alpha1,alpha2);
}

This model works well in my eyes, and I did some workflows for it in R:

> check_treedepth(stanfit)
0 of 1500 iterations saturated the maximum tree depth of 15.
> check_energy(stanfit)
E-BFMI indicated no pathological behavior.
> check_divergences(stanfit)
0 of 1500 iterations ended with a divergence.

I feel that solving the divergence is the root key of speeding things up, and it may be more significantly effective than any other speeding up strategies.

I am afraid that I may have to use the initial model, so now back to the bloody original model…

Regarding to hitting super deep trees, there are also kind person also saying in a similar way:

Sorry I do not know well what the correlations in posterior really means. Since all parameters in posterior generally will be correlated (they basically cannot be independent in probability), I think the correlation of these parameter samples is not surprising. Or does it mean that, from the algorithm point of view, in the stage of exploring/sampling some parameters cannot explore ‘freely’ or ‘appropriately’ because of the ‘constraint’ from the correlation with other parameters? Hence, what is the main reason causing the correlation?

The stan user guide used the Neal’s Funnel example: https://mc-stan.org/docs/2_18/stan-users-guide/reparameterization-section.html, in which the stan model is

parameters {
  real y;
  vector[9] x;
}
model {
  y ~ normal(0, 3);
  x ~ normal(0, exp(y/2));
}

In this case the posterior=prior*likelihood=prior (as no likelihood defined), and we can see x and y in this case are highly correlated, and our stepsize cannot satisfy both high-density region and low-density region simultaneously well, so the model may not converge. Is it also the main similar reason for my model? However, I have tried to avoid using correlated expression between each parameter’ priors, though I cannot do anything for the likelihood part.

For the badly-fitted initial model, the pairs of some parameters and the four chains for ad_ar are:



Quite strange results. Many MCMC samples are even piling up on the edge of priors unif(0,300) (The green chain in traceplot is not constant, but moving in a very small size). Any good suggestion here? Thanks a lot.

Besides, putting tighten on some priors can improve the divergence, but cannot speed up sampling significantly because the sampler is still hitting the maximum tree depth very often. However, the treaceplot (see below) looks better (one strange chain is constant, but at least quite better than the figure above), and perhaps I need adjust priors further. Thank you for your previous help.

Yeah, so just large correlations, like if you look at the posterior correlation just take the biggest few. If they’re close to 1, they’d be something to worry about.

All you’re doing here is looking for pieces of the model that aren’t performing well and trying to figure out what is breaking, so these correlations are just telling you what parameters to look at in the model and think about. Correlations can come from doing inference on a non-identifiable model (a large range of reasonable solutions and you just don’t have the data to narrow it down).

Can you replace uniform priors with something else? Like just do half-normals, and make them as small as reasonably possible (i.e. normal(0, 10) maybe).

Uniform priors lead to computational problems a lot. They don’t regularize well.

Just add one more piece back, maybe ar2, and see how things go.