Bad Rhat values and divergent transitions when sampling from the prior

I am trying to sample from the prior so I can compare it to the posterior. However, if I simply remove the likelihood (the distribution of the data), I get bad Rhat values. Strangely, this only happens when I remove just those parts of the model. If I go further and reduce the model to the bare minimum (i.e. remove all other variables as well), the problem disappears. So if I use this model:

data {
  int<lower=1>                 M;          // Number of subjects
}

parameters {
  // Group level parameters
  real                                   alpha_mu;          // Boundary separation mean     (parameter to log normal)
  real<lower=0>                          alpha_sigma;       // Boundary separation variance (parameter to log normal)

  // Individual parameters
  vector<lower=0>[M]                     alpha;      // Individual boundary separation
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(10);

  alpha   ~ lognormal(alpha_mu,alpha_sigma);
}

Then everything is fine (note that all individual alphas are about the same):

Inference for Stan model: LogNormal_Priors.
4 chains, each with iter=4000; warmup=1000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=12000.

             mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
alpha_mu    -0.01    0.11 1.03 -1.79 -0.73 -0.03 0.64  2.25    95 1.04
alpha_sigma  0.16    0.01 0.12  0.04  0.08  0.12 0.21  0.48   202 1.03
alpha[1]     1.82    0.33 2.90  0.16  0.48  0.97 1.90  9.84    76 1.06
alpha[2]     1.82    0.33 2.91  0.16  0.48  0.97 1.90  9.49    76 1.06
alpha[3]     1.82    0.33 2.92  0.16  0.48  0.97 1.90  9.57    78 1.06
alpha[4]     1.82    0.34 2.96  0.16  0.48  0.97 1.92  9.60    76 1.06
alpha[5]     1.81    0.33 2.93  0.16  0.47  0.97 1.89  9.55    77 1.06
alpha[6]     1.83    0.34 3.11  0.16  0.48  0.97 1.92  9.63    83 1.05
alpha[7]     1.83    0.34 2.96  0.16  0.48  0.97 1.93  9.81    78 1.06
alpha[8]     1.82    0.33 2.95  0.16  0.48  0.97 1.91  9.65    78 1.06
alpha[9]     1.82    0.33 2.92  0.16  0.48  0.96 1.90  9.65    78 1.06
alpha[10]    1.82    0.33 2.92  0.16  0.47  0.97 1.90  9.48    79 1.06
alpha[11]    1.82    0.33 2.89  0.16  0.48  0.97 1.92  9.48    77 1.06
alpha[12]    1.82    0.33 2.92  0.16  0.48  0.97 1.91  9.55    77 1.06
alpha[13]    1.82    0.33 2.92  0.16  0.47  0.97 1.92  9.47    76 1.06
alpha[14]    1.82    0.33 2.96  0.16  0.47  0.97 1.92  9.59    80 1.05
alpha[15]    1.82    0.33 2.88  0.16  0.47  0.97 1.90  9.60    76 1.06
alpha[16]    1.82    0.34 2.94  0.16  0.47  0.98 1.89  9.64    76 1.06
alpha[17]    1.83    0.33 2.95  0.16  0.47  0.97 1.89  9.63    79 1.06
alpha[18]    1.82    0.33 2.92  0.16  0.48  0.97 1.92  9.52    79 1.06
alpha[19]    1.83    0.33 2.99  0.16  0.48  0.97 1.91  9.65    80 1.05
alpha[20]    1.83    0.34 2.97  0.16  0.48  0.96 1.92  9.56    77 1.06

Samples were drawn using NUTS(diag_e) at Mon Sep 21 14:24:46 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

However, if I include more of the original model, the Rhat values suddenly become worse (note that the parts concerning alpha are exactly the same):

data {
  int<lower=1>                 N;          // Number of trials
  int<lower=1>                 M;          // Number of subjects
  vector<lower=0>[N]           RT;         // Reaction times
  int<lower=1>                 subj[N];    // Subject number for RT i
  vector<lower=0,upper=1>[N]   resp_l;     // 1 if the response was on the left, 0 otherwise
  vector<lower=0,upper=1>[N]   incomp;     // 1 if the trial was incompatible, 0 otherwise
  vector<lower=0,upper=1>[N]   acc;        // Accuracy: correct (1) or incorrect (0) response
  real<lower=0>                NDTMin;     // Minimal Non-Decision time
  real<lower=0>                minRT[M];   // Minimum RT for each subject (upper bound for tau)
}

parameters {
  // Group level parameters
  real                                   alpha_mu;          // Boundary separation mean     (parameter to log normal)
  real<lower=0>                          alpha_sigma;       // Boundary separation variance (parameter to log normal)

  real<lower=1>                          beta_alpha;        // alpha parameter to beta distribution for starting value
  real<lower=1>                          beta_beta;         // beta  parameter to beta distribution for starting value

  real                                   delta_mu;          // mean drift rate (group level)
  real<lower=0>                          delta_sigma;       // variance

  real                                   eta_mu;            // Drift rate difference between compatible / incompatible trials
  real<lower=0>                          eta_sigma;         // Drift rate difference (variance component)

  // Individual parameters
  vector<lower=0>[M]                     alpha;      // Individual boundary separation
  vector<lower=0,upper=1>[M]             beta;       // Individual starting value
  vector[M]                              delta;      // Individual drift rate
  vector[M]                              eta_z;      // Effect for this participant (z-score)
  vector<lower=NDTMin>[M]                tau;        // non-decision time (no hierarchical model)
}

transformed parameters {
  vector[N] beta_trl;   // Beta for each trial
  vector[N] delta_trl;  // Drift rate in each trial

  vector[M] eta;        // Individual compatibility effects

  eta = eta_mu + eta_z*eta_sigma;

  // The initial offset should mostly depend on handedness etc.,
  // i.e. a single offset towards left/right responses;
  // therefore we mirror beta (beta -> 1 - beta) if the response
  // was on the left
  beta_trl = beta[subj] + resp_l - 2*beta[subj] .* resp_l;


  // (2*acc-1) converts accuracy (0/1) into a sign (-1/+1), flipping
  // the drift direction for incorrect responses
  delta_trl = (delta[subj] + incomp .* eta[subj]) .* (2*acc-1);
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(10);

  tau         ~ uniform(NDTMin, minRT);

  delta_mu    ~ normal(0,10);
  delta_sigma ~ cauchy(0,10);

  beta_alpha ~ exponential(1);
  beta_beta  ~ exponential(1);

  eta_mu      ~ normal(0,10);
  eta_sigma   ~ cauchy(0,100);

  alpha   ~ lognormal(alpha_mu,alpha_sigma);
  beta    ~ beta(beta_alpha, beta_beta);
  delta   ~ normal(delta_mu,delta_sigma);
  eta_z   ~ std_normal();
}

Then I get the result:

Inference for Stan model: DDM_Priors.
4 chains, each with iter=4000; warmup=1000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=12000.

             mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
alpha_mu    -0.22    0.18 0.38 -0.86 -0.46 -0.27 0.01  0.56     5 1.71
alpha_sigma  0.48    0.12 0.22  0.07  0.33  0.48 0.63  0.90     4 2.27
alpha[1]     0.99    0.18 0.69  0.23  0.59  0.83 1.20  2.55    14 1.25
alpha[2]     0.96    0.12 0.61  0.12  0.56  0.82 1.24  2.45    25 1.12
alpha[3]     0.87    0.37 0.69  0.11  0.23  0.79 1.26  2.88     4 2.52
alpha[4]     1.07    0.14 0.72  0.22  0.57  0.89 1.40  3.11    25 1.18
alpha[5]     0.96    0.11 0.56  0.25  0.55  0.77 1.25  2.30    24 1.13
alpha[6]     0.80    0.29 0.61  0.10  0.35  0.66 1.12  2.30     4 1.45
alpha[7]     0.98    0.15 0.73  0.26  0.55  0.80 1.22  2.63    24 1.16
alpha[8]     1.19    0.22 0.77  0.27  0.64  0.97 1.58  3.13    12 1.29
alpha[9]     0.86    0.20 0.67  0.09  0.37  0.56 1.22  2.61    11 1.46
alpha[10]    0.89    0.15 0.57  0.15  0.52  0.71 1.13  2.32    14 1.34
alpha[11]    1.02    0.20 0.50  0.30  0.66  0.88 1.30  2.19     6 1.35
alpha[12]    0.99    0.22 0.62  0.15  0.49  0.87 1.38  2.52     8 1.31
alpha[13]    1.23    0.26 1.47  0.29  0.59  0.90 1.40  3.86    33 1.08
alpha[14]    0.94    0.13 0.57  0.26  0.52  0.78 1.25  2.28    20 1.15
alpha[15]    0.91    0.14 0.60  0.21  0.49  0.75 1.23  2.25    17 1.16
alpha[16]    1.15    0.09 0.64  0.26  0.75  1.09 1.43  2.68    48 1.05
alpha[17]    0.91    0.21 0.58  0.18  0.48  0.74 1.26  2.32     8 1.27
alpha[18]    1.20    0.18 0.54  0.40  0.76  1.13 1.53  2.42     9 1.28
alpha[19]    1.07    0.16 0.71  0.25  0.56  0.84 1.40  2.93    20 1.25
alpha[20]    0.88    0.18 0.48  0.35  0.50  0.74 1.13  2.04     7 1.35

Samples were drawn using NUTS(diag_e) at Mon Sep 21 15:01:45 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

So the individual alphas now differ from each other, and the Rhat values have become worse. I don’t really understand what is going on here, as the parts of the model concerning these variables are unchanged.

Hi Tillman,

This is likely happening because some part of the model is causing trouble for the sampler, which results in poor mixing for the entire model. Something to keep in mind with Stan is that you’re building a single joint log density for the model, so the parameters are not ‘explored’ independently of each other (@nhuurre, have I got that right?).

I’d recommend iteratively building your model back up from the prior-only (i.e., just alpha) model, to see which part of your model is tripping up the sampler.


Thanks for the hint. The problem seems to be related to the uniform distribution over tau, but I am really surprised that a simple uniform distribution can wreak such havoc on the model. I get the same bad Rhat values for this model:

data {
  int<lower=1>                 M;          // Number of subjects
  real<lower=0>                NDTMin;     // Minimal Non-Decision time
  real<lower=0>                minRT[M];   // Minimum RT for each subject (upper bound for tau)
}

parameters {
  // Group level parameters
  real                                   alpha_mu;          // Boundary separation mean     (parameter to log normal)
  real<lower=0>                          alpha_sigma;       // Boundary separation variance (parameter to log normal)
  vector<lower=0>[M]                     alpha;      // Individual boundary separation
  vector<lower=NDTMin>[M]                tau;        // non-decision time (no hierarchical model)
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(1);

  tau         ~ uniform(NDTMin, minRT);

  alpha   ~ lognormal(alpha_mu,alpha_sigma);
}

Which gives me:

Inference for Stan model: DDM_Priors_uniform.
4 chains, each with iter=4000; warmup=1000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=12000.

             mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
alpha_mu    -0.15    0.25 0.50 -1.57 -0.42 -0.04 0.20  0.59     4 2.07
alpha_sigma  0.87    0.07 0.22  0.50  0.71  0.87 1.02  1.31     9 1.40
alpha[1]     2.75    1.72 4.73  0.12  0.58  1.09 2.57 17.99     8 1.32
alpha[2]     1.90    0.91 4.05  0.12  0.43  0.84 1.42 14.77    20 1.24
alpha[3]     1.82    0.69 1.76  0.17  0.53  0.93 2.80  5.94     6 1.39
alpha[4]     1.28    0.25 1.70  0.08  0.45  0.84 1.60  4.30    46 1.05
alpha[5]     1.00    0.29 0.96  0.19  0.35  0.67 1.25  3.69    11 1.55
alpha[6]     1.72    0.46 2.13  0.12  0.51  1.04 1.91  7.79    21 1.18
alpha[7]     2.79    1.04 3.16  0.12  0.65  1.49 4.26 10.61     9 1.30
alpha[8]     0.91    0.13 0.78  0.12  0.40  0.74 1.15  2.99    38 1.11
alpha[9]     0.82    0.11 0.65  0.06  0.40  0.63 1.11  2.35    36 1.12
alpha[10]    1.44    0.32 1.62  0.11  0.42  0.86 1.99  5.76    26 1.21
alpha[11]    0.87    0.16 0.75  0.13  0.34  0.66 1.16  3.03    21 1.15
alpha[12]    1.31    0.45 2.08  0.12  0.43  0.68 1.35  8.07    21 1.12
alpha[13]    2.36    1.72 3.51  0.24  0.51  0.80 2.27 13.74     4 1.84
alpha[14]    1.20    0.21 1.38  0.11  0.40  0.76 1.52  4.47    42 1.06
alpha[15]    1.23    0.25 1.17  0.16  0.49  0.84 1.54  4.65    22 1.16
alpha[16]    1.93    0.48 2.18  0.21  0.61  1.13 2.45  8.86    21 1.19
alpha[17]    1.15    0.25 0.70  0.25  0.62  1.02 1.54  2.80     8 1.32
alpha[18]    1.12    0.27 1.33  0.13  0.37  0.69 1.40  4.64    24 1.16
alpha[19]    2.27    0.94 3.03  0.13  0.44  1.28 2.73 11.63    10 1.30
alpha[20]    1.24    0.25 1.06  0.11  0.52  0.94 1.69  3.72    18 1.21

Samples were drawn using NUTS(diag_e) at Mon Sep 21 15:52:25 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Weirdly, if I just slightly change the model:

data {
  int<lower=1>                 M;          // Number of subjects
  real<lower=0>                NDTMin;     // Minimal Non-Decision time
  real<lower=0>                minRT[M];   // Minimum RT for each subject (upper bound for tau)
}

parameters {
  // Group level parameters
  real                                   alpha_mu;          // Boundary separation mean     (parameter to log normal)
  real<lower=0>                          alpha_sigma;       // Boundary separation variance (parameter to log normal)
  vector<lower=0>[M]                     alpha;      // Individual boundary separation
  vector<lower=NDTMin>[M]                tau;        // non-decision time (no hierarchical model)
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(1);

  for(i in 1:M) {
    tau[i]         ~ uniform(NDTMin, minRT[i]);
  }

  alpha   ~ lognormal(alpha_mu,alpha_sigma);
}

Sampling is still bad, but not as bad.

Is there something I could use instead of a uniform here? And why does a simple uniform cause this problem?

It’s likely because your uniform prior doesn’t match the bounds you’ve specified for tau. In other words, your declaration says that tau can take any value from NDTMin to positive infinity, but your prior says it can only take values from NDTMin to minRT. Whenever the sampler proposes a tau above minRT, the log density drops to negative infinity, which shows up as rejections and poor mixing. So you need to add the upper bound to your declaration of tau.

Additionally, if you add the upper bound in the parameter declaration, then you don’t need to specify a prior in the model block at all, as tau will implicitly be given a uniform prior across the bounds.
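
For a single shared bound, that would look like this (a minimal sketch; maxTau is a hypothetical scalar upper bound passed in as data, since your model has a per-subject minRT instead):

data {
  int<lower=1>                 M;          // Number of subjects
  real<lower=0>                NDTMin;     // Minimal non-decision time
  real<lower=NDTMin>           maxTau;     // Hypothetical shared upper bound
}

parameters {
  vector<lower=NDTMin, upper=maxTau>[M]  tau;
}

model {
  // no sampling statement for tau needed: the bounds imply
  // tau ~ uniform(NDTMin, maxTau)
}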

You have to non-center the alphas when there is vanishing data, although the uniform prior on tau is only going to make the funnel worse. For more see https://betanalpha.github.io/assets/case_studies/divergences_and_bias.html.
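
For this model, non-centering the alphas would look something like this (a minimal sketch; alpha_z is a helper parameter not in your original model):

data {
  int<lower=1>                 M;          // Number of subjects
}

parameters {
  real                         alpha_mu;
  real<lower=0>                alpha_sigma;
  vector[M]                    alpha_z;    // Standardized individual effects
}

transformed parameters {
  // exp(alpha_mu + alpha_sigma * alpha_z) has the same lognormal
  // distribution as before, but alpha_z is a priori independent of
  // alpha_mu and alpha_sigma, which removes the funnel geometry
  vector[M] alpha = exp(alpha_mu + alpha_sigma * alpha_z);
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(10);
  alpha_z     ~ std_normal();
}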


The problem with that approach is that there is a different upper bound for each subject. I can’t really specify that in the declaration, so I am stuck with something like this.

You can actually pass a vector of bounds to the parameter (if you’re set on a uniform prior). So you would pass as data a vector of bounds, and then use that in the declaration:

data {
  ...
  vector[M] u_bounds;
}

parameters {
  ...
  vector<lower=NDTMin, upper=u_bounds>[M]  tau;
}

But I believe this is only available in Stan 2.24, so you would need to use the cmdstanR interface if you wanted to use that.

This works in the latest CmdStan:

data {
  ...
  vector<lower=NDTMin>[M] minRT;
}
parameters {
  ...
  vector<lower=NDTMin, upper=minRT>[M] tau;
}

Also, you can always do:

parameters {
  vector<lower=0, upper=1>[M] tau_z;
}
transformed parameters {
  vector[M] tau = NDTMin + tau_z .* (minRT - NDTMin);
}
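
(Since the map from tau_z to tau is linear with a slope, minRT - NDTMin, that depends only on data, the Jacobian is constant and no adjustment to the target density is needed: the implicit uniform(0,1) prior on tau_z induces exactly a uniform(NDTMin, minRT) prior on each tau.)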

Neat trick in the second block, haven’t seen that before!

OK, it makes sense that the centering causes issues with the priors. However, the reason I have these centered priors is that with the actual model it is quite the opposite: the model behaves very badly if I use the non-centered parameterization and only behaves well when I use the centered version as above. The parameters feed into a Wiener distribution, which just does not seem to like the non-centered version in this case.

I had a question on CrossValidated on this before, which concerned the exact same model.

I am trying it out again with a more complete version of the model (the CV question was about a preliminary version). So far, it seems the problem with the non-centered parameterization remains. As soon as I apply Matt’s trick, the model becomes very slow. I can’t give any definite insights yet, as sampling requires 1h+ with the current parameters and is not done.

I have now tried the non-centered version (Matt’s trick) again and found that it works well on the prior. At the same time, the actual model fit to the data is at least 50% slower. On the plus side, I don’t get divergences as I did when I was still developing the model.

Still, 50% more computation time is not something I can afford right now. This model is part of a set of 16 models that I am comparing using looic, so 1.5h of computation for each model would mean 24h in total.

Maybe it would make sense to use the centered parameterization when fitting the data and the non-centered one only when sampling from the prior. The results seem exactly the same for both parameterizations when the data are used, so there doesn’t seem to be any mistake.
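
If you want to avoid maintaining two model files, here is a minimal sketch of one way to switch between the two parameterizations (assuming a hypothetical prior_only flag passed in as data; only the alpha part is shown):

data {
  int<lower=1>          M;           // Number of subjects
  int<lower=0, upper=1> prior_only;  // Hypothetical switch: 1 = sample from the prior
}

parameters {
  real                  alpha_mu;
  real<lower=0>         alpha_sigma;
  vector[M]             alpha_raw;   // z-scores if prior_only, log(alpha) otherwise
}

transformed parameters {
  vector[M] alpha;
  if (prior_only)
    alpha = exp(alpha_mu + alpha_sigma * alpha_raw);  // non-centered
  else
    alpha = exp(alpha_raw);                           // centered
}

model {
  alpha_mu    ~ std_normal();
  alpha_sigma ~ exponential(10);
  if (prior_only)
    alpha_raw ~ std_normal();
  else
    alpha_raw ~ normal(alpha_mu, alpha_sigma);
}

The centered branch works because putting a normal prior on log(alpha) is exactly equivalent, including the change-of-variables adjustment Stan applies for a lower=0 bound, to putting a lognormal prior on alpha itself.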

The optimality of the centered vs the non-centered parameterization depends on the shape of the likelihood function. When the likelihood function is diffuse (or absent, in the case of a pure prior) the non-centered parameterization yields the better posterior density geometry, but when the likelihood function is narrow the centered parameterization yields the better geometry.
