Stan do not use my init and divergent transition issue

Hello everyone,

I am posting because I have two major issue with my model and one minor.

First, Stan struggle to randomly find good initial value for my parameters, especially for the transformed parameters covS100, covS101, … I try to specify initial value for these parameters (to 0 in this code) but, the sampler persist in randomly choosing initial value for these parameters. They are highly restricted parameters but it should be working when initializing them at zero.
Error message:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: model1d0861eb27c2_3kit_fem_se_model_namespace::write_array: covS100 is -0.181278, but must be greater than or equal to -0.0621561 (in ‘model1d0861eb27c2_3kit_fem_se_model’ at line 24)

Second, when Stan finally sample, it report that there were many divergent transition after warm up. I already increased the adapt_delta to 0.99 but it does not improve sampling. Should I be worried about a potential bias during sampling? I look at the pairs and they seem quite good, but I do not know exactly what to look for in the pairs.
Warning messages:
1: There were 93794 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
*Runtime warnings and convergence problems *
2: Examine the pairs() plot to diagnose sampling problems

Third and final minor problem: I would like to change the model in order to allow the pt parameter (simplex vector of probability for the multinomial distribution), but multinomial distribution seem to not accept row vector as a distribution parameters. Is it possible to use a row of a matrix or something like that as a vector probability for a multinomial distribution? I solve this problem by write as many times as I wanted the “pt” parameters (for example tenfold with pt1, pt2, …, pt10 each of them being a vector of probability for the multinomial distribution) but the code for the model is much heavier and it is not very pretty…

Thank you very much in advance for your help.

Stan model


data {
	int<lower=0> J;   // Number of all types of possible test results. J=8 for three tests.
	int<lower=0> observed_frequency[J];	//number of each type of test result (T1,T2,T3), in order of (1,1,1), (1,1,0), ..., (0,0,0)
}

parameters {
	real<lower=0, upper=1> pi;	// prevalence
	real<lower=0, upper=1> c1;	// specificity of T1
	real<lower=0, upper=1> c2;	// specificity of T2
	real<lower=0, upper=1> c3;	// specificity of T3
	real<lower=1-c1, upper=1> s1;	// sensitivity of T1
	real<lower=1-c2, upper=1> s2;	// sensitivity of T2
	real<lower=1-c3, upper=1> s3;	// sensitivity of T3

// covS111 = conditional dependence term for result (T1=1, T2=1, T3=1 | D=1)
	real<lower=-s1*s2*s3, upper=fmin(s1, fmin(s2, s3))-s1*s2*s3> covS111;	
	real<lower=-(1-s1)*s2*s3, upper=fmin(1-s1, fmin(s2, s3))-(1-s1)*s2*s3> covS011;
	real<lower=-(1-s1)*(1-s2)*s3, upper=fmin(1-s1, fmin(1-s2, s3))-(1-s1)*(1-s2)*s3> covS001;
	real<lower=-(1-s1)*(1-s2)*(1-s3), upper=fmin(1-s1, fmin(1-s2, 1-s3))-(1-s1)*(1-s2)*(1-s3)> covS000;
// conditional dependence term for result (T1=1, T2=1, T3=1 | D=0)
  real<lower=-(1-c1)*(1-c2)*(1-c3), upper=fmin(1-c1, fmin(1-c2, 1-c3))-(1-c1)*(1-c2)*(1-c3)> covC111;
	real<lower=-c1*(1-c2)*(1-c3), upper=fmin(c1, fmin(1-c2, 1-c3))-c1*(1-c2)*(1-c3)> covC011;
	real<lower=-c1*c2*(1-c3), upper=fmin(c1, fmin(c2, 1-c3))-c1*c2*(1-c3)> covC001;
	real<lower=-c1*c2*c3, upper=fmin(c1, fmin(c2, c3))-c1*c2*c3> covC000;
	

}

transformed parameters {
	real<lower=-s1*(1-s2)*(1-s3), upper=fmin(s1, fmin(1-s2, 1-s3))-s1*(1-s2)*(1-s3)> covS100;
	real<lower=-s1*(1-s2)*s3, upper=fmin(s1, fmin(1-s2, s3))-s1*(1-s2)*s3> covS101;
	real<lower=-s1*s2*(1-s3), upper=fmin(s1, fmin(s2, 1-s3))-s1*s2*(1-s3)> covS110;
	real<lower=-(1-s1)*s2*(1-s3), upper=fmin(1-s1, fmin(s2, 1-s3))-(1-s1)*s2*(1-s3)> covS010;
	
	real<lower=-(1-c1)*c2*c3, upper=fmin(1-c1, fmin(c2, c3))-(1-c1)*c2*c3> covC100;
	real<lower=-(1-c1)*c2*(1-c3), upper=fmin(1-c1, fmin(c2, 1-c3))-(1-c1)*c2*(1-c3)> covC101;
	real<lower=-(1-c1)*(1-c2)*c3, upper=fmin(1-c1, fmin(1-c2, c3))-(1-c1)*(1-c2)*c3> covC110;
	real<lower=-c1*(1-c2)*c3, upper=fmin(c1, fmin(1-c2, c3))-c1*(1-c2)*c3> covC010;
	
	vector<lower=0, upper=1>[J] pt;	// joint probability of each type of possible test result

	real covST12;	// the pairwise conditional dependence between (T1=1|D=1) and (T2=1|D=1)
	real covST13;	// the pairwise conditional dependence between (T1=1|D=1) and (T3=1|D=1)
	real covST23;	// the pairwise conditional dependence between (T2=1|D=1) and (T3=1|D=1)
	real covCT12;	// the pairwise correlation between (T1=1|D=0) and (T2=1|D=0)
	real covCT13;	// the pairwise correlation between (T1=1|D=0) and (T2=3|D=0)
	real covCT23;	// the pairwise correlation between (T1=2|D=0) and (T2=3|D=0)

	// calculate the transformed conditional dependence terms
	covS100 = covS011 + covS111 - covS000;
	covS101 = -(covS001 + covS011 + covS111);
	covS110 = covS000 + covS001 - covS111;
	covS010 = -(covS000 + covS001 + covS011);
	covC100 = covC011 + covC111 - covC000;
	covC101 = -(covC001 + covC011 + covC111);
	covC110 = covC000 + covC001 - covC111;
	covC010 = -(covC000 + covC001 + covC011); 

	
	// calculate the pairwise conditional dependence terms
	covST12 = covS111 + covS110;
	covST13 = covS111 + covS101;
	covST23 = covS011 + covS111;
	
	covCT12 = covC111 + covC110;
	covCT13 = covC111 + covC101;
	covCT23 = covC011 + covC111;

	
	// pt[1]: probability for having result 111, pt[2] for 110, pt[8] for 000
	pt[1] = pi*((1-s1)*(1-s2)*(1-s3)+covS000)+(1-pi)*((c1)*c2*c3 + covC000); 
	pt[2] = pi*((1-s1)*(1-s2)*s3+covS001)+(1-pi)*(c1*(c2)*(1-c3) + covC001);
	pt[3] = pi*((1-s1)*s2*(1-s3)+covS010)+(1-pi)*(c1*(1-c2)*(c3) + covC010);
	pt[4] = pi*((1-s1)*s2*s3+covS011)+(1-pi)*((c1)*(1-c2)*(1-c3) + covC011);
	pt[5] = pi*(s1*(1-s2)*(1-s3)+covS100)+(1-pi)*((1-c1)*c2*c3 + covC100);
	pt[6] = pi*(s1*(1-s2)*s3+covS101)+(1-pi)*((1-c1)*(c2)*(1-c3) + covC101);
	pt[7] = pi*(s1*s2*(1-s3)+covS110)+(1-pi)*((1-c1)*(1-c2)*(c3) + covC110);
	pt[8] = pi*(s1*s2*s3+covS111)+(1-pi)*((1-c1)*(1-c2)*(1-c3) + covC111);
}

model {
	// priors:
	pi ~ beta(0.5,0.5);
	s1 ~ beta(10,1);	// 
	s2 ~ beta(10,1);
	s3 ~ beta(10,1);
	c1 ~ beta(10,1);  //
	c2 ~ beta(10,1);  // 
	c3 ~ beta(10,1);  //

	// distribution			
	observed_frequency ~ multinomial(pt);
}

generated quantities {
  real log_lik;
  log_lik = multinomial_lpmf(observed_frequency | pt);
}

R code for the launching model and init :
library(rstan)
data4stan ← list(J = 8,
observed_frequency = c(1000,50,150,12,20,5,10,100))

Initialisation

inits ← list(list(),list(),list(),list())
for (k in 1:4) {
s1 ← rbeta(1,10,0.5)
s2 ← rbeta(1,10,0.5)
s3 ← rbeta(1,10,0.5)
c1 ← rbeta(1,10,0.5)
c2 ← rbeta(1,10,0.5)
c3 ← rbeta(1,10,0.5)
covS111 ← 0
covS011 ← 0
covS001 ← 0
covS000 ← 0
covS100 ← 0
covS101 ← 0
covS110 ← 0
covS010 ← 0
inits[[k]] ← list(s1 = s1, s2 = s2, s3 = s3,c1 = c1, c2 = c2, c3 = c3, covS111 = covS111, covS011 = covS011, covS001 = covS001, covS000 = covS000,covS100 = covS100, covS101 =covS101, covS110 = covS110, covS010 = covS010)
}

sampling

fit ← stan(file = ‘3kit_fem_se_model.stan’, data = data4stan, init = inits,control = list(adapt_delta = 0.99,max_treedepth = 15),iter = 50000)

And part of the R output

Chain 4: Rejecting initial value:
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: Exception: model1d0861eb27c2_3kit_fem_se_model_namespace::write_array: covS100 is 0.309272, but must be less than or equal to 0.190704 (in ‘model1d0861eb27c2_3kit_fem_se_model’ at line 24)

Chain 4: Rejecting initial value:
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: Exception: model1d0861eb27c2_3kit_fem_se_model_namespace::write_array: covS100 is -0.0675094, but must be greater than or equal to -0.0271819 (in ‘model1d0861eb27c2_3kit_fem_se_model’ at line 24)

Chain 4: Rejecting initial value:
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: Exception: model1d0861eb27c2_3kit_fem_se_model_namespace::write_array: covS100 is -0.365994, but must be greater than or equal to -0.00929893 (in ‘model1d0861eb27c2_3kit_fem_se_model’ at line 24)

Chain 4: Rejecting initial value:
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: Exception: model1d0861eb27c2_3kit_fem_se_model_namespace::write_array: covS100 is -0.390722, but must be greater than or equal to -0.0181616 (in ‘model1d0861eb27c2_3kit_fem_se_model’ at line 24)

Chain 4:
Chain 4: Gradient evaluation took 0 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 50000 [ 0%] (Warmup)
Chain 4: Iteration: 5000 / 50000 [ 10%] (Warmup)
Chain 4: Iteration: 10000 / 50000 [ 20%] (Warmup)
Chain 4: Iteration: 15000 / 50000 [ 30%] (Warmup)
Chain 4: Iteration: 20000 / 50000 [ 40%] (Warmup)
Chain 4: Iteration: 25000 / 50000 [ 50%] (Warmup)
Chain 4: Iteration: 25001 / 50000 [ 50%] (Sampling)
Chain 4: Iteration: 30000 / 50000 [ 60%] (Sampling)
Chain 4: Iteration: 35000 / 50000 [ 70%] (Sampling)
Chain 4: Iteration: 40000 / 50000 [ 80%] (Sampling)
Chain 4: Iteration: 45000 / 50000 [ 90%] (Sampling)
Chain 4: Iteration: 50000 / 50000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 114.509 seconds (Warm-up)
Chain 4: 116.137 seconds (Sampling)
Chain 4: 230.646 seconds (Total)
Chain 4:
Warning messages:
1: There were 93794 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems

get_inits(fit) # only shown the first one
[[1]]
[[1]]$pi
[1] 0.3049692
[[1]]$c1
[1] 0.9998908
[[1]]$c2
[1] 0.9967834
[[1]]$c3
[1] 0.9950182
[[1]]$s1
[1] 0.8352106
[[1]]$s2
[1] 0.9600342
[[1]]$s3
[1] 0.9797707
[[1]]$covS111
[1] -8.722165e-17
[[1]]$covS011
[1] -2.380066e-17
[[1]]$covS001
[1] -8.139976e-19
[[1]]$covS000
[1] 2.271552e-18
[[1]]$covS100
[1] -1.132939e-16
[[1]]$covS101
[1] 1.118363e-16
[[1]]$covS110
[1] 8.86792e-17
[[1]]$covS010
[1] 2.23431e-17
[[1]]$pt
[1] 0.6893091004 0.0054188855 0.0032002879 0.0472823143 0.0002811954 0.0099742740 0.0049469999 0.2395869425
[[1]]$covST12
[1] 1.457555e-18
[[1]]$covST13
[1] 2.461465e-17
[[1]]$covST23
[1] -1.110223e-16
[[1]]$log_lik
[1] -661.8647

RE: divergences, the short answer is yes, you should be very worried.

For basic information about divergences see: https://mc-stan.org/misc/warnings.html.

For more detailed information see: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html

Thank you very much for your answer nerutenbeck

I take me some times read the documentation you send me and it really help me to understand a bit better the divergent transition.

I think the major problem of my model are the high constraint on the covS and covC parameters for example:


real<lower=-s1*s2*s3, upper=fmin(s1, fmin(s2, s3))-s1*s2*s3> covS111;

In the eight-school case study, Michael Betancourt said:

"These infrequent divergences do not seem concentrate anywhere in parameter space,

which is indicative of the divergences being false positives. As expected of false positives, we can remove the divergences entirely by decreasing the step size"

In my case, even if almost half of all the transitions are divergent, this divergent transition do not seem to concentrate anywhere in parameter space (see pairs graph for illustration). However, increasing the step size does not help at all in reducing the number of divergent transition.

Do you think that these divergent transitions could only be due to the high constraint on the covS and covC parameters? If, yes, what do you think about the potential bias induced by this phenomena?

My admittedly very basic understanding of divergences is that sometimes if you have a very few then increasing the step size via adapt_delta can help, but that it won’t save you in the case of severe model misspecification and/or in sampling from model spaces that HMC can’t explore fully, which seems to be your case.

Again, I don’t claim any kind of real expertise, but if I were experiencing the kind of divergence issues that you are showing, I would think that there are significant problems with the specification of my model either through how it is written or indicating a need to reparameterize via transformations, though I don’t think I understand your model well enough to give you any kind of concrete advise there.

Thank’s again for answering.
The model I used is a model which was published in stat in medicine two years ago.
I’ll try to re-write it in order to relax the constraint on parameters and I will see if divergent transition are still present.

Do you have any idea about the initialization issue or about the modelling problem of the multinomial parameters?

I’m not sure what you mean by “high” constraint, but I noticed that you are using fmin in some of your constraints. That can be a problem. As pointed out in the Stan functions manual: " Warning: These functions can seriously hinder sampling and optimization efficiency for gradient-based methods (e.g., NUTS, HMC, BFGS) if applied to parameters (including transformed parameters and local variables in the transformed parameters or model block). The problem is that they break gradients due to discontinuities coupled with zero gradients elsewhere."

1 Like

tank’s for your answer jjramsey
For illustrate the high constraint see for example this stan code :

real<lower=-(1-s1)*(1-s2)*(1-s3), upper=fmin(1-s1, fmin(1-s2, 1-s3))-(1-s1)*(1-s2)*(1-s3)> covS000;

As S1, S2 and S3 are fit around 0.9, covS000 approximately belong to [-0.001; 0.099]. Which is not, to my opinion, a very large interval…

I understand the potential issue with fmin, but I only use it in the constraint. I do not use it in the expression of the parameters or in the model definition. Could it still be a problem?

Yes. Due to how Stan deals with constraints, the expression fmin(1-s1, fmin(1-s2, 1-s3))-(1-s1)*(1-s2)*(1-s3) still becomes a part of the log-posterior that’s calculated every time the model is evaluated. Since s1, s2, and s3 are parameters, not constants, that expression can introduce discontinuities in the gradient of the log-posterior.

1 Like

It seems that stan does not take into acount the constraint of the transformed parameters when sampling the parameters in the block “parameters” whereas constraint one these parameters and transformed parameters are linked.
For example let’s look at the parameters covS111 and covS100 :
covS111 is declared in the parameters block like that :

real<lower=-s1*s2*s3, upper=fmin(s1, fmin(s2, s3))-s1*s2*s3> covS111

then covS100 is declared and expressed in the transformed parameters block like that :

real<lower=-s1*(1-s2)*(1-s3), upper=fmin(s1, fmin(1-s2, 1-s3))-s1*(1-s2)*(1-s3)> covS100;
	covS100 = covS011 + covS111 - covS000;

If covS111 while beeing sampled in the interval <lower=-s1s2s3, upper=fmin(s1, fmin(s2, s3))-s1s2s3> is too high for the constraint on covS100, it can force covS100 to go out of it’s own constraint intervalle.
I noticed that by trying to include the constraint of the transformed parameters in the constraint of the parameters in the parameters block. By doing this, i managed to reduced the number of divergent transition by almost 2/3. But there is still a lot of divergent transition (approximatly 1/8 of all transitions) and I get a new error message during the initialisation step which I do not undersand :
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: lub_constrain: lb is -0.00436392, but must be less than -0.0565556
Chain 4: Rejecting initial value:
Chain 4: Error evaluating the log probability at the initial value.
Chain 4: lub_constrain: lb is -0.0260213, but must be less than -0.0664385

does this message is talking to someone? I do not understand especialy what is the lub_constraint an the “lb” thing which needs to be les than -0.0664385.

I think lub_constrain is a Stan function that attempts to make sure parameter values are within their constraints, and lb is short for “lower bound”. One possible cause for the error message is that you may not have set your constraints in the transformed parameters block correctly.