Two questions: ① Rejecting initial value but still sampling; ② Divergent transitions

Hello Stan users and experts,

I have received a lot of helpful advice from the Stan community before, and have roughly finished my model. But when I fit the model using rstan, the following warning appears:

Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: validate transformed params: proc[i_0__][i_1__][2] is 12.2138, but must be less than or equal to 6.90776 (in ‘model30047c5d3740_1021’ at line 31)

My model is as follows; the lines related to this warning are marked, and my questions are below the model.

data{
  int I;
  int T;
  int J;
  int y[T,J,I];
  real log_max;
}

parameters{
  real mu_U;
  real <lower=0> tau_U;
  vector[I-1] eta_U[J];
  real mu_intra;
  real <lower=0> tau_intra;
  vector[I-1] eta_intra[J];
  real mu_inter;
  real <lower=0> tau_inter;
  vector[I-1] eta_inter[J];
  
  vector<upper=log_max>[I-1]log_A[T,J];  
  vector<lower = 0>[I - 1] sigma[J];      
}

transformed parameters{
  //set up parameters
  vector[I - 1] U[J];                   
  vector[I - 1] inter[J];              
  vector[I - 1] intra[J];
  vector<upper=log_max>[I]log_A_3[T,J];  
  vector<lower=0,upper=1>[I]theta[T,J];  
  vector<upper=log_max>[I-1]proc[T-1,J]; //**mean of the process model.**       
  matrix[2, 2] B[J];
  real <upper=log_max> summ[T,J];
  
  //for dynamics parameters
  for (j in 1:J){
    U[j]=mu_U+tau_U*eta_U[j];
    inter[j]=mu_inter+tau_inter*eta_inter[j];
    intra[j]=mu_intra+tau_intra*eta_intra[j];    
    B[j, 1, 1] = intra[j, 1];
    B[j, 2, 2] = intra[j, 2];
    B[j, 1, 2] = inter[j, 1];
    B[j, 2, 1] = inter[j, 2];
  }
  
  //process term
  for (t in 2:T){
    for (j in 1:J){
      proc[t-1,j]=U[j]+B[j]*log_A[t-1,j];//**equation of the mean**
    }
  }
      
  //for coverage 
  for (t in 1:T){
    for (j in 1:J){
      summ[t,j]=log_sum_exp(log_A[t,j]);
      log_A_3[t,j,1]=log_A[t,j,1];
      log_A_3[t,j,2]=log_A[t,j,2];
      log_A_3[t,j,3]=log_diff_exp(log_max,summ[t,j]);
      theta[t,j]=softmax(log_A_3[t,j]);
    }
  }

  
}

model{
  //priors
  mu_U~normal(0,5);
  mu_inter~normal(0,5);
  mu_intra~normal(0,5);
  tau_U~normal(0,5);
  tau_intra~normal(0,5);
  tau_inter~normal(0,5);
  
  for (j in 1:J){
    eta_U[j]~normal(0,5);
    eta_intra[j]~normal(0,5);
    eta_inter[j]~normal(0,5);
    sigma[j,1] ~ cauchy(0, 5);
    sigma[j,2] ~ cauchy(0, 5);
  }
  
  //process model
//**process model regarding proc**
  for (t in 2:T){
    for (j in 1:J){
     target += multi_normal_lpdf(log_A[t, j] |proc[t-1,j], diag_matrix(sigma[j]));
    }
  }
  
    // For t = 1
  for (j in 1:J) {
    target += uniform_lpdf(log_A[1, j] | 0, log_max);
  }
  
  //observational model
  for (t in 1:T) {
    for (j in 1:J) {
      y[t, j] ~ multinomial(theta[t, j]);
    }
  }
}


The first question is:
What caused this warning, and why does Stan still warm up and sample after it appears?

I imitated the 8-schools model when writing my model, especially this part:

U[j]=mu_U+tau_U*eta_U[j];
inter[j]=mu_inter+tau_inter*eta_inter[j];
intra[j]=mu_intra+tau_intra*eta_intra[j]; 

I think I have given a reasonable bound for each parameter (especially for log_A and the transformed parameters related to it), but I do not know whether these bounds are the cause of the warning. I am also not confident that I have chosen the right priors.

And the second question:

Before this model, I wrote another model using a non-centered parameterization. Although that model could be fitted, there were 151 divergent transitions after warmup (iter=1000). Does this proportion of divergent transitions indicate that the results are completely unreliable, or are they roughly usable?

Because of the divergent transitions in the last model, I recently wrote the model above, but the results seem even worse.

I hope someone can help me, and thank you in advance.


YAO

The warning was triggered by the computed value of proc not satisfying its constraint. Unlike parameters, which are fully controlled by the sampler and satisfy their constraints by construction, the bounds on transformed parameters are just a check, which tends to be useful for catching bugs in your code. So for some parameter values, U[j]+B[j]*log_A[t-1,j] can be larger than log_max. This is not surprising: if I read the model correctly, there is no reason why U and B should avoid values that push proc beyond its bound.

If the constraint check fails, the sample is rejected (i.e. treated as having zero posterior density) and the sampler tries to continue.
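The difference between the two kinds of bounds can be illustrated with a minimal sketch (hypothetical parameter names, not from your model):

```stan
parameters {
  real<upper=0> a;  // enforced by construction: the sampler works on an
                    // unconstrained scale and transforms back, so a <= 0 always holds
}
transformed parameters {
  // only checked, never enforced: whenever a > -1, b violates its bound,
  // the proposal is rejected, and you see a warning like the one above
  real<upper=0> b = a + 1;
}
model {
  a ~ normal(0, 1);
}
```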

Regarding divergent transitions, see Divergent transitions - a primer

As general advice: this looks like quite a complex model. I would suggest you start with a simpler version (e.g. without the linear predictors for everything) and get that working first. Debugging such complex models is IMHO almost impossible, even for very experienced Stan users.

Hope that helps!


Thanks for your advice!
I read the link and tried to debug my code step by step.
I am glad to have made substantial progress, but at the same time there are some new problems.

First, I simplified my code as much as I could.

data{
  int I;// here I=3
  int T;// here T=18
  int J;// here J=1, but necessary.
  int y[T,J,I];
  real log_max;
}

parameters{
  vector[I-1] A[T,J];
  real r1;
  real r2;
  real a11;
  real a12;
  real a21;
  real a22;
  real<lower=0> sigma1; // scale parameters need a lower bound of 0
  real<lower=0> sigma2;
}

transformed parameters{
  vector[I] A_3[T,J];
  real<upper=log_max> sumA[T,J];
  for (t in 1:T){
    for (j in 1:J){
  sumA[t,j]=log_sum_exp(A[t,j,1],A[t,j,2]);
  A_3[t,j,1]=A[t,j,1];
  A_3[t,j,2]=A[t,j,2];
  A_3[t,j,3]=log_diff_exp(log_max,sumA[t,j]);
    }
  }
}

model{
  r1~normal(0,10);
  r2~normal(0,10);
  a11~normal(0,10);
  a12~normal(0,10);
  a21~normal(0,10);
  a22~normal(0,10);
  sigma1~cauchy(0,10);
  sigma2~cauchy(0,10);

  for (t in 1:T){
    for (j in 1:J){
      y[t,j]~multinomial(softmax(A_3[t,j]));
    }
  }
  for (t in 2:T){
    for (j in 1:J){
      target+=normal_lpdf(A_3[t,j,1]|r1+a11*A_3[t-1,j,1]+a12*A_3[t-1,j,2], sigma1);
      target+=normal_lpdf(A_3[t,j,2]|r2+a22*A_3[t-1,j,2]+a21*A_3[t-1,j,1], sigma2);
    }
  }
}

Then I checked the observation model separately, using each multinomial set.
I am not sure my description is right; I mean checking each

y[ , , ] ~ multinomial(softmax(A_3[ , , ]))

one by one.
Unfortunately, some of my observational data cannot even get past divergent transitions with the observational model alone. Does this mean that these failing data are unusable? Most of them include many 0 terms.

Next, I generated a simulated dataset in R, drawing each parameter from a known, specified distribution (prior?). This dataset passed the observational model successfully and quickly (my own data of the same size is very slow).

Then I added the process model and priors as in the code above. On the first try, divergent transitions emerged again. Because "r" and "a" are free, they cannot be given reasonable hard bounds. According to the pairs() plot, the posteriors of sigma1 and sigma2 are not bell-shaped; I tried adjusting the priors many times, to no avail. Because the true distribution is known, I fixed the standard deviation of the process model to 0.1:

  for (t in 2:T){
    for (j in 1:J){
      target+=normal_lpdf(A_3[t,j,1]|r1+a11*A_3[t-1,j,1]+a12*A_3[t-1,j,2],0.1);
      target+=normal_lpdf(A_3[t,j,2]|r2+a22*A_3[t-1,j,2]+a21*A_3[t-1,j,1],0.1);
    }
  }

It worked! No divergent transitions, and Rhat/ESS have good values!
But this trick is not useful for a formal estimate; how should I set the prior on the variance?

There is one last digression.
My computer often crashes while running Stan.
Is it a matter of computer configuration or system settings? Or do I need to clean up temporary files in R (I do not know where they are)? My computer configuration is as follows.

Win 10 Home X64
Core i7, 9th gen.
Memory 16 GB.

Thank you again for your help!
Sincerely,
YAO

Hi, sorry for taking so long to respond. I unfortunately don't currently have time to delve deeply into this, so here are at least some surface observations:

What do you mean by "crashes"? If it is the whole computer, including the operating system, then I would guess this is not related to Stan. If it is the R session that crashes, that might be a bug in rstan; it usually helps to run just a single chain, in which case you might see some output that describes an error.

This usually means that your model is not a good representation of your data, but finding exactly what part of the model is a problem can be hard.

Did you manage to make any progress in the meantime?

Thanks for your reply!

I have finished my estimation in JAGS using the same model.
It looks like the issue is that the multinomial distribution in Stan does not support data including 0, because when the vector contains a value of 0, even the simplest model will not produce a result without divergent transitions.

data{
  int y[3]; // y<-c(0,1,2)
}

parameters{
  simplex[3] theta;
}

model{
  y~multinomial(theta);
}

I am not sure what the real cause is, but thanks for your advice!
And if possible, I would like to know why Stan cannot smoothly estimate a multinomial containing a value of 0.

Thanks again for your advice, your advice really helps me a lot.


That certainly seems weird. I just tried running the model you posted and I see no divergences or other issues, both with some zeroes and with all zeroes. My code is:

cc <- "data {
int y[3]; // y<-c(0,1,2)
}

parameters {
simplex[3] theta;
}

model{
y~multinomial (theta);
}
"


# ff <- tempfile()
# ff 
mm <- rstan::stan_model(model_code = cc)

rstan::sampling(mm, data = list(y = c(0,1,2)))
rstan::sampling(mm, data = list(y = c(0,0,0)))

If you use exactly this code, do you see issues? (I presume you are using rstan; if not, can you run the same model via the interface you are using?) Also, what version of Stan do you use?

Yes, I copied your code and, without library(rstan), it worked.
However, when I load rstan and use this R code:

a<-stan(file = "00.stan", data = list(y = c(0,1,2)),
  iter = 1000, chains = 1,
  control = list(adapt_delta = 0.8, max_treedepth = 10))

the result is

SAMPLING FOR MODEL ‘00’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 1.743 seconds (Warm-up)
Chain 1: 2.305 seconds (Sampling)
Chain 1: 4.048 seconds (Total)
Chain 1:
Warning messages:
1: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
‘-E’ not found
2: There were 410 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 1.59, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

My R is 3.6.3,
rstan is

rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
Warning messages:
1:  package ‘rstan’ was built under R version 3.6.3
2:  package ‘StanHeaders’ was built under R version 3.6.3
3:  package ‘ggplot2’ was built under R version 3.6.3

By the way, I would like to know what this warning means:

1: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
‘-E’ not found

It feels like this problem is about to be solved.

Just to be completely clear: does 00.stan contain the same code as I posted, or is it your model? (If so, could you please share the file?)

00.stan (161 Bytes)

Yep, it is my fault. I added a softmax() in transformed parameters.
Sorry for my mistake. I tried the same code using the previous R code, and it worked.

But because my model needs a softmax function, I tested the simplest version as 00.stan.

Is there any solution for this uploaded model?

Oh, I see! The problem is this: softmax basically removes one degree of freedom, i.e. a multinomial with N classes is fully described by N-1 parameters (because the probabilities need to sum to 1). This is actually what the simplex type does under the hood: simplex[3] is represented by two parameters on the unconstrained scale. So to make the model work with softmax, you can either fix one of the softmax inputs to 0 (this is the typical way to do multinomial regression), or you need to constrain the parameters to sum to zero (this is a bit more tricky; some discussion at Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain).
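For the sum-to-zero option, a minimal hard-constraint sketch (untested, using the same tiny data as your example) could look like:

```stan
data {
  int y[3];
}
parameters {
  vector[2] X_raw;
}
transformed parameters {
  // the last element is the negative sum, so sum(X) == 0 by construction
  vector[3] X = append_row(X_raw, -sum(X_raw));
}
model {
  y ~ multinomial(softmax(X));
}
```

The linked thread discusses the soft-constraint variant and suitable priors in more detail.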

Does that make sense?

I am sorry; probably because I don't have enough coding knowledge, I cannot understand this operation ↓↓↓

So to make the model work with softmax, you can either fix one of the softmax inputs to 0 (this is the typical way to do multinomial regression) or you need to constraint the parameters to sum to zero (this is a bit more tricky

Could you give me a code example?

I read the link you pasted and tried to use the hard constraint. It worked, with no divergent transitions. But I am not sure whether it is a good form for my model.

(This may be a digression)
So I wanted to use WAIC to compare the soft version estimated by JAGS and the hard version by Stan. However, I used waic() from loo, and the result for the JAGS model was

Warning message:
693 (93.9%) p_waic estimates greater than 0.4. We recommend trying loo instead.

so I stopped this attempt.

I am still most interested in how to solve the problem with softmax~

So the math is explained in https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions

The code samples for the soft and hard constraints are in the link I posted. A code sample for fixing one of the inputs, derived from your model, is below (I didn't actually run the code, but I hope it is OK):

data {
int y[3]; 
}

parameters {
vector[2] X_raw;
}

transformed parameters{
simplex[3] theta;
vector[3] X;
X[1:2] = X_raw;
X[3] = 0;
theta=softmax(X);
}

model{
y~multinomial (theta);
}

Also note that in newer versions of Stan you can gain some speed and numerical stability by not computing the intermediate softmax yourself, and instead delegating to:

y ~ multinomial_logit(X);
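For example, the reference-category model above could then be written as follows (a sketch, not tested):

```stan
data {
  int y[3];
}
parameters {
  vector[2] X_raw;
}
transformed parameters {
  vector[3] X = append_row(X_raw, 0); // reference category fixed to 0
}
model {
  y ~ multinomial_logit(X); // the softmax is applied internally
}
```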

That's a good question, but unfortunately there is no simple answer. I don't think the exact way you formulate the multinomial regression usually matters a lot (just that your probabilities need to sum to one, hence the need for the constraint or the reference category). You can always do some posterior and prior predictive checks and see if your formulation leads to any problems (e.g. as discussed at https://arxiv.org/abs/2011.01808).