Divergence or Convergence based on rhat using different datasets

[Please include Stan program and accompanying data if possible]

  #####  stan file
  
                      
                  
  stanmodelcode="data {
	
  int<lower=1> N;        // Number of points from 2 to track length-1 of the i'th for a cluster. 
          
  int<lower=1> K;          // number of states

  int<lower=1> J;          // number of cells in the cluster.
      
  vector[N] angular_change_cluster;    // an array containning the angular change of all trajectories.

  vector[N] radius_cluster;     // an array containning the radius (second-to-the-end) of all trajectories.

  vector[N+J] number_of_neighbor_cluster;  // we will use it as a covariate to determine two transition matrices.
       
  vector<lower=0>[K] alpha_prior;   // dirichlet distribution parameters

  vector[K] alpha_P_1[K];   // dirichlet distribution parameters for P_1

  vector[K] alpha_P_2[K];   // dirichlet distribution parameters for P_2
  
  int<lower=1> counter_n[J];  // where to start the markovian process
  
  int<lower=1> b_final_f[J];  // where to end the markovian process
  
  int<lower=1>  a_initial_f[J];  // first position of each trajectory
  

      
}


parameters {


 
    simplex[K] prior_pi;    // k is the number of observed models we consider.
    
    simplex[K] P[2, K];    //  transition matrix 
     
       
    real<lower=-1.5, upper=1.5> alpha_non_per_radius;

    real<lower=0, upper=3.5> beta_non_per_radius;

    real<lower=-1.5, upper=1.5> alpha_per_radius;

    real<lower=0, upper=3.5> beta_per_radius;
    
    real<lower=-1.5, upper=1.5> alpha_hesi_radius;

    real<lower=0, upper=3.5> beta_hesi_radius;  
       
    real<lower=0, upper=4> alpha_per_angle;  
      
    real<lower=0, upper=10> beta_per_angle;
        
    real<lower=0, upper=10> alpha_non_per_angle;
    
    real<lower=0, upper=4> beta_non_per_angle;
    
    real<lower=0, upper=10> alpha_hesi_angle;
    




}



model {
	  

  matrix[K, K] P_transformed[K];   //saving P as a matrix format.

  vector[N+J] prior_pi_alternative[K];
  
  vector[N] prior_pi_modeling[K];
   
  row_vector[N] models[K];    // coordinates 
  
  row_vector[K] a;

  row_vector[K] b;
  
  int t;
  
  vector[K] transfer;
  
  
  
  
 	
  for(k in 1:K){

           target+= dirichlet_lpdf(P[1, k]| alpha_P_1[k]);
  
  }

  
  
  for(k in 1:K){
  
           target+= dirichlet_lpdf(P[2, k]| alpha_P_2[k]);

   }

  
     
  
     

  for(l in 1:2){
  
   for(m in 1:K){
  	
  	 for(n in 1:K){
  		
  		 P_transformed[l, m, n] = P[l, m, n];
  		
  	 }
  	
    }    	

   }
  
  
     target+= dirichlet_lpdf(prior_pi| alpha_prior);   
  
	
   for(j in 1:J){	

     # target+= normal_lpdf(alpha_r[j]| b_0, sigma_alpha);
  	
     # target+= chi_square_lpdf(radius[a_initial_f[j]:b_final_f[j]]| alpha_r[j]); 

     
     for(k in 1:K){

           prior_pi_alternative[k, a_initial_f[j]] = prior_pi[k];


     }
  	 

     for(l in counter_n[j]: b_final_f[j]){


           for(k in 1:K){

                  b[k] = prior_pi_alternative[k, (l-1)];

           }



     if(number_of_neighbor_cluster[l-1]<=0){
       
           a = b * P_transformed[1];
          
     }


     if(number_of_neighbor_cluster[l-1]>0){
       
           a = b * P_transformed[2];
          
     }



     for(k in 1:K){
           
           prior_pi_alternative[k, l] = a[k];

     }

  	
  }

 } 
 
 t = 1;
 
 for(j in 1:J){
 	
 	for(l in counter_n[j]: b_final_f[j]){
 		
 		for(k in 1:K){
 		
 		      prior_pi_modeling[k, t] = prior_pi_alternative[k, l];
 		
 				
} 	

    t = t+1;

}
 	
 	
 }
 

  


##   Introducing no priors means we consider flat prior for the parameter. Note that the domain of the flat prior comes from the bound we have already defined for parameters.
   



   for(n in 1:N){
   	
   	          	transfer[1] = log(prior_pi_modeling[1, n]) +  beta_lpdf( (angular_change_cluster[n]/pi())| alpha_non_per_angle, beta_non_per_angle) + lognormal_lpdf(radius_cluster[n] | alpha_non_per_radius, beta_non_per_radius);  
   	          	
   	          	transfer[2] = log(prior_pi_modeling[2, n]) +  beta_lpdf((angular_change_cluster[n]/pi())| alpha_hesi_angle, alpha_hesi_angle) + lognormal_lpdf(radius_cluster[n] | alpha_hesi_radius, beta_hesi_radius); 
   	          	
   	          	transfer[3] = log(prior_pi_modeling[3, n]) +  beta_lpdf((angular_change_cluster[n]/pi())| alpha_per_angle, beta_per_angle) + lognormal_lpdf(radius_cluster[n] | alpha_per_radius, beta_per_radius);  	          	
   	          

                target += log_sum_exp( transfer );

   }


                       
  
 }"

Hello STAN group,

Thanks in advance for your support. I have some sets of data to run the above program with. I already use the following command to run the program

                fit <- stan(model_code = stanmodelcode, model_name = "example",
            data = dat.cluster.1, iter = 200000, chains = 8, verbose = TRUE, control = list(adapt_delta = 0.9999, stepsize = 0.001, max_treedepth =15), cores=8)

My problem is the program converges for some of data sets, but diverges for others where I get large rhat for them. My question is whether the increase of iterations can help those data sets converge? Please note that it takes 10 days to get the program run. It might be worth mentioning that I got no error while running the program. Also, I did not play with some parameters like adapt_delta, stepsize, max_treedepth.

Regards,
Elaheh

[edit: escaped program, but didn’t clean up spacing]

iter = 200000, chains = 8

Without know anything else about the model, this is almost definitely way too many iterations. Running for more iterations probably isn’t going to help you. The defaults (2000 iterations, 4 chains) should give you something interesting. If it’s taking this long to get something out of the model, there’s almost certainly something wrong with the model itself.

It’s true that a model can behave quite differently on different data, but that probably means that you need to use different models for the two bits of data.

Check out: https://github.com/stan-dev/stan/wiki/Stan-Best-Practices

Is there a small version of this model you can start with? It’s almost always best to build up these models gradually, checking everything (posterior pairplots, posterior predictives if you have them, neff/Rhat) along the way.

1 Like

Also these settings:

control = list(adapt_delta = 0.9999, stepsize = 0.001, max_treedepth =15)

Making the adapt delta really close to one like that and making the treedepth very deep are things you do when you have no other options. First step is to try all the reparameterizations you can think of.

Thanks for your prompt reply. I have already checked the model many times… It seems correct to me… I also reparametrized it couple of times… The data I work with are from the same phenomenon which means they have the same format. Please note that the data is big… Can it make a difference?

Regards,
Elaheh

So when I’m saying the model has a problem, I’m not talking about a typo or a mistake coding. Usually, if Stan doesn’t sample a model efficiently, there’s some underlying mathematical problem in the model, and it’s worth figuring out.

Honestly, if someone just gave you this model, there’s very little reason to trust it – even if it’s published somewhere or is what everyone uses. A lot of models are just bad. One of Stan’s ways of diagnosing a model is that it takes forever to run it :D.

The most common thing to look for are unidentifiabilities in parameters. An example of this is if you had a problem like, ‘1 = a + b, solve for a and b’. There’s a whole line of solutions, not just one. It’s very easy to accidentally code these into your models. Look at your model in Shinystan or use the R pairplots. Look for where parameters are tightly correlated in the samples. Anything like this is bad – it’s difficult for Stan to explore these places.

Really all you can do to start with the smallest version of the model you have with a small bit of data, and build up from there. The adapt_delta and max_treedepth are about the only knobs you get with NUTS, and they’re kindof last-resort things.

The volume of data can make a difference. You might have to make prior assumptions for small data to avoid parameters going crazy that aren’t necessary if you just jam a ton of data in your model. However, big data might reveal misspecified pieces of your model you wouldn’t have noticed with only a little data. But this is 2nd order stuff. First is get the model happily sampling in Stan.

4 Likes

Also just because data is formatted the same way and goes into the same models, doesn’t mean the posteriors look the same! It’s hard to separate any of these things. It’s all very tangled haha. That’s part of the fun.

2 Likes

Thanks for the time you spent on this. The problem might be because of the prior I already implemented. I made some changes following your comments to let a wider range for parameters. The other possibility is the model I use. Anyway, it seems I need to spend some time on it…

Regards,
Elaheh

I would look at your priors—those interval-constrained priors can cause probability mass to pile up at the boundaries, which go to plus or minus infinity on the unconstrained scale where Stan samples. They will also push the estimated posterior means away from the boundary (creating bias if the broader prior was more accurate).

Additionally, the posterior has to be proper. So flat priors without upper and lower bounds have to be combined with data to ensure that the posterior is proper.

Youc an also vectorize most of these loops, e.g., with:

for(k in 1:K){
  prior_pi_alternative[k, a_initial_f[j]] = prior_pi[k];

being

prior_pi_alternative[ , a_initial_f[j] = prior_pi;

Won’t help with efficiency, but it makes the code easier to read and shorter.

1 Like

Hi Bob,

Thanks for your comments. I implemented some changes and sent the programs to run on server. Please find the below for the new model.

stan file
  stanmodelcode="data {
	
  int<lower=1> N;        // Number of points from 2 to track length-1 of the i'th for a cluster. 
          
  int<lower=1> K;          // number of states

  int<lower=1> J;          // number of cells in the cluster.
      
  vector[N] angular_change_cluster;    // an array containning the angular change of all trajectories.

  vector[N] radius_cluster;     // an array containning the radius (second-to-the-end) of all trajectories.

  vector[N+J] number_of_neighbor_cluster;  // we will use it as a covariate to determine two transition matrices.
       
  vector<lower=0>[K] alpha_prior;   // dirichlet distribution parameters

  vector[K] alpha_P_1[K];   // dirichlet distribution parameters for P_1

  vector[K] alpha_P_2[K];   // dirichlet distribution parameters for P_2
  
  int<lower=1> counter_n[J];  // where to start the markovian process
  
  int<lower=1> b_final_f[J];  // where to end the markovian process
  
  int<lower=1>  a_initial_f[J];  // first position of each trajectory
 
}


parameters {

    simplex[K] prior_pi;    // k is the number of observed models we consider.
    
    simplex[K] P[2, K];    //  transition matrix 
     
    real<lower=-5, upper=5> alpha_non_per_radius;

    real<lower=0, upper=5> beta_non_per_radius;

    real<lower=-5, upper=5> alpha_per_radius;

    real<lower=0, upper=5> beta_per_radius;
   
    real<lower=0, upper=4> alpha_per_angle;  
      
    real<lower=0, upper=10> beta_per_angle;
        
    real<lower=0, upper=10> alpha_non_per_angle;
    
    real<lower=0, upper=4> beta_non_per_angle;
   
   
}

model {
	  

  matrix[K, K] P_transformed[K];   //saving P as a matrix format.

  vector[N+J] prior_pi_alternative[K];
  
  vector[N] prior_pi_modeling[K];
   
  row_vector[N] models[K];    // coordinates 
  
  row_vector[K] a;

  row_vector[K] b;
  
  int t;
  
  vector[K] transfer;
  
  
  
  
 	
  for(k in 1:K){

           target+= dirichlet_lpdf(P[1, k]| alpha_P_1[k]);
  
  }

  
  
  for(k in 1:K){
  
           target+= dirichlet_lpdf(P[2, k]| alpha_P_2[k]);

   }

  
     
  
     

  for(l in 1:2){
  
   for(m in 1:K){
  	
  	 for(n in 1:K){
  		
  		 P_transformed[l, m, n] = P[l, m, n];
  		
  	 }
  	
    }    	

   }
  
  
     target+= dirichlet_lpdf(prior_pi| alpha_prior);   
  
	
   for(j in 1:J){	

     # target+= normal_lpdf(alpha_r[j]| b_0, sigma_alpha);
  	
     # target+= chi_square_lpdf(radius[a_initial_f[j]:b_final_f[j]]| alpha_r[j]); 

     
     for(k in 1:K){

           prior_pi_alternative[k, a_initial_f[j]] = prior_pi[k];


     }
  	 

     for(l in counter_n[j]: b_final_f[j]){


           for(k in 1:K){

                  b[k] = prior_pi_alternative[k, (l-1)];

           }



     if(number_of_neighbor_cluster[l-1]<=0){
       
           a = b * P_transformed[1];
          
     }


     if(number_of_neighbor_cluster[l-1]>0){
       
           a = b * P_transformed[2];
          
     }



     for(k in 1:K){
           
           prior_pi_alternative[k, l] = a[k];

     }

  	
  }

 } 
 
 t = 1;
 
 for(j in 1:J){
 	
 	for(l in counter_n[j]: b_final_f[j]){
 		
 		for(k in 1:K){
 		
 		      prior_pi_modeling[k, t] = prior_pi_alternative[k, l];
 		
 				
} 	

    t = t+1;

}
 	
 	
 }
 

##   Introducing no priors means we consider flat prior for the parameter. Note that the domain of the flat prior comes from the bound we have already defined for parameters.
   


   for(n in 1:N){
   	
   	           	transfer[1] = log(prior_pi_modeling[1, n]) +  beta_lpdf( (angular_change_cluster[n]/pi())| alpha_non_per_angle, beta_non_per_angle); 
   	          	 	          	
   	          	transfer[2] = log(prior_pi_modeling[2, n]) +  beta_lpdf((angular_change_cluster[n]/pi())| alpha_per_angle, beta_per_angle);  	   
   	          

                target += log_sum_exp( transfer );
                
                if(max(transfer)==transfer[1]){ 
                	
                	target +=lognormal_lpdf(radius_cluster[n] | alpha_non_per_radius, beta_non_per_radius);

}                
                if(max(transfer)==transfer[2]){
                	
                	target +=lognormal_lpdf(radius_cluster[n] | alpha_per_radius, beta_per_radius);
                	
                	}
                

   }

                       
  
 }"

I manually separated the radius part. By using

                    fit <- stan(model_code = stanmodelcode, model_name = "example",
                data = dat.cluster.4, iter = 1000, chains = 8, verbose = TRUE, control = list(adapt_delta = 0.9999, stepsize = 0.001, max_treedepth =15), cores=8)
                
I got the following result

>                     print(fit_ss$summary)
                              mean      se_mean          sd          2.5%
prior_pi[1]              0.3764711   0.10041494   0.2009592     0.1376445
prior_pi[2]              0.6235289   0.10041494   0.2009592     0.2179815
P[1,1,1]                 0.3801863   0.09655334   0.1932178     0.1398530
P[1,1,2]                 0.6198137   0.09655334   0.1932178     0.2368528
P[1,2,1]                 0.4522426   0.09892335   0.1979552     0.1815026
P[1,2,2]                 0.5477574   0.09892335   0.1979552     0.2061507
P[2,1,1]                 0.4436479   0.09343571   0.1869700     0.2586955
P[2,1,2]                 0.5563521   0.09343571   0.1869700     0.2294627
P[2,2,1]                 0.4068376   0.13352512   0.2672146     0.1265628
P[2,2,2]                 0.5931624   0.13352512   0.2672146     0.1298532
alpha_non_per_radius     0.8129261   0.78111202   1.5630446    -0.2161582
beta_non_per_radius      1.3147298   0.37546599   0.7513290     0.8171677
alpha_per_radius         0.2585232   0.22352504   0.4474348    -0.9011120
beta_per_radius          1.2031296   0.37735863   0.7551139     0.6872933
alpha_per_angle          1.9637384   0.20160161   0.4034314     1.3729530
beta_per_angle           5.7979664   0.66569337   1.3320983     4.0125342
alpha_non_per_angle      3.5532393   1.11268094   2.2265631     1.3269409
beta_non_per_angle       2.2606035   0.21655704   0.4333837     1.6565645
lp__                 -2881.7569818 413.73155873 827.8903103 -4860.2304270
                               25%           50%           75%         97.5%
prior_pi[1]              0.2555593  3.027808e-01     0.5134481     0.7820185
prior_pi[2]              0.4865519  6.972192e-01     0.7444407     0.8623555
P[1,1,1]                 0.2477330  3.397177e-01     0.4498614     0.7631472
P[1,1,2]                 0.5501386  6.602823e-01     0.7522670     0.8601470
P[1,2,1]                 0.2850488  4.624272e-01     0.5867090     0.7938493
P[1,2,2]                 0.4132910  5.375728e-01     0.7149512     0.8184974
P[2,1,1]                 0.2991540  3.646874e-01     0.5324278     0.7705373
P[2,1,2]                 0.4675722  6.353126e-01     0.7008460     0.7413045
P[2,2,1]                 0.2143347  2.616189e-01     0.5840389     0.8701468
P[2,2,2]                 0.4159611  7.383811e-01     0.7856653     0.8734372
alpha_non_per_radius    -0.1780985 -9.262668e-02     1.0416411     3.7568933
beta_non_per_radius      0.8694648  8.998656e-01     1.3312639     2.7111629
alpha_per_radius         0.2958152  4.591725e-01     0.4705028     0.5599967
beta_per_radius          0.7221164  7.379577e-01     1.3481838     2.8047095
alpha_per_angle          1.7081805  1.866690e+00     2.1497405     2.7507069
beta_per_angle           4.6163355  5.778767e+00     6.9176130     7.8860717
alpha_non_per_angle      1.8658726  3.094800e+00     3.8579596     8.6777145
beta_non_per_angle       1.9498927  2.252867e+00     2.4857999     3.0782765
lp__                 -3051.6708411 -2.404105e+03 -2357.9987077 -2335.8918311
                        n_eff      Rhat
prior_pi[1]          4.005154  93.38408
prior_pi[2]          4.005154  93.38408
P[1,1,1]             4.004605 159.48504
P[1,1,2]             4.004605 159.48504
P[1,2,1]             4.004389 146.56405
P[1,2,2]             4.004389 146.56405
P[2,1,1]             4.004223 189.73539
P[2,1,2]             4.004223 189.73539
P[2,2,1]             4.004924 134.33409
P[2,2,2]             4.004924 134.33409
alpha_non_per_radius 4.004203 273.80790
beta_non_per_radius  4.004231 202.51932
alpha_per_radius     4.006887  57.92543
beta_per_radius      4.004205 238.94649
alpha_per_angle      4.004529 143.81990
beta_per_angle       4.004277 195.69684
alpha_non_per_angle  4.004319 153.92392
beta_non_per_angle   4.004983  88.77896
lp__                 4.004131 411.07480

Definitely, the model did not converge. I have some questions and I appreciate your time answering them.

1- How many more iterations do I need to get the model converged? Considering that the result is obtained after 3 days and half, I need to get some number of iterations which makes sense with time as well.

2- Is my new result an indicator of non-convergence for all parameters equally? In other words, based on what we see as rhat, do you expect the same converging speed for all parameters?

3- Does it make sense to you to rewrite the last part of code as follows

   for(n in 1:N){
   	
   	           	# transfer[1] = log(prior_pi_modeling[1, n]) +  beta_lpdf( (angular_change_cluster[n]/pi())| alpha_non_per_angle, beta_non_per_angle); 
   	          	 	          	
   	          	# transfer[2] = log(prior_pi_modeling[2, n]) +  beta_lpdf((angular_change_cluster[n]/pi())| alpha_per_angle, beta_per_angle); 


                if(max(prior_pi_modeling[, n])==prior_pi_modeling[1, n]){

                    target +=beta_lpdf( (angular_change_cluster[n]/pi())| alpha_non_per_angle, beta_non_per_angle);

                    target +=lognormal_lpdf(radius_cluster[n] | alpha_non_per_radius, beta_non_per_radius);


                } 

                if(max(prior_pi_modeling[, n])==prior_pi_modeling[2, n]){

                    target +=beta_lpdf((angular_change_cluster[n]/pi())| alpha_per_angle, beta_per_angle); 

                    target +=lognormal_lpdf(radius_cluster[n] | alpha_per_radius, beta_per_radius);

                } 	   

Do you think it affects the convergence speed?

Regards
Elaheh

[edit: escaped code, but it’s still unreadable with the spacing]

What specifically are you talking about? That’s a rather large file.

If you have a mixture model, your R-hats are going to be bad simply due to identifiability. Michael Betancourt wrote a case study on mixture models on our web site that may be helpful and there’s some basic info in the manual chapter on problematic posteriors.

Sorry, missed the questions for all the code the first time.

It’s not going to converge if that’s where you’re at now.

When things are this bad, all is lost.

I’m still not sure what all of this is doing, but you have a bunch of problematic constructs in your model including interval-constrained priors and conditionals that seem to be conditioned on parameters. Thse will break continuous differentiability unless you’re very careful (e.g., doing smooth interpolations).

Hi Bob,

Thanks for the time you spent on my questions. I will implement them and let you know what happens next.

Regards,
Elaheh

I have a similar problem where I’m fitting a model to multiple (24) datasets. My model converges well for most data sets but not for others, I think due to poor parameter identifiability.

Originally my 12 parameters had uniform priors. I think adding informative priors is probably the way to go to assist with identifiability.

With 24 data sets, can you use a hierarchical model?

1 Like

I don’t know what a hierarchical model is! Possibly. This would mean there are parameters that apply to several data sets?

I am fitting a simple hydrological simulation model to streamflow and chemistry data from 8 different catchments x 3 different time periods.

I compare the results from the 3 time periods to assess the identifiability/stability of the catchment parameters.

OK, so that’s time series, too.

With hierarchical models, the idea is to let each condition (catchemnt, time period) get its own parameter (as you do when you fit them all together), but then you give it a prior you also fit at the same time. For example, if alpha is a parameter in every model, you would have alpha[1], ..., alpha[24] and set up a prior like

alpha ~ normal(mu, sigma);

where mu and sigma are themselves parameters to estimate. Then mu gives you the mean among all conditions, sigma the scale of deviation across conditions.

There’s a tutorial in the Stan case studies using Radon and another using repeated binary trials that are gentle intros to hierarchical models.

1 Like

Yeah exactly. Although then you would think of it as one larger data set that can be sliced up to give you your separate datasets. For example, if you think the sites are not independent then it may make sense to model the site-specific parameters as having common dependencies on parameters that affect all possible sites. Like how in the eight schools example in the RStan vignette and the Stan manual (and BDA3) the school specific parameters \theta_1,\ldots,\theta_8 each depend on the so-called population or global parameters \mu and \tau.

1 Like