Divergent transitions in hierarchical model with random effects


I’m currently working on a hierarchical model for spruce budworm development times. I’ve constructed my model in Template Model Builder (TMB, which by default does maximum likelihood with a Laplace approximation over Gaussian random effects), but am using Stan to do a Bayesian/HMC analysis via the ‘tmbstan’ package (https://CRAN.R-project.org/package=tmbstan), which uses Stan/HMC/NUTS as a back-end sampler for models originally constructed in TMB. The sampler reports 6 divergent transitions out of 18,000 post-warmup samples (iter = 5000, warmup = 500, chains = 4, adapt_delta = 0.95). Increasing adapt_delta to 0.99 eliminates some but not all of the divergences. I have non-centered all of my latent variables to avoid funnelling, and I’ve read the Gelman et al. “Bayesian workflow” paper and all of the publicly posted advice I can find about resolving divergent transitions.

The data are interval-censored observations of times to development events. My model consists of: a temperature-dependent development rate curve giving the median development rate of the population at a given temperature (parameters rho25, HA, TL, HL, TH, HH); a lognormal distribution of individual variation around that median (its variance parameter, s_eps, is estimated/sampled); and an iid Cauchy random effect/latent variable that allows for lack of fit/variation around the parametric development rate curve at each treatment temperature (its scale parameter, s_upsilon, is also estimated/sampled). The likelihood structure of the model follows Régnière et al. 2012 (https://doi.org/10.1016/j.jinsphys.2012.01.010), with a development rate curve from Schoolfield et al. 1981 (https://doi.org/10.1016/0022-5193(81)90246-0). The model code is as follows (written in C++, but analogous to Stan code):

#include <TMB.hpp> 
#include <iostream>
#include <string>

template<class Type>
Type calc_TA(Type HL, Type HH, Type TL, Type TH) {
  Type c3 = Type(0.0001987);  // gas constant, kcal/(mol K)
  Type den, tphi;
  den = c3*log(-HL/HH) + (HL/TL) - (HH/TH);
  tphi = (HL - HH)/den;
  return tphi;
}

template<class Type>
Type calc_pred(Type temp, Type rho25, Type HA, Type TL, Type HL, Type TH, Type HH, Type TA) {
	Type c1 = Type(273);
	Type c2 = TA;
	Type c3 = Type(0.0001987);
	Type tK, num, den1, den2, tau;
	tK = temp + c1;
	num = rho25 * tK/c2 * exp(HA/c3 * (1/c2 - 1/tK));
	den1 = exp(HL/c3 * (1/TL - 1/tK));
	den2 = exp(HH/c3 * (1/TH - 1/tK));
	tau = 1/(num/(1 + den1 + den2));  // median development time = 1/rate
	return tau;
}

template<class Type>
Type dnorm1(Type x){
  return Type(1.0/sqrt(2.0*M_PI)) * exp(-Type(.5)*x*x);
}

// Atomic definition of log(pnorm(x)) with its reverse-mode derivative
TMB_ATOMIC_VECTOR_FUNCTION(
  // ATOMIC_NAME
  pnorm_log
  ,
  // OUTPUT_DIM
  1
  ,
  // ATOMIC_DOUBLE
  ty[0] = atomic::Rmath::Rf_pnorm5(tx[0],0,1,1,1);
  ,
  // ATOMIC_REVERSE
  Type W  = ty[0];                 // Function value from forward pass
  Type DW = dnorm1(tx[0])/exp(W);  // d/dx log Phi(x) = phi(x)/Phi(x)
  px[0] = DW * py[0];
)

template<class Type>
Type pnorm_log(Type x){
  CppAD::vector<Type> tx(1);
  tx[0] = x;
  return pnorm_log(tx)[0];
}

template<class Type>
Type qcauchy(Type y, Type location, Type scale){
  Type q = location + scale*tan(M_PI*(y - Type(0.5)));
  return q;
}

template<class Type>
Type objective_function<Type>::operator() ()
{
	// Data and parameters (declarations inferred from their usage below)
	DATA_VECTOR(time1);
	DATA_VECTOR(time2);
	DATA_VECTOR(time2d);
	DATA_VECTOR(temp1);
	DATA_VECTOR(temp2);
	DATA_VECTOR(nobs);
	DATA_IVECTOR(block);
	DATA_INTEGER(use_prior);

	PARAMETER(rho25);
	PARAMETER(HA);
	PARAMETER(TL);
	PARAMETER(HL);
	PARAMETER(TH);
	PARAMETER(HH);
	PARAMETER(s_eps);
	PARAMETER(s_upsilon);
	PARAMETER_VECTOR(upsilon);

	Type jnll = 0;
	Type tpred1;
	Type tpred2;
	Type epsm1;
	Type epsij;
	Type pnormdiff;
	Type nlogp;
	Type epsm1_std;
	Type epsij_std;
	Type TA;
	Type u_upsilon;
	Type c_upsilon;
	Type l_s_eps = log(s_eps);
	Type l_s_upsilon = log(s_upsilon);

	TA = calc_TA(HL, HH, TL, TH);  // invariant across observations
	for (int i=0; i<time1.size(); i++) {

		tpred1 = calc_pred(temp1(i), rho25, HA, TL, HL, TH, HH, TA);
		tpred2 = calc_pred(temp2(i), rho25, HA, TL, HL, TH, HH, TA);
		u_upsilon = pnorm(upsilon(block(i)), Type(0), Type(1));
		c_upsilon = qcauchy(u_upsilon, Type(1), s_upsilon);
		tpred1 *= c_upsilon;
		tpred2 *= c_upsilon;

		epsm1 = log(time1(i)/tpred1 + time2d(i)/tpred2);
		epsij = log(time1(i)/tpred1 + time2(i)/tpred2);
		epsm1_std = epsm1/s_eps;
		epsij_std = epsij/s_eps;

		Type pnorm_ij = pnorm_log(epsij_std);
		Type pnorm_m1 = pnorm_log(epsm1_std);
		pnormdiff =  logspace_sub(pnorm_ij, pnorm_m1);

		nlogp = nobs(i)*pnormdiff;
		jnll -= nlogp;
	}

	jnll -= sum(dnorm(upsilon, Type(0), Type(1), 1));
	if (use_prior == 1) {
	  jnll -= dgamma(rho25, Type(2), Type(0.25), 1); 
	  jnll -= dgamma(-HL, Type(6), Type(2), 1);
	  jnll -= dgamma(HA, Type(5), Type(0.2), 1);
	  jnll -= dgamma(HH, Type(10), Type(3), 1);
	  jnll -= dnorm(TL, Type(284), Type(2), 1);
	  jnll -= dnorm(TH, Type(304), Type(2), 1);
	  jnll -= dnorm(l_s_eps, Type(-1.5), Type(0.1), 1);
	  jnll -= dnorm(l_s_upsilon, Type(-2.5), Type(0.05), 1);
	}

	return jnll;
}


There are no model diagnostic warnings other than the divergent transitions. Here is the pairs plot for the initial run (adapt_delta = 0.95):


Judging by the pairs plot, the upsilon random effect terms are likely the cause of the divergences; indeed, removing the lack of fit portion of the model eliminates them.

Any advice on how to proceed with these divergent transitions would be greatly appreciated.

Two things off hand:

  1. Do you have a simulated data set with known parameters that you can use? This can help with tracking errors.

  2. See “Taming Divergences in Stan Models”. I can see the funnel-y and U-shapey stuff going on in the lower right-hand corner of your pairs plot.


@Ara_Winter: re #2: yes, we had noticed that (I’m helping with this project). It’s not immediately obvious to me/us how to reparameterize in order to avoid this funnel, which occurs among different elements of the random-effects vector rather than between the scale/SD parameter and the random effects as in the classic funnel examples I’ve read about; that is, upsilon[6] and upsilon[7] represent the deviations from the deterministic development curve at two different temperatures.

I guess we could use something other than an iid structure on the random effects (a random-walk prior? autocorrelated?), but I don’t know if that would help.


Is it possible for you to post the Stan code produced by tmbstan (is it human-readable?)? That might get more eyes on the problem.

FWIW, the warmup scheme looks a bit odd to me. You probably don’t need 4500 sampling iterations in Stan, but 500 warmup iterations is pretty short. You might try something more along the lines of warmup = 1000 or even warmup = 1500, with 1000 or 1500 sampling iterations. I’m not saying better warmup will necessarily solve your problems (it doesn’t generally solve funnels), but it might help.


I would also strongly suggest switching the Cauchy to a student t with moderate degrees of freedom, just to see if that squashes your issues. If it does, then you can have a careful think about whether that switch is justified.

Unfortunately tmbstan doesn’t produce Stan code. The way it works is that TMB produces a DLL/.so file that takes a parameter vector and returns the negative log-posterior and its autodiff-generated gradient vector; tmbstan then feeds that objective to Stan’s NUTS machinery to generate the posterior samples.


Maybe, but we had started with a Gaussian distribution for that latent variable. Switching to a Cauchy made things better, not worse. (And if we want to switch to an intermediate t-distribution, we will have to do a bit more work because of the way that TMB is structured.)

I was going to say the Cauchy on the random effect is kinda wide. I would usually go for a normal, or better yet maybe a student_t, unless that is a pain.


Hrm, I’m probably at the limits of my expertise here, but FWIW I find it quite strange that a switch from Gaussian to Cauchy improved the divergence issues. In general Gaussians are very easy to sample (constant curvature and all that) and Cauchys are really hard because the tails are so long. I almost wonder whether switching to the Cauchy removed a problem with the Gaussian model but replaced it with an entirely different problem.

Some other things to check: noncentering is useful when the prior dominates but can lead to funnels when the likelihood dominates. If your upsilons are strongly informed, you could consider trying a centered parameterization. If some are strongly informed and others are weakly informed, you could also consider a non-centered parameterization for the weakly informed ones and a centered parameterization for the strongly informed ones.


Like @jsocolar said (figured it was worth repeating), the high iteration count is a bit suspect too. At work I have a tangled multi-level structural equation model that fits fine with 3000 iters, 4 chains, 6 cores, and a warmup of 500.


I would 2nd the synthetic data approach. I have had divergences in some hierarchical models because some of the data were clearly not well fit by the model. If you can generate synthetic data that you know comes from this model and recover the generative parameters without issues, the problem may lie in your data. If your synthetic data still produces divergences then you need to work on the model specification.


Thank you all for the suggestions. The sampler still produces divergent transitions when running it on a simulated dataset, so it seems as though the issue may lie in the model specification. However, it turned out that since the number of divergent transitions was so low, increasing adapt_delta to 0.999 got rid of them even for a large number of sampling iterations.


Another thing to try (it’s time consuming) that might yield fewer divergent transitions and some gains in performance is to build the simplest model you can first and then add complexity to it iteratively. This way you can see which part of the model is causing trouble.

As an example, our messy SEM model started life as a linear model with a small number of variables. Over a few iterations we added in the multi-level components, then time handled in three different ways, and lastly the correlations between the different linear models.


As others have said, it’s going to be hard for most in the community to offer precise feedback without seeing a Stan program. Working through the TMB C++ isn’t impossible, but it’s asking a lot from people who don’t use TMB, especially with the non-informative variable names.

Divergences often indicate some degeneracy in the posterior density function, but they do not identify the nature of that degeneracy. In particular, degeneracies in a hierarchical model may be due to the hierarchical interactions, but not always, and in this case the hierarchical structure is likely a red herring, as the most serious interactions seem to be within the upsilon variables.

I have a detailed discussion of how to investigate the pathological behavior hinted at by divergences in Identity Crisis; in particular, by following those strategies you should be able to confirm whether or not the upsilon interactions are indeed causing the problem. If they are, then you can work through your measurement model to determine why the upsilon are degenerate for the observed data (for example, perhaps they are observed only in certain combinations), which will then inform what resolutions there might be.



I think it is a good idea to start with a Gaussian latent variable in \texttt{TMB} and test it with a simulated Gaussian response variable and only one covariate. That way you can compare the model with a Cauchy latent variable against the one with a Gaussian latent variable using the AIC criterion directly for your frequentist model.
I don’t know how well the Laplace approximation behaves for a latent Cauchy variable in \texttt{TMB}, so this could be a good starting point.

Another thing, have you tried putting \texttt{laplace=TRUE} in \texttt{tmbstan}?

Best wishes
