Divergent transitions in Beta model

Hi everyone,

I am fitting a reasonably standard Fay-Herriot style Beta regression model with rstan in R. As some of you may know, the mean-precision parameterization of the Beta distribution gives

\theta_i \sim \text{Beta}(\alpha_i, \beta_i), \quad \alpha_i = \mu_i \phi_i, \quad \beta_i = \phi_i - \alpha_i = (1 - \mu_i)\phi_i,

where \theta_i is the known direct proportion estimate (bounded between 0 and 1) and \psi_i is its known sampling variance (bounded between 0 and 0.25). Also note the relationship

\phi_i = \frac{\mu_i (1-\mu_i)}{\psi_i} - 1.
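So both shape parameters are pinned down once \mu_i and \psi_i are given; in R the mapping is simply (the helper name below is mine, just for illustration):

# method-of-moments mapping from (mu, psi) to the two Beta shape parameters
# (illustrative helper, not part of the Stan code further down)
beta_shapes <- function(mu, psi) {
  phi <- mu * (1 - mu) / psi - 1        # precision implied by the known sampling variance
  c(alpha = mu * phi, beta = (1 - mu) * phi)
}
beta_shapes(mu = 0.3, psi = 0.01)       # alpha = 6, beta = 14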

The model fits well in Stan, but unfortunately I cannot find a way to reduce the number of divergent iterations. About 25% of my iterations diverge even though the Rhats are very good (i.e. < 1.001). Inspecting the pairs() plots, we see that the parameters specified in the transformed parameters block are (by design) exact transformations of other parameters. I am worried that these transformations are causing Stan to diverge because of the perfect correlation between parameters.
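(For anyone loading the .rds files, this is roughly how I am counting the divergences and inspecting the pairs plots; the parameter names below are only examples.)

# count divergent transitions and look at a pairs plot; `fit` is the stanfit
# object from the .rds, and the parameter names are just examples
sp <- rstan::get_sampler_params(fit, inc_warmup = FALSE)
sum(sapply(sp, function(x) sum(x[, "divergent__"])))
pairs(fit, pars = c("mu[1]", "phi[1]", "sigma_v"))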

A second problem is the initial values. Unless I specify init = 0, the Stan model cannot find suitable starting values and does not even start sampling. The error it gives is related to invalid values of \alpha_i.
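In other words, the only call that starts sampling is something along these lines (the data object name is a placeholder):

# without init = 0, initialization fails before sampling starts
fit <- rstan::stan(file = "BETA2_stan.stan", data = stan_data,
                   chains = 4, init = 0)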

I have pasted the Stan model below, and here (forStanForum.zip - Google Drive) you can find a .zip with the Stan code and two .rds files containing the data and the rstan fit object, for those wishing to explore the divergent iterations in more detail.

functions {
	// function to transform psi to unconstrained space
	vector L(vector psi, int n){
		vector[n] temp = psi ./ (0.25 - psi);
		return log(temp);
	}
	// inverse of L()
	vector invL(vector eta, int n){
		vector[n] temp1 = 0.25 * exp(eta);
		vector[n] temp2 = 1 + exp(eta);
		return temp1 ./ temp2;
	}
}
data {
	int<lower=0> m; 									// number of sampled areas
	int<lower=0> M;										// total number of areas
	int<lower=0> nu_stable_psi;							// number of sampled areas with stable psi
	vector[M] logNi; 									// log population size in each area

	int<lower=0> q_c;           						// number of parameters in BETA MODEL
	matrix[M, q_c] x_c; 								// covariates for BETA MODEL
	
	vector<lower=0,upper=1>[m] theta_o; 				// observed theta
	vector<lower=0, upper=0.25>[nu_stable_psi] psi_o; 	// observed psi
}
transformed data{
	vector[nu_stable_psi] Lpsi_o = L(psi_o, nu_stable_psi); // unconstrained observed psi's
}
parameters {
	vector[q_c] B_c;						// fixed parameters
	vector[M] v; 							// random effect
	real<lower=0> sigma_v;
	real z_0;								// fixed parameters for PSI MODEL
	real<upper=0> z_1;						// fixed parameters for PSI MODEL
	real<lower=0> sigma_psi;				// error for PSI MODEL	
	vector[M-nu_stable_psi] Lpsi_obar;		// unconstrained variances for the areas with unstable variance estimates
}
transformed parameters{
	// DECLARATIONS
		// length M
		vector[M] eta_c; 										// linear pred for BETA MODEL
		vector<lower=0, upper=1>[M] mu;							// mean for BETA MODEL
		vector[M] mu_psi;										// mean of PSI MODEL
		vector<lower=0,upper=1>[M] theta;						// full vector of theta
		vector<lower=0, upper=0.25>[M] psi;						// full vector of psi
		vector<lower=0>[M] phi;									// full vector of phi
		vector<lower=0>[M] alpha; 								// Beta shape parameter 1	
		// length: various
		vector<lower=0,upper=1>[M-m] theta_obar;				// missing theta's
		
	// DEFINITIONS
		// length M
			// linear predictor for BETA MODEL
			eta_c = x_c*B_c + v*sigma_v; 
			mu = inv_logit(eta_c);
		// length: various 
			// impute missing theta's			
			for(j in 1:M-m){
				theta_obar[j] = mu[j+m];
			}
			// PSI MODEL
			mu_psi = z_0 + z_1 * logNi;
	
		// create FULL VECTORS
			theta = append_row(theta_o, theta_obar);
			psi = invL(append_row(Lpsi_o, Lpsi_obar), M);
			phi = ((mu .* (1 - mu)) ./ (psi)) - 1;
			alpha = mu .* phi; 
}
model{
	// PRIORS
		// for BETA MODEL
			for(j in 1:q_c){
				B_c[j] ~ normal(0,3);
			}
			v ~ std_normal();
			sigma_v ~ cauchy(0,2);
		// for PSI MODEL
			z_0 ~ normal(0,3);
			z_1 ~ normal(0,3);
			sigma_psi ~ cauchy(0,2);
			
	// PSI MODEL
		for(k in 1:nu_stable_psi){
			target += normal_lpdf(Lpsi_o[k] | mu_psi[k], sigma_psi);
		}

	// MODEL
		// draw from prior for non-sampled psi's
		for(k in nu_stable_psi+1:M){
			target += normal_lpdf(Lpsi_obar[k-nu_stable_psi] | mu_psi[k], sigma_psi);
		}
		// Likelihood for complete theta's and phi's
		target += beta_lpdf(theta | alpha, phi - alpha);
}
generated quantities{
	real log_lik_beta[nu_stable_psi];
	vector[M] theta_rep;
	
	// log likelihood values for BETA MODEL
	for(f in 1:nu_stable_psi){
		log_lik_beta[f] = beta_lpdf(theta[f] | alpha[f], phi[f] - alpha[f]);
	}

	// draw from predictive distributions
	for(j in 1:M){
		theta_rep[j] = beta_rng(alpha[j], phi[j] - alpha[j]);
	}
}

I have tried declaring the vectors \alpha and \phi in the model block instead, but this didn’t help. Initially I was using the beta_proportion_lpdf function, but I opted for beta_lpdf as I believe this setup makes it clearer what is really happening in the Beta model.

I have a sneaking suspicion that inference from this model should be valid and that Stan is only complaining because of the perfectly correlated parameters that are calculated in the transformed parameters block. I would be interested in others’ thoughts, and in particular in how I might reparameterize the model to reduce the divergences.
BETA2_stan.stan (3.2 KB)

Thank you in advance for any assistance.

EDIT: Here is an example of the correlation between mu and phi. This is to be expected (seeing as we told Stan to do this).
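(For anyone wanting to reproduce this kind of plot from the .rds files, something like the following works; area index 1 is arbitrary.)

# scatterplot of posterior draws of mu and phi for one area
post <- rstan::extract(fit, pars = c("mu", "phi"))
plot(post$mu[, 1], post$phi[, 1], xlab = "mu[1]", ylab = "phi[1]")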

Divergences and initialization issues most likely have a common source. Initialization fails because the model is not defined over the entire (declared) parameter space, and every time the sampler tries to explore outside the well-defined region you get a divergence.

For the model to be well-defined you need \phi_i > 0 and the equation

\phi_i = \frac{\mu_i\left(1-\mu_i\right)}{\psi_i} - 1

translates this into \left|\mu_i-\frac{1}{2}\right| < \sqrt{1-4\psi_i}, but your parametrization does not ensure this restriction. Hence the sampler will sometimes try “impossible” parameter values with \phi_i<0.
One potential solution is to marginalize out the random effect v and set the required bounds on eta_c.

parameters {
  ...
  vector<lower=...,upper=...>[M] eta_c;
}
model {
  eta_c ~ normal(x_c*B_c, sigma_v);
  ...
}

You need to define a new function to compute the bounds, which is a bit annoying.

I think Lpsi_obar should also have a lower bound.

Thank you for your response, nhuurre.

This makes sense. I’ll give it a try.

Out of interest, do you have any experience with defining bounds that change at each iteration? Is this even possible in Stan?

Parameter bounds may depend on earlier parameters. This is a valid program:

parameters {
  vector<lower=0,upper=0.25>[5] psi;
  vector<lower=0.5-sqrt(1-4*psi),upper=0.5+sqrt(1-4*psi)>[5] mu;
}

(Now that I look at this again, psi=0 should result in 0 < mu < 1, which means the bounds I calculated are a bit off… I guess I forgot to divide the square root by two.)

What a very helpful feature!

Yes, I believe the correct bounds on \mu are

\left( \frac{1-\sqrt{1-4 \psi_i}}{2}, \frac{1+\sqrt{1-4 \psi_i}}{2} \right)
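A quick numerical check of these bounds in R (the helper name is mine):

# admissible interval for mu given psi, i.e. the region where phi > 0
mu_bounds <- function(psi) {
  half_width <- 0.5 * sqrt(1 - 4 * psi)
  cbind(lower = 0.5 - half_width, upper = 0.5 + half_width)
}
mu_bounds(c(0, 0.01, 0.24))  # psi = 0 gives (0, 1); psi near 0.25 pins mu near 0.5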

I have tried this, but I am still getting divergences. The new Stan model is below.

functions {
	// function to transform psi to unconstrained space
	vector L(vector psi){
		return log(psi ./ (0.25 - psi));
	}
	// inverse of L()
	vector invL(vector eta){
		return (0.25 * exp(eta)) ./ (1 + exp(eta));
	}
	// function to transform eta_c to conditional constrained space
	//vector Ltwo(vector eta, vector psi, int M){
	//	vector[M] lgteta = inv_logit(eta);
	//	vector[M] l = 0.5 * (1 - sqrt(1 - 4 * psi));
	//	vector[M] u = 0.5 * (1 + sqrt(1 - 4 * psi));
	//	return l + (u + l) .* lgteta;
	//}
}
data {
	int<lower=0> m; 									// number of sampled areas
	int<lower=0> M;										// total number of areas
	int<lower=0> nu_stable_psi;							// number of sampled areas with stable psi
	vector[M] logNi; 									// log population size in each area

	int<lower=0> q_c;           						// number of parameters in BETA MODEL
	matrix[M, q_c] x_c; 								// covariates for BETA MODEL
	
	vector<lower=0,upper=1>[m] theta_o; 				// observed theta
	vector<lower=0, upper=0.25>[nu_stable_psi] psi_o; 	// observed psi
}
transformed data{
	vector[nu_stable_psi] Lpsi_o = L(psi_o); 			// unconstrained observed psi's
}
parameters {
	vector[q_c] B_c;						// fixed parameters
	vector[M] v; 							// random effect
	real<lower=0> sigma_v;
	real z_0;								// fixed parameters for PSI MODEL
	real<upper=0> z_1;						// fixed parameters for PSI MODEL
	real<lower=0> sigma_psi;				// error for PSI MODEL	
	vector[M-nu_stable_psi] Lpsi_obar;		// unconstrained variances for the areas with unstable variance estimates
}
transformed parameters{
	// DECLARATIONS
		// length M
		vector<lower=0,upper=1>[M] theta;						// full vector of theta
		vector<lower=0, upper=0.25>[M] psi;						// full vector of psi
		vector<lower=0>[M] phi;									// full vector of phi
		vector<lower=0>[M] alpha; 								// Beta shape parameter 1
		vector[M] eta_c; 										// linear pred for BETA MODEL
		vector[M] lowerMU;
		vector[M] upperMU;
		vector<lower=lowerMU,upper=upperMU>[M] mu;				// mean for BETA MODEL
		vector[M] mu_psi;										// mean of PSI MODEL
		// length: various
		vector<lower=0,upper=1>[M-m] theta_obar;				// missing theta's
		
	// DEFINITIONS
		psi = invL(append_row(Lpsi_o, Lpsi_obar));
		lowerMU = 0.5 * (1 - sqrt(1 - 4 * psi));
		upperMU = 0.5 * (1 + sqrt(1 - 4 * psi));
		// length M
			// linear predictor for BETA MODEL
			eta_c = x_c*B_c + v*sigma_v; 
			mu = inv_logit(eta_c);
		// length: various 
			// impute missing theta's			
			for(j in 1:M-m){
				theta_obar[j] = mu[j+m];
			}
			// PSI MODEL
			mu_psi = z_0 + z_1 * logNi;
	
		// create FULL VECTORS
			theta = append_row(theta_o, theta_obar);
			phi = ((mu .* (1 - mu)) ./ (psi)) - 1;
			alpha = mu .* phi; 
}
model{
	// PRIORS
		// for BETA MODEL
			for(j in 1:q_c){
				B_c[j] ~ normal(0,3);
			}
			v ~ std_normal();
			sigma_v ~ cauchy(0,2);
		// for PSI MODEL
			z_0 ~ normal(0,3);
			z_1 ~ normal(0,3);
			sigma_psi ~ cauchy(0,2);
			
	// PSI MODEL
		for(k in 1:nu_stable_psi){
			target += normal_lpdf(Lpsi_o[k] | mu_psi[k], sigma_psi);
		}

	// MODEL
		// draw from prior for non-sampled psi's
		for(k in nu_stable_psi+1:M){
			target += normal_lpdf(Lpsi_obar[k-nu_stable_psi] | mu_psi[k], sigma_psi);
		}
		// Likelihood for complete theta's and phi's
		target += beta_lpdf(theta | alpha, phi - alpha);
}
generated quantities{
	real log_lik_beta[nu_stable_psi];
	vector[M] theta_rep;
	
	// log likelihood values for BETA MODEL
	for(f in 1:nu_stable_psi){
		log_lik_beta[f] = beta_lpdf(theta[f] | alpha[f], phi[f] - alpha[f]);
	}

	// draw from predictive distributions
	for(j in 1:M){
		theta_rep[j] = beta_rng(alpha[j], phi[j] - alpha[j]);
	}
}

You’ll see that I have also tried (inspired by 21.4 Vectors with Varying Bounds | Stan User’s Guide) the commented-out function labelled “function to transform eta_c to conditional constrained space”, but this did not help. I suppose I am getting confused about when I should manually transform a parameter into the correct bounds and when I should let Stan do this. For example, above we take eta_c (unconstrained) and transform it into the bounds (0, 1) using inv_logit(), but then Stan will also do a change of variables to map the (0, 1) bounds of mu to (lowerMU, upperMU) during sampling. Any idea what I am missing here?

The constraints must be in the parameters block for them to help the sampler choose good proposals. It’s quite confusing, but constraints in the transformed parameters block are only useful for debugging. By the time transformed parameters executes it is already too late to prevent a divergence; you can only report what went wrong.
That’s why I suggested removing v and moving eta_c to parameters where its constraints are most useful.

It’s better to let Stan handle the transform if possible because then you don’t need to worry about Jacobians.
I think that manual page dates back to when the bounds had to be scalars. Nowadays you can declare vectorized bounds for vectors.

Hi nhuurre,

This is super helpful advice! I have given it a go! The divergences are definitely gone, but now I get some super strange trace plots. Below is the new Stan code.

functions {
	// function to transform psi to unconstrained space
	real L(real psi){
		return log(psi / (0.25 - psi));
	}
}
data {
	int<lower=0> m; 									// number of sampled areas
	int<lower=0> M;										// total number of areas
	int<lower=0> nu_stable_psi;							// number of sampled areas with stable psi
	vector[M] logNi; 									// log population size in each area

	int<lower=0> q_c;           						// number of parameters in BETA MODEL
	matrix[M, q_c] x_c; 								// covariates for BETA MODEL
	
	vector<lower=0,upper=1>[m] theta_o; 				// observed theta
	vector<lower=0, upper=0.25>[nu_stable_psi] psi_o; 	// observed psi
}
parameters {
	vector[q_c] B_c;																// fixed parameters
	real<lower=0> sigma_v;															// variance of random effect
	real z_0;																		// fixed parameters for PSI MODEL
	real<upper=0> z_1;																// fixed parameters for PSI MODEL
	real<lower=0> sigma_psi;														// error for PSI MODEL
	vector<lower=0,upper=0.25>[M - nu_stable_psi] psi_obar;							// missing psi's
	vector<lower=logit(0.5 * (1 - sqrt(1 - 4 * psi_o))),
	upper=logit(0.5 * (1 + sqrt(1 - 4 * psi_o)))>[nu_stable_psi] eta_c_o; 			// BETA MODEL: linear pred for stable psi
	vector<lower=logit(0.5 * (1 - sqrt(1 - 4 * psi_obar))),
	upper=logit(0.5 * (1 + sqrt(1 - 4 * psi_obar)))>[M - nu_stable_psi] eta_c_obar; // BETA MODEL: linear pred for unstable psi
}
transformed parameters{
// create FULL VECTORS
	vector[M] psi = append_row(psi_o, psi_obar);
	vector[M] eta_c = append_row(eta_c_o, eta_c_obar);
	vector[M] phi = ((inv_logit(eta_c) .* (1 - inv_logit(eta_c))) ./ (psi)) - 1;
	vector[M] alpha = inv_logit(eta_c) .* phi;
}
model{	
	// PRIORS
		// for BETA MODEL
			for(j in 1:q_c){
				B_c[j] ~ normal(0,3);
			}
			eta_c ~ normal(x_c*B_c, sigma_v);
			sigma_v ~ cauchy(0,2);
		// for PSI MODEL
			z_0 ~ normal(0,3);
			z_1 ~ normal(0,3);
			sigma_psi ~ cauchy(0,2);
			
	// PSI MODEL
		for(k in 1:nu_stable_psi){
			target += normal_lpdf(L(psi_o[k]) | z_0 + z_1 * logNi[k], sigma_psi);
		}

	// MODEL
		// draw from prior for non-sampled psi's
		for(k in nu_stable_psi+1:M){
			target += normal_lpdf(L(psi_obar[k-nu_stable_psi]) | z_0 + z_1 * logNi[k], sigma_psi);
		}
		// Likelihood for stable theta's and phi's
		for(i in 1:m){
			target += beta_lpdf(theta_o[i] | alpha[i], phi[i] - alpha[i]);
		}
}
generated quantities{
	real log_lik_beta[m];
	vector[M] theta_rep;
	
	// Parameters of interest
	vector[M] mu = inv_logit(append_row(eta_c_o, eta_c_obar)); 
	
	// log likelihood values for BETA MODEL
	for(f in 1:m){
		log_lik_beta[f] = beta_lpdf(theta_o[f] | alpha[f], phi[f] - alpha[f]);
	}

	// draw from predictive distributions
	for(j in 1:M){
		theta_rep[j] = beta_rng(alpha[j], phi[j] - alpha[j]);
	}
}


And here is what I mean by no convergence for the 4 parallel chains.

Observation 1 has a very small \psi, and a very small direct estimate \theta. It’s clear that the resulting \text{Beta} distribution is bimodal.
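For illustration (the numbers below are made up): with a small \mu, even a modest \psi can exceed \mu(1-\mu)/2, which pushes \phi below 1; then both shape parameters are below 1 and the Beta density is U-shaped.

# made-up values: psi close to mu*(1 - mu) gives phi < 1, so both shape
# parameters are < 1 and the Beta density piles up at 0 and 1
mu  <- 0.05; psi <- 0.04
phi <- mu * (1 - mu) / psi - 1                       # about 0.19
curve(dbeta(x, mu * phi, (1 - mu) * phi), from = 0.001, to = 0.999)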

Any idea how I should reparameterize to stop this? It seems that a single observation like obs 1 is really having an effect on the convergence of the regression parameters.

Also, the marginalization trick for eta_c works very well. Thank you for the suggestion. However, down the line I will be adding spatial random effects, so it would be helpful to define the linear predictor explicitly. Would the setup be eta_c ~ normal(x_c*B_c + s * sigma_s, sigma_v), or should I look for another solution?

Thank you so much for all your assistance so far. You’ve been immeasurably helpful.

Bimodality can be tricky to remove. You could try setting inits eta_c_o = logit(theta_o), as that’s probably near the true value.
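With rstan that could look something like this (a rough sketch; compiled_model and stan_data are placeholders, it assumes theta_o, nu_stable_psi and M are available in your R session, that the first nu_stable_psi entries of theta_o line up with eta_c_o, and that logit(theta_o) respects the psi-dependent bounds):

# rough sketch of per-chain inits on the constrained scale; parameters left
# out of the list are still initialized randomly by rstan
init_fun <- function() {
  list(
    eta_c_o    = qlogis(theta_o[1:nu_stable_psi]),
    eta_c_obar = rep(0, M - nu_stable_psi)   # 0 always lies inside the psi_obar-dependent bounds
  )
}
fit <- rstan::sampling(compiled_model, data = stan_data, init = init_fun, chains = 4)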

Yes, I think that looks ok.