Reducing divergences for measurement error hierarchical model

Hi all,

I am trying to fit a kind of measurement error model (shown below) using rstan, but am still getting around 5-10% divergent transitions. I have read elsewhere on this forum that any divergences should trigger alarm bells.

The model setup is below.

\begin{aligned}
\theta_i &\sim N\left( \bar{\theta}_i, s(\theta_i) \right) \\
\psi_i &\sim N\left( \bar{\psi}_i, s(\psi_i) \right) \\
\theta_i &\sim N\left( \nu_i, \sqrt{\psi_i} \right) \\
\nu_i &\sim N\left( (X \beta)_i, \sigma_v \right) \\
X &\in \mathbb{R}^{M \times p}, \quad \beta \in \mathbb{R}^{p \times 1}
\end{aligned}

Where \bar{\theta}_i, s(\theta_i), \bar{\psi}_i, s(\psi_i) are known quantities. The distributions for \theta_i and \psi_i are approximated by normal distributions taken from a previous model run. Unfortunately, I cannot get the full joint model to run, but this approximate setup does work. In particular, convergence based on trace plots looks good and I get all Rhat < 1.02. However, as mentioned above, I am still getting some divergent transitions.

data {
	int<lower=0> m; 						// number of sampled areas
	int<lower=0> M; 						// total number of areas
	
	int<lower=0> q_a; 						// number of parameters in NORMAL MODEL
	matrix[M, q_a] x_a; 					// covariates for NORMAL MODEL
	
	vector[m] direct_mean;			
	vector<lower=0>[m] direct_sd;
	vector<lower=0>[m] psi_mean;
	vector<lower=0>[m] psi_sd;
}
transformed data{
	int q_a_C = q_a - 1;						// number of parameters minus 1
	matrix[M, q_a_C] x_a_C;						// centered version of X without an intercept
	vector[q_a_C] bar_x_a;						// column means of x before centering
	for(i in 2:q_a){						// column 1 of x_a is assumed to be the intercept
		bar_x_a[i - 1] = mean(x_a[,i]);
		x_a_C[, i - 1] = x_a[,i] - bar_x_a[i - 1];
	}
}
parameters {
	// NORMAL MODEL
	vector[q_a_C] beta_a;												// fixed parameters
	real intercept_a;													// temporary intercept
	vector[M] Z_v;														// standard normal for error
	real<lower=0> sigma_v;												// standard deviation of error
	
	// FROM LOGISTIC
	vector[m] theta_o;
	vector<lower=0>[m] psi_o;
}
transformed parameters{
	vector[M] nu = intercept_a + x_a_C * beta_a + Z_v * sigma_v;	// non-centered parameterization for the area effects
}
model{
	// PRIORS
		target += normal_lpdf(beta_a | 0, 2);
		target += student_t_lpdf(intercept_a | 3, 0, 0.5);
		target += std_normal_lpdf(Z_v);
		target += std_normal_lpdf(sigma_v);
		// FROM LOGISTIC
		target += normal_lpdf(theta_o | direct_mean, direct_sd);
		target += normal_lpdf(psi_o | psi_mean, psi_sd);
			
	// MODEL (vectorized over the m sampled areas)
	target += normal_lpdf(theta_o | head(nu, m), sqrt(psi_o));
		
}
generated quantities{
	real r_intercept_a = intercept_a - dot_product(bar_x_a, beta_a);
	vector[m] comp_est_weight = sigma_v^2 ./ ( psi_o + sigma_v^2 ); // composite estimator weight
	vector[M] mu = inv_logit(nu);
	vector[m] il_theta_o = inv_logit(theta_o);
}

Here are the trace plots for the fixed parameters.

And here is a pairs plot showing the divergent transitions. The parameter posteriors for observation 47 give Rhat \approx 1.02 and insufficient bulk ESS.

I cannot see any discernible pattern in the regions of the parameter space where the divergences appear, except of course the high correlation between nu and Z_v.

In this example I only have about 100 observations, so I suspect the problems have something to do with this. You’ll see that I have used the non-centred parameterization where I can.

I am fairly confident that the divergences could be resolved if I could reparameterize the model somehow. I wonder if anyone has any further tricks or ideas. Thank you in advance!

Jamie

This question has been unanswered for 4 days. Based on @martinmodrak’s post here, I thought I might tag some experts. @bbbales2 @syclik @jonah

I do apologize if this practice has ceased since 2020.

It looks like you have two priors for theta_o. Is that intentional?

  target += normal_lpdf(theta_o | direct_mean, direct_sd);
  target += normal_lpdf(theta_o | head(nu, m), sqrt(psi_o));

Hi @andrjohns, thank you for your question.

Yes, that is intentional. There is uncertainty in theta_o that is captured by a previous, disaggregated model run. The idea is to approximately propagate this uncertainty into the current model.

I don’t think that’s a consistent model. Take a look at the section on measurement error models in the Stan user’s guide, in particular the simple linear regression example in section 6.1.

In that example there’s a model for how the unobserved “real” variables are generated, and a model relating the observed values to the latent ones. In your model formulation flipping the positions of the observed and latent variables in the top line would (if I’m reading it right) produce a consistent model.


Hi @Michael_Peck, thank you for your answer. I have read the section on measurement error models several times. I agree that this is probably where the solution is.

Are you suggesting replacing

\theta_i \sim N(\bar{\theta}_i, \text{s}(\theta_i))

with

\bar{\theta}_i \sim N(\theta_i, \text{s}(\theta_i))

? I can see how this would help. However, I wasn’t aware that this was a valid operation. Would you mind elaborating on your thought process?

In terms of reparameterisation, the main opportunity I see here would be a non-centered parameterisation for psi_o. Given that you have a lower bound, you’ll need to construct the parameterisation manually. I’ve previously put together an example over in this post: How to constraint multilevel parameters when using a non-centered parametrisation? - #4 by andrjohns
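To sketch what I mean (untested, and assuming a Stan version recent enough to allow vector-valued bounds; otherwise the bound has to be handled element-wise), the idea is to sample a standardised raw parameter whose lower bound keeps psi_o positive:

parameters {
	// element-wise bound ensures psi_mean + psi_sd .* psi_raw > 0
	vector<lower = -psi_mean ./ psi_sd>[m] psi_raw;
}
transformed parameters {
	vector<lower=0>[m] psi_o = psi_mean + psi_sd .* psi_raw;
}
model {
	// implies psi_o ~ normal(psi_mean, psi_sd) truncated at zero,
	// matching your current <lower=0> declaration up to a constant
	target += std_normal_lpdf(psi_raw);
}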

However, I would also recommend doing some sanity checking of the consistency between the location and scale parameters of the two sets of priors for theta_o. If they’re markedly different, or if they’re strongly informative in different places/directions, then this could also result in some difficult posterior geometry.

Yes, something like that, with the caveat that I didn’t spend much time trying to understand your model. My reasoning is just the same as in that simple regression model with measurement error in the user’s guide. The author there has a model of sorts for how the latent variable x is generated, that is

x ~ normal(mu_x, sigma_x);

and then a model for how the observed data x_meas is generated from the latent x:

x_meas ~ normal(x, tau);

and finally a model relating the outcomes y with the latent predictors x. My understanding from the original post was that previous model runs give you information about the latent variables that can be expressed in the priors.
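Translated to your notation (just a sketch; I haven’t checked it against the rest of your model), the corresponding lines of the model block would be something like:

	// measurement model: the observed direct estimates are generated from the latent theta_o
	target += normal_lpdf(direct_mean | theta_o, direct_sd);
	// latent model: how the "true" theta_o are generated
	target += normal_lpdf(theta_o | head(nu, m), sqrt(psi_o));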

Thank you both @andrjohns and @Michael_Peck. I agree that there may be some inconsistency in the model specification. Unfortunately using the non-centred parameterization for psi_o or theta_o does not provide any respite.

You make a great point @Michael_Peck in terms of the latent x. I’ll have a good think about this.

I have since discovered that the priors on \theta_i and \psi_i are incorrectly treated as independent. I am currently trying to incorporate the correlation structure using a bivariate normal for each pair, as sketched below. To aid others who may view this post in the future, I’ll post any progress I make.
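The sketch below is what I have in mind, assuming the previous model run can also supply a known per-area correlation rho[i] between the two approximations (I have not yet tested this):

data {
	vector<lower=-1, upper=1>[m] rho;	// assumed known correlation between theta_o[i] and psi_o[i]
}
model {
	for (i in 1:m) {
		matrix[2, 2] Sigma;
		Sigma[1, 1] = square(direct_sd[i]);
		Sigma[2, 2] = square(psi_sd[i]);
		Sigma[1, 2] = rho[i] * direct_sd[i] * psi_sd[i];
		Sigma[2, 1] = Sigma[1, 2];
		// joint prior replacing the two independent normal priors above
		target += multi_normal_lpdf([theta_o[i], psi_o[i]]' | [direct_mean[i], psi_mean[i]]', Sigma);
	}
}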