Few-but-persistent divergences in hierarchical models with partially-pooled variances

Curious if anyone has noticed this before and if there’s a known solution of some sort. I’ve been playing with adding partial-pooling of the variances in hierarchical models (typical implementations like SUG1.13 leave these variances unpooled) and I consistently encounter a few post-warmup divergences even with very long warmup.

For context here’s a simple hierarchical model with the pooled-variances part highlighted through comments:


data{

	// nI: number of individuals
	int<lower=2> nI ;

	// nXc: number of condition-level predictors
	int<lower=2> nXc ;

	// rXc: number of rows in the condition-level predictor matrix
	int<lower=nXc> rXc ;

	// Xc: condition-level predictor matrix
	matrix[rXc,nXc] Xc ;

	// iXc: which individual is associated with each row in Xc
	array[rXc] int<lower=1,upper=nI> iXc ;

	// nY: num entries in the observation vector
	int<lower=nI> nY ;

	// Y: observations
	vector[nY] Y ;

	// yXc: which row in Xc is associated with each observation in Y
	array[nY] int<lower=1,upper=rXc> yXc ;

}

transformed data{

	// tXc: transposed copy of Xc
	matrix[nXc,rXc] tXc = transpose(Xc) ;

}

parameters{

	// locat_coef: coefficients associated with location of the Gaussian likelihood
	vector[nXc] locat_coef ;

	// indiv_locat_coef_sd: magnitude of variability among individuals within a group
	vector<lower=0>[nXc] indiv_locat_coef_sd ;
	// NOTE: indiv_locat_coef_sd is usually modelled in an unpooled manner whereby 
	//       the individual elements don't mutually inform, ex:
	//           indiv_locat_coef_sd ~ std_normal() ;
	//       To partially-pool, we could do instead:
	//           indiv_locat_coef_sd ~ normal(m,s) ;
	//       Or maybe: 
	//           indiv_locat_coef_sd ~ lognormal(lm,ls) ;
	//       Thus need two more parameters:
	// mean_indiv_locat_coef_sd: mean of the normal population from which the 
	//                           values in indiv_locat_coef_sd are considered samples
	real<lower=0> mean_indiv_locat_coef_sd ;
	// sd_indiv_locat_coef_sd: sd of the normal population from which the 
	//                           values in indiv_locat_coef_sd are considered samples
	real<lower=0> sd_indiv_locat_coef_sd ;

	// helper_indiv_locat_coef: helper-variable for non-centered parameterization of indiv_locat_coef
	matrix[nXc,nI] helper_indiv_locat_coef ;

	// noise: SD of Gaussian likelihood
	real<lower=0> noise ;

}

model{

	////
	// group-level structure
	////

	// weakly-informative priors on the group-level coefficients & measurement noise
	locat_coef ~ normal(0,10) ;
	noise ~ weibull(2,1) ;

	////
	// individual-level structure
	////
	
	// priors for the hyper-parameters for partial-pooling of indiv_locat_coef_sd
	mean_indiv_locat_coef_sd ~ std_normal() ;
	sd_indiv_locat_coef_sd ~ std_normal() ;
	
	// both normal & lognormal priors for indiv_locat_coef_sd seem to yield 
	// few-but-consistent divergences
	indiv_locat_coef_sd ~ normal(mean_indiv_locat_coef_sd,sd_indiv_locat_coef_sd) ;
	// indiv_locat_coef_sd ~ lognormal(mean_indiv_locat_coef_sd,sd_indiv_locat_coef_sd) ;

	// helper_indiv_locat_coef must have a standard-normal prior for the
	// non-centered parameterization of the individual-level coefficients
	to_vector(helper_indiv_locat_coef) ~ std_normal() ;

	////
	// observation-level structure
	////

	// observations as normal with common error variability and varying mean
	Y ~ normal(
		// in-place computation of the Gaussian locations implied by the
		// coefficients & contrasts associated with each observation.
		columns_dot_product(
			(
				rep_matrix(locat_coef,nI)
				+ (
					rep_matrix(indiv_locat_coef_sd,nI)
					.* helper_indiv_locat_coef
				)
			)[,iXc]
			, tXc
		)[yXc]
		, noise
	) ;

}

As the comments note, I seem to encounter divergences regardless of whether the partial-pooling is applied on the normal or log-normal scales. Any thoughts?

Oh! True face-palm moment: This is probably solved by non-centered parameterization of the partial-pooling (at least for the low-informative simulated data I’m using). Checking that now…
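
For concreteness, here's roughly what I mean by non-centering the pooled SDs: indiv_locat_coef_sd moves from parameters to transformed parameters, mean_indiv_locat_coef_sd becomes unconstrained (it now lives on the log scale), and a new standard-normal helper (hypothetically named helper_indiv_locat_coef_sd here) does the sampling:

parameters{
	// hyper-parameters for the partial-pooling, now on the log scale
	real mean_indiv_locat_coef_sd ;
	real<lower=0> sd_indiv_locat_coef_sd ;
	// helper_indiv_locat_coef_sd: standard-normal helper for the non-centered pooling
	vector[nXc] helper_indiv_locat_coef_sd ;
	...
}
transformed parameters{
	// implies indiv_locat_coef_sd ~ lognormal(mean_indiv_locat_coef_sd, sd_indiv_locat_coef_sd)
	vector<lower=0>[nXc] indiv_locat_coef_sd = exp(
		mean_indiv_locat_coef_sd
		+ sd_indiv_locat_coef_sd * helper_indiv_locat_coef_sd
	) ;
}
model{
	// hyper-priors as before
	mean_indiv_locat_coef_sd ~ std_normal() ;
	sd_indiv_locat_coef_sd ~ std_normal() ;
	// the helper gets a standard-normal prior; no explicit prior on indiv_locat_coef_sd
	helper_indiv_locat_coef_sd ~ std_normal() ;
	...
}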

Darn, no luck with the non-centered parameterization of the partial pooling either.

It would probably help to plot the location parameters against the population scale for your fit, to see whether the pathological funnel shape shows up. That might give a clue as to which parameters are problematic / causing the divergences. If you highlight the divergences in the plots, you might be able to target the issue.

Edit: Oh, since you are pooling the variances, I believe you’d want to plot indiv_locat_coef_sd against sd_indiv_locat_coef_sd. I misread your post originally.

Update: playing with simulated data a bit, I observe that divergences become more likely as the likelihood becomes weaker, which is the pattern usually seen in hierarchical models where the individual-level coefficients are centered-parameterized. But here I’ve been using a non-centered parameterization for the individual-level coefficients, and additionally non-centering the parameterization of the variances doesn’t seem to help. Here’s updated code that enables sample-time toggling between centered & non-centered parameterization of the pooled variances:


data{

	// nI: number of individuals
	int<lower=2> nI ;

	// nXc: number of condition-level predictors
	int<lower=2> nXc ;

	// rXc: number of rows in the condition-level predictor matrix
	int<lower=nXc> rXc ;

	// Xc: condition-level predictor matrix
	matrix[rXc,nXc] Xc ;

	// iXc: which individual is associated with each row in Xc
	array[rXc] int<lower=1,upper=nI> iXc ;

	// nY: num entries in the observation vector
	int<lower=nI> nY ;

	// Y: observations
	vector[nY] Y ;

	// yXc: which row in Xc is associated with each observation in Y
	array[nY] int<lower=1,upper=rXc> yXc ;

	// centered_groupLogVariance: binary toggle for centered/non-centered parameterization of the pooled log-variances
	int<lower=0,upper=1> centered_groupLogVariance ;

}

transformed data{

	// tXc: transposed copy of Xc so we can use the more-performant
	//      columns_dot_product() rather than rows_dot_product()
	matrix[nXc,rXc] tXc = transpose(Xc) ;

}

parameters{

	// noise: SD of Gaussian likelihood
	real<lower=0> noise ;

	// groupMean_locat_coef: group-level mean for coefficients associated with location of the Gaussian likelihood
	vector[nXc] groupMean_locat_coef ;

	// mean_groupLogVariance_locat_coef: mean of the normal population from which the
	//                           values in groupLogVariance_locat_coef are considered samples
	real mean_groupLogVariance_locat_coef ;
	// sd_groupLogVariance_locat_coef: sd of the normal population from which the
	//                           values in groupLogVariance_locat_coef are considered samples
	real<lower=0> sd_groupLogVariance_locat_coef ;

	// groupLogVariance_locat_coef: group-level variance for coefficients associated with location of the Gaussian likelihood
	vector<
		offset = (centered_groupLogVariance ? 0 : mean_groupLogVariance_locat_coef)
		, multiplier = (centered_groupLogVariance ? 1 : sd_groupLogVariance_locat_coef)
    >[nXc] groupLogVariance_locat_coef ;

	// helper_indiv_locat_coef: helper-variable for non-centered parameterization of indiv_locat_coef
	matrix[nXc,nI] helper_indiv_locat_coef ;

}
transformed parameters{

	// groupSD_locat_coef: for each coefficient, the SD of that coefficient across individuals
	//     Note: sqrt(exp()) implies that the normal variate model manifests on the log-variance scale
	vector[nXc] groupSD_locat_coef = sqrt(exp(groupLogVariance_locat_coef)) ;

	// indiv_locat_coef: each individual's coefficients
	matrix[nXc,nI] indiv_locat_coef = (
		rep_matrix(groupMean_locat_coef,nI)
		+ (
			rep_matrix(groupSD_locat_coef,nI)
			.* helper_indiv_locat_coef
		)
	) ;

	// indiv_locat: condition locations implied by coefficients and contrast matrix
	row_vector[rXc] indiv_locat = columns_dot_product(
		indiv_locat_coef[,iXc]
		, tXc
	) ;

}
model{
	// zero-avoiding prior for measurement noise
	noise ~ weibull(2,1) ;

	// unpooled normal model/prior for groupMean_locat_coef
	groupMean_locat_coef ~ std_normal();

	// priors for the hyper-parameters for partial-pooling of groupLogVariance_locat_coef
	mean_groupLogVariance_locat_coef ~ std_normal() ;
	sd_groupLogVariance_locat_coef ~ std_normal() ;

	// normal model for distribution of groupMean_locat_coef
	groupLogVariance_locat_coef ~ normal(
		mean_groupLogVariance_locat_coef
		, sd_groupLogVariance_locat_coef
	) ;

	// helper_indiv_locat_coef must have a standard-normal prior for non-centered
	// parameterization of indiv_locat_coef
	to_vector(helper_indiv_locat_coef) ~ std_normal() ;

	// observations as normal with common error variability and varying mean
	Y ~ normal(
		indiv_locat[yXc]
		, noise
	) ;

}

And here’s R code to simulate, fit, and pairs-plot:

# with both kinds of contrasts:
# if either num_ids or num_replicates == 2, divergences
# if one is 100 and the other is 10, no divergences
# if both are 10, divergences

library(tidyverse)

# make fake data ----
num_ids = 2
num_replicates = 2
(
	expand_grid(
		id = 1:num_ids
		, condition = factor(c('a','b'))
		, replicate = 1:num_replicates
	)
	%>% mutate(
		y = rnorm(n())
	)
) ->
	dat

# compute quantities for stan ----

#set default contrasts to sum
options(contrasts=c('contr.sum','contr.poly'))
#define helper function for alternative indicator-contrasts
get_indicator_contrasts = function(x){(
	x
	%>% mutate(
		isA = as.numeric(condition=='a')
		, isB = as.numeric(condition=='b')
	)
	%>% select(isA,isB)
)}

(
	dat
	%>% select(id,condition)
	%>% distinct()
	# %>% group_by(id,condition)
	%>% mutate(
		#contrast option 1 (sum contrasts)
		contrasts = model.matrix(
			data = .
			, object = ~ condition
		)
		#contrast option 2 (indicator contrasts)
		# contrasts = get_indicator_contrasts(.)
	)
) ->
	complete_Xc_with_vars

(
	complete_Xc_with_vars
	%>% semi_join(dat)
) ->
	Xc_with_vars

# join Xc with dat to label observations with corresponding row from Xc
(
	Xc_with_vars
	# add row identifier
	%>% mutate(Xc_row=1:n())
	# right-join with dat to preserve dat's row order
	%>% right_join(
		mutate(dat,dat_row=1:n())
	)
	%>% arrange(dat_row)
	# pull the Xc row identifier
	%>% pull(Xc_row)
) ->
	yXc

# package for stan ----
data_for_stan = lst( # lst permits later entries to refer to earlier entries
	
	# # # #
	# Entries we need to specify ourselves
	# # # #
	
	centered_groupLogVariance = 0
	
	# Xc: condition-level predictor matrix
	, Xc = (
		Xc_with_vars
		%>% select(contrasts)
		%>% unnest(contrasts)
		%>% as.matrix()
	)

	# iXc: which individual is associated with each row in Xc
	, iXc = as.numeric(factor(Xc_with_vars$id))
	
	# Y: observations
	, Y = dat$y
	
	# yXc: which row in Xc is associated with each observation in Y
	, yXc = yXc
	
	# # # #
	# Entries computable from the above
	# # # #
	
	# nXc: number of cols in the condition-level predictor matrix
	, nXc = ncol(Xc)
	
	# rXc: number of rows in the condition-level predictor matrix
	, rXc = nrow(Xc)
	
	# nI: number of individuals
	, nI = max(iXc)
	
	# nY: num entries in the observation vectors
	, nY = length(Y)
	
)
glimpse(data_for_stan)

# compile & sample ----
code_path = 'stan/hierarchical_pooledLogVariances.stan'
mod = cmdstanr::cmdstan_model(
	stan_file = code_path
)
post = mod$sample(
	data = data_for_stan
	, chains = parallel::detectCores()/2
	, parallel_chains = parallel::detectCores()/2
	, iter_warmup = 1e4
	, iter_sampling = 1e3
)

# viz ----
np = bayesplot::nuts_params(post)
(
	c(
		# a manageable subset of parameters (see the transformed parameters block for more)
		'noise'
		, 'groupMean_locat_coef'
		, 'mean_groupLogVariance_locat_coef'
		, 'sd_groupLogVariance_locat_coef'
		, 'groupLogVariance_locat_coef'
	)
	%>% post$draws()
	%>% bayesplot::mcmc_pairs(np=np,off_diag_fun='hex')
)

And here’s the pairs plot for the extreme scenario of num_ids = 2 & num_replicates = 2:

So there are obviously some funnels in there; I think those explain the divergences (divergences should occur throughout the funnel, not just in the neck, right?). How to reparameterize to avoid the funnel is still stumping me, though…

The resolution here isn’t totally obvious to me, but some of those funnels look like they could be the culprit. One option would be to center/non-center each coefficient individually, based on how well informed each of them is by the likelihood.
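
Something like the following sketch is what I have in mind; the names here (centered, raw_coef, indiv_coef) and the data-supplied 0/1 flags are just illustrative:

data{
	int<lower=1> nXc ;
	int<lower=1> nI ;
	// centered: per-coefficient flags, 1 = centered, 0 = non-centered
	array[nXc] int<lower=0,upper=1> centered ;
	...
}
parameters{
	vector[nXc] mu ;
	vector<lower=0>[nXc] sigma ;
	// raw_coef: raw per-coefficient quantities, interpreted according to the centered flags
	matrix[nXc,nI] raw_coef ;
	...
}
transformed parameters{
	matrix[nXc,nI] indiv_coef ;
	for(x in 1:nXc){
		// centered rows are the coefficients themselves; non-centered rows are scaled std-normal helpers
		indiv_coef[x] = centered[x] ? raw_coef[x] : (mu[x] + sigma[x] * raw_coef[x]) ;
	}
}
model{
	for(x in 1:nXc){
		if(centered[x]==1){
			raw_coef[x] ~ normal(mu[x], sigma[x]) ;
		}else{
			raw_coef[x] ~ std_normal() ;
		}
	}
	...
}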

That being said, the example that you show with num_ids = 2 and num_replicates = 2 – is this a case where you have only two individuals per group and only two groups? If that’s true, then in this case I feel like there’s so little data that it’s hard to avoid divergences without really strong prior information. Trying to estimate multiple variances from so little data is bound to introduce problematic posteriors. I note that in your code comments you say:

# if one is 100 and the other is 10, no divergences

Having no divergences in the large case of 100 individuals and 10 replicates makes sense, since the likelihood would be well informed. I know this is just simulated data, but a more informative test might be to simulate data close in size to the experimental system you are actually modeling. Some combination of a non-extreme amount of data, more informative priors in the less-informed cases, and targeted reparameterization where particular parameters are weakly/strongly informed by the likelihood would likely get you closer to a well-posed model.

Ha, I just returned to note precisely this after lots more playing. That is, I simplified things way down to just the classic Neal’s funnel example:

data{
	int<lower=2> nI ;
	int<lower=nI> nY ;
	vector[nY] Y ;
	array[nY] int<lower=1,upper=nI> iY ;
	int<lower=0,upper=1> centered ;
	int<lower=0,upper=1> prior_only ;
}
parameters{
	real<lower=0> noise ;
	real mu ;
	real<lower=0> sigma ;
	vector<
		offset = (centered ? 0 : mu)
		, multiplier = (centered ? 1 : sigma)
    >[nI] imu ;
}
model{
	noise ~ weibull(2,1) ;
	mu ~ std_normal();
	sigma ~ std_normal();
	imu ~ normal(mu,sigma) ;
	target += (prior_only ? 0 : normal_lupdf(Y|imu[iY],noise)) ;
}

And with nI==2 & nY==4 (so, two replicates per individual), neither centered nor non-centered parameterization comes out consistently divergence-free. Adding a zero-avoiding prior on sigma might help, but I’m still working out my confidence in that assertion.
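
By “zero-avoiding prior” I mean swapping the half-normal on sigma for something whose density goes to zero at sigma = 0, for example (the particular shape/rate and lognormal values here are arbitrary placeholders):

model{
	...
	// zero-avoiding alternatives to `sigma ~ std_normal() ;`
	sigma ~ gamma(2, 2) ;        // density -> 0 as sigma -> 0
	// sigma ~ lognormal(0, 1) ; // another zero-avoiding option
	...
}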


I would try fixing sd_groupLogVariance_locat_coef to see if that squashes the divergences. This could tell you something about whether the divergences arise from funnel-like behavior of the group log-variances (if fixing the sd eliminates the divergences) or from non-identifiabilities between the mean and the offsets (if fixing the sd does not eliminate them).
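
For example, a quick way to do that is to move the hyper-parameter from parameters to data and pass in a constant (a minimal sketch; the particular fixed value is up to you):

data{
	...
	// sd of the pooled log-variances, now fixed and supplied from R (e.g. 1.0)
	real<lower=0> sd_groupLogVariance_locat_coef ;
}
parameters{
	// (delete the sd_groupLogVariance_locat_coef declaration here and drop its
	//  `~ std_normal()` prior from the model block; the offset/multiplier
	//  declaration for groupLogVariance_locat_coef can still reference the data value)
	...
}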


Just a brief update on this topic. After a lot of playing, I feel like I understand things a lot better. As others intuited, my low-data simulations were pushing even non-centered parameterizations beyond their capacity. But I also realized something conceptual that might be useful to others: partial-pooling of hierarchical quantities can actually be achieved in two ways, either through a normal-variate approach or through a regression approach.

As a pared-down example, consider a multivariate hierarchical model where we assume unit variances and identity correlations but maintain inference on the mean of the individual-level coefficient distribution:

transformed data{
	matrix[nX,nX] COV = diag_matrix(rep_vector(1.0,nX)) ;
}
parameters{
	vector[nX] mu ;
	matrix[nI,nX] indiv ;
	... // possibly more parameters associated with likelihood
}
model{
	mu ~ std_normal() ;
	for(i_nI in 1:nI){
		indiv[i_nI] ~ multi_normal( mu, COV ) ;
	}
	... // more terms here leading to likelihood
}

In that model, the elements of mu are unpooled (they don’t mutually inform). Alternatively, we can partially-pool via normal-variates, which would be implemented as follows (omitting unchanged parts with ...):

...
parameters{
	real mu_mean ;
	real<lower=0> mu_sd ;
	...
}
model{
	mu ~ normal(mu_mean,mu_sd) ;
	...
}

But I realized this morning that an alternative way to achieve something like partial-pooling is via a regression structure, and indeed this is much more common in practice:

...
parameters{
	vector[nX] mu_coef ;
	...
}
transformed parameters{
	// X: contrast matrix supplied as data (assumed [nX,nX] here)
	vector[nX] mu = X * mu_coef ;
}
model{
	mu_coef ~ std_normal() ;
	...
}

That is, when X is a data variable encoding a contrast matrix with an intercept column and otherwise-orthogonal contrasts, the elements of mu will mutually inform through this structure (see the toy example below). And in my playing so far, this route to mutual information seems a bit more robust in weakly-informed-data scenarios than the normal-variate approach.
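
As a toy example of what I mean by X: for a single two-level factor coded with an intercept plus a sum contrast, X would be the 2x2 matrix below (hard-coded in transformed data purely for illustration):

transformed data{
	// toy contrast matrix: rows = conditions (a, b), columns = (intercept, a-vs-b contrast)
	matrix[2,2] X = [ [1, 1], [1, -1] ] ;
}
parameters{
	vector[2] mu_coef ;
	...
}
transformed parameters{
	// mu[1] = mu_coef[1] + mu_coef[2]   (condition a)
	// mu[2] = mu_coef[1] - mu_coef[2]   (condition b)
	vector[2] mu = X * mu_coef ;
}

Because both conditions share the intercept coefficient, shrinkage applied to mu_coef induces mutual information between the condition-level means.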

Now, expanding things a bit to add inference on not just the latent/hierarchical means but also on the latent/hierarchical variabilities (but still assuming identity correlation), what I typically see is:

...
parameters{
	vector[nX] mu_coef ;
	vector<lower=0>[nX] sigma ;
	...
}
transformed parameters{
	vector[nX] mu = X * mu_coef ;
}
model{
	mu_coef ~ std_normal() ;
	sigma ~ std_normal();
	for(i_nI in 1:nI){
		indiv[i_nI] ~ multi_normal( mu, diag_matrix(square(sigma)) ) ;
	}
	...
}

That is, while there is a kind of partial-pooling for mu through the regression structure, sigma is left fully unpooled. This thread started with my explorations of the normal-variate pooling of sigma:

...
parameters{
	vector[nX] mu_coef ;
	real psi_mean ;
	real<lower=0> psi_sd ;
	vector[nX] psi ; // log-variance 
	...
}
transformed parameters{
	vector[nX] mu = X * mu_coef ;
	vector[nX] sigma = sqrt(exp( psi )) ;
}
model{
	mu_coef ~ std_normal() ;
	psi_mean ~ std_normal() ;
	psi_sd ~ std_normal() ;
	psi ~ normal(psi_mean,psi_sd) ;
	for(i_nI in 1:nI){
		indiv[i_nI] ~ multi_normal( mu, diag_matrix(square(sigma)) ) ;
	}
	...
}

But in my playing thus far, I think the regression approach would be more robust:

...
parameters{
	vector[nX] mu_coef ;
	vector[nX] psi_coef ;
	...
}
transformed parameters{
	vector[nX] mu = X * mu_coef ;
	vector[nX] psi = X * psi_coef ;
	vector[nX] sigma = sqrt(exp( psi )) ;
}
model{
	mu_coef ~ std_normal() ;
	psi_coef ~ std_normal() ;
	for(i_nI in 1:nI){
		indiv[i_nI] ~ multi_normal( mu, diag_matrix(square(sigma)) ) ;
	}
	...
}

Bah, never mind my seemingly confident conclusions from the last update; I belatedly realize that the typical hierarchical model is:

...
parameters{
	vector[nX] coef_mu ;
	vector<lower=0>[nX] coef_sigma ;
	matrix[nX,nI] indiv_coef ;
}
model{
	coef_mu ~ std_normal() ;
	coef_sigma ~ std_normal() ;
	for(i_nI in 1:nI){
		indiv_coef[,i_nI] ~ multi_normal( coef_mu, diag_matrix(square(coef_sigma)) ) ;
	}
	// X: contrast matrix supplied as data (assumed [nX,nX] here)
	matrix[nX,nI] indiv = X * indiv_coef ;
	...
}

That is, the multi-normal is still over coefficients, with an unpooled mean & variance for each coefficient, and it’s the individual-level coefficients that then get multiplied by the contrast matrix to yield each individual’s parameters for each condition implied by the contrasts. So I’m back to the beginning: while it may make sense to do normal-variate partial-pooling of the coefficient variances, that would only pay off if one had lots of coefficients.