Collecting identifiability constraint strategies for latent/hierarchical SEMs

So I’ve been playing with some complicated models for hierarchical data and need two kinds of SEM-like structures for some latent quantities, which I’ve boiled down to the following two separate examples. First, here’s a structure whereby the two latent quantities l1 & l2 are linked by a correlation r:

// sem_l1_l2.stan
data{
	int nIds ;
	int nReps ;
	matrix[nIds,nReps] y1 ;
	matrix[nIds,nReps] y2 ;
}
parameters{
	row_vector[2] means ;
	row_vector<lower=0>[2] sds ;
	vector[nIds] common_stdnorm ;
	matrix[nIds,2] unique_stdnorm ;
	real<lower=-1,upper=1> r ;
	real<lower=0> noise ;
}
transformed parameters{
	// generate l1_stdnorm & l2_stdnorm such that both have unit variance and cor(l1_stdnorm,l2_stdnorm)==r
	// use abs(r) for the weight so the sqrt stays defined when r<0, and carry the sign of r
	// on l2's loading on the common factor (implied cor is then sign(r)*abs(r) == r)
	real w = sqrt(abs(r)/(1-abs(r))) ;
	real r_sign = (r < 0) ? -1 : 1 ;
	vector[nIds] l1_stdnorm = (common_stdnorm*w + unique_stdnorm[,1]) / sqrt( 1+(w^2) ) ;
	vector[nIds] l2_stdnorm = (r_sign*common_stdnorm*w + unique_stdnorm[,2]) / sqrt( 1+(w^2) ) ;
	// scale & shift stdnorms
	vector[nIds] l1 = l1_stdnorm*sds[1] + means[1] ;
	vector[nIds] l2 = l2_stdnorm*sds[2] + means[2] ;
}
model{
	// prior for latent means/sds & manifest noise
	means ~ std_normal() ; // would want to tweak this to match domain expertise
	sds ~ weibull(2,1) ; // ditto
	noise ~ weibull(2,1) ; // ditto

	// common & unique must be ~ std_normal()
	common_stdnorm ~ std_normal() ;
	to_vector(unique_stdnorm) ~ std_normal() ;

	// r ~ uniform(-1,1) ; // commented-out bc implied by bounds

	// likelihood
	for(i_nId in 1:nIds){
		y1[i_nId,] ~ normal( l1[i_nId] , noise ) ;
		y2[i_nId,] ~ normal( l2[i_nId] , noise ) ;
	}

}

And next is a scenario where the quantity l0 and the columns of l123 are linked through the correlations in the row_vector r:

// sem_l0_l123.stan
data{
	int nIds ;
	int nReps ;
	matrix[nIds,nReps] y1 ;
	matrix[nIds,nReps] y2 ;
	matrix[nIds,nReps] y3 ;
}
parameters{
	row_vector[3] means ;
	row_vector<lower=0>[3] sds ;
	vector[nIds] l0_stdnorm ;
	matrix[nIds,3] l123_unique_stdnorm ;
	row_vector<lower=-1,upper=1>[3] r ;
	real<lower=0> noise ;
}
transformed parameters{
	// generate l123_stdnorm such that all columns have unit variance and
	//    cor(l0_stdnorm,l123_stdnorm[,1])==r[1]
	//    cor(l0_stdnorm,l123_stdnorm[,2])==r[2]
	//    cor(l0_stdnorm,l123_stdnorm[,3])==r[3]
	row_vector[3] vars_common = r.^2 ;
	row_vector[3] vars_unique = 1-vars_common ;
	matrix[nIds,3] l123_stdnorm = (
		(
			( rep_matrix(l0_stdnorm,3) .* rep_matrix(r,nIds) )
			+ ( l123_unique_stdnorm .* rep_matrix(sqrt(vars_unique),nIds) )
		)
		./ rep_matrix( sqrt(vars_unique+vars_common) ,nIds)
	) ;
	// scale & shift stdnorms
	matrix[nIds,3] l123 = (
		l123_stdnorm .* rep_matrix(sds,nIds)
		+ rep_matrix(means,nIds)
	);
}
model{
	// prior for latent means/sds & manifest noise
	means ~ std_normal() ; // would want to tweak this to match domain expertise
	sds ~ weibull(2,1) ; // ditto
	noise ~ weibull(2,1) ; // ditto

	// common & unique must be ~ std_normal()
	l0_stdnorm ~ std_normal() ;
	to_vector(l123_unique_stdnorm) ~ std_normal() ;

	// r ~ uniform(-1,1) ; // commented-out bc implied by bounds

	// likelihood
	for(i_nId in 1:nIds){
		y1[i_nId,] ~ normal( l123[i_nId,1] , noise ) ;
		y2[i_nId,] ~ normal( l123[i_nId,2] , noise ) ;
		y3[i_nId,] ~ normal( l123[i_nId,3] , noise ) ;
	}

}

Now, as coded, there are going to be sampling issues with both these toy models: if the true generating process has a non-zero value for any r, the likelihood will be bimodal, with peaks at +r & -r, since the latent common factor and its loadings can jointly flip sign while leaving the likelihood unchanged.

In light of this, I know folks usually undertake one or another strategy for achieving better identifiability, and I wanted to make sure I knew all the options. I think they are:

  1. Don’t change the model; instead, hand-process the posterior to check for and flip one mode of any bimodal posteriors.
  2. For the sem_l0_l123.stan model, constrain one of the correlations in r to be positive-only (see the sketch just below this list).
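
For concreteness, here’s roughly what option 2 could look like for sem_l0_l123.stan; only the declaration of r changes (split here into a positive-constrained scalar r1 and an unconstrained row_vector r23, names of my choosing), plus one line at the top of transformed parameters to reassemble r:

// sem_l0_l123_opt2.stan (sketch; everything not shown is identical to sem_l0_l123.stan above)
parameters{
	row_vector[3] means ;
	row_vector<lower=0>[3] sds ;
	vector[nIds] l0_stdnorm ;
	matrix[nIds,3] l123_unique_stdnorm ;
	real<lower=0,upper=1> r1 ;            // first correlation constrained to be positive
	row_vector<lower=-1,upper=1>[2] r23 ; // remaining correlations keep both signs
	real<lower=0> noise ;
}
transformed parameters{
	// reassemble r so the rest of the transformed parameters & model blocks can stay unchanged
	row_vector[3] r = append_col(r1, r23) ;
	// ... vars_common, vars_unique, l123_stdnorm & l123 exactly as in sem_l0_l123.stan
}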

Are there any other options?

I tend not to like 1 because it assumes the sampler will behave well despite the bimodality of the posterior, which we know HMC probably won’t (though admittedly other samplers might).

I also tend not to like 2, as it only really feels like it “works” well when there’s strong domain expertise that at least one identified correlation will be non-zero; without such prior knowledge you can end up choosing to positively constrain a correlation that is actually zero or nearly zero, leaving its posterior artificially truncated and (if it’s genuinely zero) not actually eliminating the bimodality for the other correlations.

p.s. I know there are tricks to marginalize out the latent quantities like common_stdnorm & l0_stdnorm, which I presume would also resolve the identifiability issues noted above, but I think the various other structures in my non-toy model genuinely need direct inference on those latents.
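
For completeness, here’s a rough, untested sketch of what that marginalization could look like for the first toy model (the file name and the per-id stacking are my own choices); in this form r only enters through the implied 2x2 latent covariance:

// sem_l1_l2_marginal.stan (sketch; l1 & l2 integrated out analytically)
data{
	int nIds ;
	int nReps ;
	matrix[nIds,nReps] y1 ;
	matrix[nIds,nReps] y2 ;
}
transformed data{
	// stack each id's observations into one long vector: (y1[i,1..nReps], y2[i,1..nReps])
	array[nIds] vector[2*nReps] y ;
	for(i_nId in 1:nIds){
		y[i_nId] = to_vector(append_col(y1[i_nId,], y2[i_nId,])) ;
	}
}
parameters{
	row_vector[2] means ;
	row_vector<lower=0>[2] sds ;
	real<lower=-1,upper=1> r ;
	real<lower=0> noise ;
}
transformed parameters{
	// implied 2x2 covariance of (l1,l2); r enters directly here
	matrix[2,2] latent_cov ;
	latent_cov[1,1] = square(sds[1]) ;
	latent_cov[2,2] = square(sds[2]) ;
	latent_cov[1,2] = r*sds[1]*sds[2] ;
	latent_cov[2,1] = latent_cov[1,2] ;
}
model{
	vector[2*nReps] mu ;
	matrix[2*nReps,2*nReps] obs_cov ;

	// prior for latent means/sds & manifest noise (same as above)
	means ~ std_normal() ;
	sds ~ weibull(2,1) ;
	noise ~ weibull(2,1) ;

	// marginal mean & covariance of each id's stacked observations:
	// the latent covariance repeated across reps, plus independent measurement noise on the diagonal
	for(a in 1:2){
		mu[((a-1)*nReps+1):(a*nReps)] = rep_vector(means[a], nReps) ;
		for(b in 1:2){
			obs_cov[((a-1)*nReps+1):(a*nReps), ((b-1)*nReps+1):(b*nReps)]
				= rep_matrix(latent_cov[a,b], nReps, nReps) ;
		}
	}
	obs_cov = add_diag(obs_cov, square(noise)) ;

	// likelihood (l1 & l2 integrated out)
	y ~ multi_normal(mu, obs_cov) ;
}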


@spinkney


I think those are the main options. Related to your #2, I’ve seen many people use priors that only have support on positive numbers. The identification issue goes away, but then you can’t have posteriors that overlap with 0.
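
In the sem_l0_l123.stan model above, that would amount to something like the following change to the declaration of r (the beta(2,2) is only a placeholder example of a prior with positive support):

	row_vector<lower=0,upper=1>[3] r ; // positive support only; implied flat prior is then uniform(0,1)
	// and optionally, in the model block, an explicit prior such as:
	// r ~ beta(2,2) ;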

I think your #1 is tricky when there are multiple bimodal parameters, because the sign of one parameter can influence the signs of the others, so you wouldn’t want to flip each bimodal parameter independently. Related to #2, this paper describes the need for a sign-constrained parameter to achieve identification in factor analysis.
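
To make the “flip jointly” point concrete, one way to express it for the sem_l0_l123.stan model is a generated quantities block like the sketch below (r_aligned / l0_aligned are just illustrative names, and anchoring on the sign of r[1] is an arbitrary choice). Note this is purely post-hoc relabelling of each draw; it doesn’t help the sampler traverse the bimodal posterior in the first place.

generated quantities{
	// flip the whole coupled block (l0_stdnorm and every element of r) together,
	// here so that r[1] ends up non-negative in every draw
	real flip = (r[1] < 0) ? -1 : 1 ;
	row_vector[3] r_aligned = flip * r ;
	vector[nIds] l0_aligned = flip * l0_stdnorm ;
}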

For many of these sorts of models, an identifiability issue remains even when you marginalize the latent variables. In the sem_l0_l123 example, for instance, the marginal covariances between the columns of l123 only involve products like r[1]*r[2], which are unchanged when every sign flips.
