Questions on Merkle et al. preprints on Efficient Bayesian SEM in Stan

I just saw the Efficient Bayesian Structural Equation Modeling in Stan preprint by Ed Merkle, @bgoodri, and colleagues. This looks great! Section 4 discusses some issues with specifying latent variable models in Stan that I have struggled with for a long time.

If possible, I am hoping to get some more general clarification on these issues. I suppose this post could also serve others who want to raise similar questions. My questions are:

  1. I had tried to set up individual priors for freely estimated latent covariance parameters, manually construct the covariance matrix, and then let Stan reject values that lead to non-positive-definiteness, as Section 4.1 suggests (a stripped-down sketch of what I mean is below this list). However, in my case the model often fails to initialize after rejecting the initial covariance values the maximum number of times. If the model is set up correctly, would this issue simply be due to how informative the priors are, as the paper's discussion suggests?

  2. For the sign-switching issue of the latent variable loadings, the paper suggests post-processing the MCMC samples by fixing one loading (say \lambda^*) to always be positive and adjusting the other related parameter samples according to the sign of the sampled \lambda^*. I am curious, for a latent covariance parameter (say \psi_{AB}), does it matter which latent variable's \lambda^* this covariance parameter's sign is based on? E.g., for a two-factor latent variable model, should the sign of the factor covariance parameter \psi_{AB} be based on the sampled values of factor A's fixed loading, \lambda^*_A, or those of factor B's, \lambda^*_B? I am also curious whether there can be more discussion on how "legal" such post-processing is in terms of MCMC parameter inference. In the past I had resorted to constraining all loadings to be positive, which is much less elegant.
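For reference, here is the kind of setup I mean for question 1, stripped down to two factors (all names are just placeholders, not the paper's code):

```stan
parameters {
  vector<lower=0>[2] psi_sd;  // latent SDs, estimated freely
  real psi_12;                // free latent covariance with its own prior
  // ... loadings, intercepts, residual parameters, etc.
}
transformed parameters {
  matrix[2, 2] Psi;
  Psi[1, 1] = square(psi_sd[1]);
  Psi[2, 2] = square(psi_sd[2]);
  Psi[1, 2] = psi_12;
  Psi[2, 1] = psi_12;
}
model {
  psi_sd ~ student_t(3, 0, 1);
  psi_12 ~ normal(0, 1);
  // reject draws (including initial values) that give a non-PD matrix
  if (min(eigenvalues_sym(Psi)) <= 0)
    reject("Psi is not positive definite");
  // ... measurement-model likelihood using Psi ...
}
```

With Stan's default random initialization, that reject statement fires fairly often for me, which is where the repeated initialization failures come from.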

Let me know if these descriptions are not clear. Thanks for any input from the forum!

Best,
Marco

3 Likes

For 1.), I imagine you need to use a different method for estimating the covariances [e.g., an LKJ correlation prior + freely estimated SDs], or you need to implement a method that guarantees the PD constraint [this is much harder].
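For concreteness, a minimal sketch of that first option (K and all the names are just placeholders):

```stan
data {
  int<lower=1> K;  // number of latent factors
}
parameters {
  cholesky_factor_corr[K] L_Omega;  // Cholesky factor of the latent correlation matrix
  vector<lower=0>[K] tau;           // latent SDs
}
transformed parameters {
  // Cholesky factor of Psi = diag(tau) * Omega * diag(tau)
  matrix[K, K] L_Psi = diag_pre_multiply(tau, L_Omega);
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ student_t(3, 0, 1);
  // ... use L_Psi (e.g., via multi_normal_cholesky) in the likelihood ...
}
```

Every draw is PD by construction, so no rejection step (and no initialization failure) is needed.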
For 2.), my guess is that they go factor by factor in a loop of sorts [their code may clarify this, but it's a lot of Stan code]. I think you'd need to say 'for each factor f, if the reference loading is < 0, then multiply the loadings of factor f by -1, and multiply all rows/columns of the covariances involving factor f by -1.' By the end of the loop, the signs should be swapped correctly. So the answer should be: both. If F1 is negatively scaled, and F2 is negatively scaled, and they positively correlate, then flipping both should yield a positive correlation. If only F2 is negatively scaled, then flipping F2 should yield a negative correlation.
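In generated quantities, that loop could look roughly like this (assuming a J x K loading matrix lambda, a K x K latent covariance Psi, and an integer array ref_idx giving each factor's reference indicator; these names are mine, not theirs):

```stan
generated quantities {
  matrix[J, K] lambda_out = lambda;  // loadings with signs fixed
  matrix[K, K] Psi_out = Psi;        // latent covariances with signs fixed
  for (f in 1:K) {
    if (lambda[ref_idx[f], f] < 0) {
      // flip every loading on factor f
      for (j in 1:J)
        lambda_out[j, f] = -lambda_out[j, f];
      // flip row f and column f of the latent covariance matrix;
      // the diagonal element (f, f) is flipped twice and keeps its sign
      for (g in 1:K) {
        Psi_out[f, g] = -Psi_out[f, g];
        Psi_out[g, f] = -Psi_out[g, f];
      }
    }
  }
}
```

That matches the "both" logic: if both factors flip, their covariance is flipped twice and keeps its sign; if only one flips, it changes sign once; and the variances on the diagonal are never affected.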

Edit: The 'legality' of such an approach isn't really a problem. Basically, SEMs are directionally unidentified. You can fix that aliasing by using loading constraints, or by just forcing the constraint after the fact in GQ. The GQ approach they use just sorta... "folds" the aliased posterior over the axis onto the other. Both should accomplish the same thing. There are some trade-offs to each approach, I assume, just computationally. The posteriors should have the same target distribution, though.

1 Like

Thanks Stephen! I agree that starting with an LKJ prior and building the SDs on top of it might make more sense. For the post-processing, I may just try it out to see how well it works. I am curious to see whether the non-identifiability affects estimation stability or convergence in larger models.

Personally, I have had issues with Stan when using GQ for post-processing sign flips. But I believe I only tried it twice; it worked well for our multivariate GARCH model, though.

I think it sort of depends on how large the loadings are. IIRC, the main issue was getting good warmup/adaptation when each chain could oscillate between signed solutions. Similarly, I've had issues with putting a constraint on a single loading: with a large number of indicators, the likelihood is not drastically impacted if the positively constrained loading is effectively zero while the rest oscillate between positive and negative.

In my personal experience, just imposing a positivity constraint on all non-cross-loadings works well. But it requires tinkering, because it's hard to define what is or isn't a cross-loading in a generic way.

So really there are three ways, each with its own pros and cons.

  1. Fix one loading per factor to be positive. [But, during sampling, two local modes may exist where those loadings are effectively zero and the rest can be positive or negative].
  2. Fix all loadings per factor to be positive, and reverse any reverse-scaled items. [But specifying what is or isn't a cross-loading, which can be positive or negative, is difficult in a generic interface].
  3. Allow loadings to be positive or negative, but impose the sign constraint via post-processing. [But, depending on the absolute values of the loadings, warmup can be a problem in my experience].

Again, I personally prefer 2), but I don't usually work with cross-loadings, so the major downside of that approach isn't an issue for me.
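At the declaration level, 2) is just a positivity constraint on the primary loadings (placeholder names again; reverse-scaled indicators are assumed to have been recoded in the data first):

```stan
data {
  int<lower=1> J_main;   // number of primary (non-cross) loadings
  int<lower=0> J_cross;  // number of cross-loadings
}
parameters {
  // Option 2: every primary loading constrained to be positive
  vector<lower=0>[J_main] lambda_main;
  // cross-loadings left unconstrained so they can take either sign
  vector[J_cross] lambda_cross;
  // Option 1 is the same idea, but with <lower=0> on only one
  // reference loading per factor
}
```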

1 Like

This isn't a good idea, even though blavaan currently does it. Your prior does not integrate to 1 (or another constant) over the subspace of values that yields a positive definite covariance matrix, so you are not drawing from the posterior distribution you intended.

The basic rule is that if you change the draws, you have to do so in a way that does not change target (what is called lp__ in the output), or that only changes it by a constant.

I've indeed had the same experience with these two approaches as Stephen described; it's nice to know that others have seen the same thing. My guess is that how well the GQ approach works depends on how well the chains can warm up even when local modes exist. I've seen chains warm up OK and settle into nicely parallel multiple solutions... I am guessing the GQ approach could work in that case. I'll try it out when there's a chance. Thanks again.

I see! That makes sense: flipping by a -1 scaling factor would at most change the log density by a constant term.
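Spelling that out for a standard CFA-style likelihood (my notation, not necessarily the paper's): the implied covariance is \Sigma = \Lambda \Psi \Lambda^T + \Theta, and flipping the signs of a set of factors amounts to \Lambda \to \Lambda S and \Psi \to S \Psi S, where S is diagonal with entries \pm 1. Then

(\Lambda S)(S \Psi S)(\Lambda S)^T + \Theta = \Lambda (S S) \Psi (S S) \Lambda^T + \Theta = \Lambda \Psi \Lambda^T + \Theta,

since S S = I. So the likelihood is unchanged by the flip, and with sign-symmetric priors on the loadings and latent covariances the target is unchanged as well.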