Non-convergence of latent variable model

I am having problems with model convergence even after increasing the number of iterations to as many as 50000. Rhat isn’t close to 1.
I have attached a copy of my code which models a CFA using the Holzinger Swineford data in lavaan.

data {
  int N; //number of subjects
  int Nf; // number of factors
  int ind; //number of between level indicators
  matrix[ind,Nf] l_p; // loading patterns
  vector[ind] Y[N]; // between level indicators. In example, only 6 indicators
  int <lower =1> Nf_nzero; //number of non zero loadings
}


parameters {
  matrix[2,Nf] Nf_ld; 
  //vector[ind] intercepts;
  cholesky_factor_corr[Nf] Nf_corr; 
  vector <lower = 0> [ind] sd_res; 
  vector <lower = 0> [Nf] Nf_var;
}

transformed parameters {
 
  row_vector[Nf] ones;
  matrix[3, Nf] ld_matrix;
  matrix[ind, Nf] loadings;
  vector[Nf_nzero] vec_load; 
  cov_matrix[ind] Sigma;
  cov_matrix[ind] theta_del;
  corr_matrix[Nf] omega;
  cov_matrix[Nf] Phi_nf;
  matrix[ind,ind] lamba_phi_lampha;
  vector[Nf] zeros;
  
  
  zeros = rep_vector(0, Nf);
  
  
  ones = rep_row_vector(1, Nf);
  ld_matrix = append_row(ones, Nf_ld); 
  vec_load = to_vector(ld_matrix);

   {
    int pos = 0;
    for (i in 1:ind){
      for (j in 1:Nf) {
        
        if(l_p[i,j] != 0) {
          pos = pos + 1;
          loadings[i,j] = vec_load[pos];
        } else {
          loadings[i,j] = 0;
        }
      }
    }
  }  
  
  omega = multiply_lower_tri_self_transpose(Nf_corr);
  
  Phi_nf = quad_form_diag(omega, Nf_var);
  
  lamba_phi_lampha = quad_form_sym(Phi_nf, loadings');
  
  theta_del = diag_matrix(sd_res);
  
  Sigma = lamba_phi_lampha + theta_del;
}
model {
  
   Y ~ multi_normal(rep_vector(0, ind), Sigma);
   
   //priors
  sd_res ~ cauchy(0,5); 
  Nf_var ~ cauchy(0,5);

  to_vector(Nf_ld) ~ normal(0,10);
  
  Nf_corr ~ lkj_corr_cholesky(1);
}

So I’m using the reference indicator approach to scale the latent variables i.e. the first loading of each latent variable is set to one.

Y \sim N(0,\Lambda\Phi\Lambda' + \theta_{\delta}) , where \Lambda represents my loadings, \Phi is the covariance matrix of the latent variables, and \theta_{\delta} is the variance of the residuals.

I will appreciate any assistance that can be provided.

I am having a hard time following your indexing, can try to follow it with more time.

But thought relevant to mentioned that Ed Merkle has develop the package blavaan, which runs SEM models based on lavaan syntax. This can be run in either Stan or JAGS. With a bunch of added features as similar to lavaan as possible.

As any package, does not work for every model yet, but it works well for the developed features, without convergence problems

blavaan

Also, you can ask blavaan to export the Stan/JAGS model syntax

Slow convergence probably means the posterior is multimodal or nonidentified. More informative priors may help. Normal prior would be better than cauchy (which is very wide). So I’d start by guesstimating what is the largest sd_res and Nf_var that could reasonably occur and set

sd_res ~ normal(0,max_sd/2); 
Nf_var ~ normal(0,max_var/2);

I am unable to use Ed Merkle’s marvelous blavaan package this time because I am trying to build a multilevel structural equation model, which blavaan unfortunately does not support at this time.

I see, didnt notice that it was a multilevel CFA. I would recommend you to look at Song and Lee book, they have code example for multilevel SEM in chapter 6. The code is in WinBUGS, but it can be translated into Stan.

Song, X.-Y., & Lee, S.-Y. (2012). Basic and Advanced Bayesian Structural Equation Modeling: With Applications in the Medical and Behavioral Sciences . West Sussex, United Kindom: John Wiley & Son, Ltd.

Also, another recommendation I would make is that you are ow writing the code in function of the covariance equations (which are faster) but I think for Multilevel would be easier to code it in function of the data model with data augmentation for the factor score (at least easier for how I code)

I dont think the Holzinger Swineford is a good example to work for this, as it only have 2 groups (might be too few for a mutilevel solution). I will try and replicate an output from the Mplus examples on Chapter 9: Multilevel Modeling with Complex Survey Data like example 9.6. This should be more straightforward, and having an output to compare to can help check the model

A couple comments/questions on your code,

You define the factors means at 0, but I dont see these being use later in the code.

Here you are defining the item intercepts to be fix at 0, they are not being estimated

What is the Rhat that you are having problems with, how large? for all parameters?

I alwyas prefer to scale the factors in function of a fix variance at 1 and fix means at 0, instead of using the marker variable. This gives the interpretation of the parameters in function of a standardized factor, instead of the chosen indicator

1 Like

Thanks for pointing that out. That is a mistake.

It looks like for the first freely estimated parameters of the loadings, sd_res, and Nf_var have large Rhats of 6.18, 14.90, and 14.57.
Here, I’m using a half_normal prior as suggested by @nhuurre.

I will try the fixed variance approach.

I am trying to first get the measurement model working before building on it to include the multilevel components. Another reason is I am new to using Stan, so I thought I could start with something I am more familiar with.

Thanks very much for the book reference. It should definitely help me with working on multilevel SEM data.

PS: I have realized that the model converges immediately with Rhat = 1 for all parameters when I constrain the loadings to be nonnegative. This, however presents a theoretical challenge.

I agree, this does presents theoretical limitations. As well as functional for constraining too much the distribution a priori. Now, it is a common approach to fix 1 factor loadings to be positive to define the direction of the factor, factor indeterminacy. This is what Ed does in blavaan

2 Likes

Thanks very much for your help.

This reminded me that we should add to help text a comment that if Rhats are very large and doesn’t change much when the number of iterations is doubled, then the posterior is very likely multimodal with well separated modes (might be also due to aliasing). If Rhats for each single chain are close to 1, then it’s even more likely to be multimodal with well separated modes each of which are easy to sample from. If Rhats for each single chain are not close to 1, then it’s likely there is identifiability issue. I don’t think we have yet this mentioned in help texts, but we do recommend looking at the effective sample size estimates when the number of iterations is increased in our paper https://arxiv.org/abs/1903.08008. ESS is also easier to interpret to diagnose multimodality as then ESS <= the number of chains.

6 Likes

A simple solution is to set the latent variable variances to 1. Sample the loadings unconstrained and create a copy of the loadings in the transformed parameters block. Select an item per factor as a marker item. Check factor by factor, if the marker item loading is negative, then multiply all the loadings on that factor by -1.

You will also have to make changes to the interfactor correlation matrix. So create a copy of this matrix in the transformed parameters. And also multiply the relevant correlations by -1. So if the loadings on factor 2 were multiplied by -1, then all the correlations to factor 2 should also be multiplied by -1. Remember to multiply row 2 and column 2 of the correlation matrix.

You’d have to do all this before creating lambda_phi_lambda. blavaan now does something like this.

2 Likes

Just to clarify, blavaan does this in the generated quantities block, instead of the transformed block. Right now blavaan has two different Stan models, one based on creating factor scores based on data augmentation (target=“stanclassic”), and the default precompiled now which does not estimate factor scores and works with the covariance matrix equations, based on LISREL parameterization (target=“stan”). In both method the parameters are signed corrected in the generated quantities block

You can export the Stan code for either method and see how this works on blavaan with the mcmcfile argument

HS.model <- ’ visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 ’

fit <- bcfa(HS.model, data = HolzingerSwineford1939, std.lv=T, mcmcfile =“default”, target=“stan”)
fit2 <- bcfa(HS.model, data = HolzingerSwineford1939, std.lv=T, mcmcfile =“augmentation”, target=“stanclassic”)

OP can make these changes in either location in their current code, which follows the LISREL measurement equation.

I would not expect OP to have much use for the exported blavaan Stan code as it is quite bulky, and difficult to follow in certain parts.

Agree, the blavaan exported Stan code, is bulky and can be difficult to follow. I still like to recommend these options, as its a good way to learn coding tricks by seeing code that works

As of the model, I worked and example model similar to the original model posted, with the corrected signed parameters in the generated quantities block. Some comments about the model, I estimated the model based on the data augmentation approach, which for me at least make it easier to extend for Multilevel CFA. But the log-likelihood for model comparison is estimated with the covariance matrix approach, so the marginal log-likelihood is used instead of the conditional, as recommended by

Merkle, E. C., Furr, D., & Rabe-Hesketh, S. (2019). Bayesian model assessment: Use of conditional vs marginal likelihoods. Psychometrika , 84 , 802–829.

data{
int N; // sample size
int P; // number of variables
int D; // number of dimensions
vector[P] X[N]; // data matrix of order [N,P]
int n_lam; // how many factor loadings are estimated
}

parameters{
vector[P] b; // intercepts
matrix[N,D] FS_UNC; // factor scores, matrix of order [N,D]
cholesky_factor_corr[D] L_corr_d; // cholesky correlation between factors
vector<lower=0>[P] sd_p; // residual sd for each variable
vector<lower=-30, upper=30>[n_lam] lam_UNC; // not positivily constraint 
}

transformed parameters{
vector[D] M; // factor means
vector<lower=0>[D] Sd_d; // sd for each factor
vector[P] mu_UNC[N]; // predicted values
cholesky_factor_cov[D] L_Sigma;

M = rep_vector(0, D);
Sd_d = rep_vector(1, D);

L_Sigma = diag_pre_multiply(Sd_d, L_corr_d);


for(i in 1:N){
mu_UNC[i,1] = b[1] + lam_UNC[1]*FS_UNC[i,1];
mu_UNC[i,2] = b[2] + lam_UNC[2]*FS_UNC[i,1];
mu_UNC[i,3] = b[3] + lam_UNC[3]*FS_UNC[i,1];
mu_UNC[i,4] = b[4] + lam_UNC[4]*FS_UNC[i,2];
mu_UNC[i,5] = b[5] + lam_UNC[5]*FS_UNC[i,2];
mu_UNC[i,6] = b[6] + lam_UNC[6]*FS_UNC[i,2];
mu_UNC[i,7] = b[7] + lam_UNC[7]*FS_UNC[i,3];
mu_UNC[i,8] = b[8] + lam_UNC[8]*FS_UNC[i,3];
mu_UNC[i,9] = b[9] + lam_UNC[9]*FS_UNC[i,3];
}

}

model{

L_corr_d ~ lkj_corr_cholesky(1);
b ~ normal(0, 10);
lam_UNC ~ normal(0, 10);
sd_p ~ cauchy(0,2.5);
///Sd_d ~ cauchy(0,2.5);

for(i in 1:N){
for(j in 1:P){
X[i,j] ~ normal(mu_UNC[i,j], sd_p[j]);
}
FS_UNC[i] ~ multi_normal_cholesky(M, L_Sigma);
}

}

generated quantities{
real log_lik[N]; ///// log likelihood matrix
real dev; /////// global deviance
real log_lik0; ///// global log likelihood
vector[N] log_lik_row;

cov_matrix[P] Sigma;
cov_matrix[D] Phi_lv;
matrix[P,P] lambda_phi_lambda;
cholesky_factor_cov[P] L_Sigma_model;
matrix[P,P] theta_del;

corr_matrix[D] Rho_UNC; /// correlation matrix
corr_matrix[D] Rho; /// correlation matrix
matrix[P,D] lam;  // factor loadings
matrix[N,D] FS; // factor scores, matrix of order [N,D]
///vector[P] mu[N]; // predicted values

/// signed corrected parameters
Rho_UNC = multiply_lower_tri_self_transpose(L_corr_d);
Rho = Rho_UNC;
FS = FS_UNC;
lam = rep_matrix(0, P, D);
lam[1:3,1] = to_vector(lam_UNC[1:3]);
lam[4:6,2] = to_vector(lam_UNC[4:6]);
lam[7:9,3] = to_vector(lam_UNC[7:9]);

// factor 1
if(lam_UNC[1] < 0){
  lam[1:3,1] = to_vector(-1*lam_UNC[1:3]);
  FS[,1] = to_vector(-1*FS_UNC[,1]);
  
  if(lam_UNC[4] > 0){
  Rho[1,2] = -1*Rho_UNC[1,2];
  Rho[2,1] = -1*Rho_UNC[2,1];
  }
  if(lam_UNC[7] > 0){
  Rho[1,3] = -1*Rho_UNC[1,3];
  Rho[3,1] = -1*Rho_UNC[3,1];
  }
}
// factor 2
if(lam_UNC[4] < 0){
  lam[4:6,2] = to_vector(-1*lam_UNC[4:6]);
  FS[,2] = to_vector(-1*FS_UNC[,2]);
  
  if(lam_UNC[1] > 0){
  Rho[2,1] = -1*Rho_UNC[2,1];
  Rho[1,2] = -1*Rho_UNC[1,2];
  }
  if(lam_UNC[7] > 0){
  Rho[2,3] = -1*Rho_UNC[2,3];
  Rho[3,2] = -1*Rho_UNC[3,2];
  }
}
// factor 3
if(lam_UNC[7] < 0){
  lam[7:9,3] = to_vector(-1*lam_UNC[7:9]);
  FS[,3] = to_vector(-1*FS_UNC[,3]);
  
  if(lam_UNC[1] > 0){
  Rho[3,1] = -1*Rho_UNC[3,1];
  Rho[1,3] = -1*Rho_UNC[1,3];
  }
  if(lam_UNC[4] > 0){
  Rho[3,2] = -1*Rho_UNC[3,2];
  Rho[2,3] = -1*Rho_UNC[2,3];
  }
}


/// marginal log-likelihood based on signed corrected parameters
Phi_lv = quad_form_diag(Rho, Sd_d);
lambda_phi_lambda = quad_form_sym(Phi_lv, transpose(lam));
theta_del = diag_matrix(sd_p);

Sigma = lambda_phi_lambda + theta_del;
L_Sigma_model = cholesky_decompose(Sigma);

for(i in 1:N){
log_lik[i] = multi_normal_cholesky_lpdf(X[i] | b, L_Sigma_model);
}

log_lik0 = sum(log_lik); // global log-likelihood
dev = -2*log_lik0; // model deviance

}

I compared this model with lavaan and blavaan, with simulated data


library(lavaan)
library(loo)
library(blavaan)
library(rstan)
#rstan_options(auto_write = TRUE)
#options(mc.cores = parallel::detectCores())

mod1 <-'
f1 =~ .5?y1+.5?y2+.5?y3 #+.1?y4 +.2?y8
f2 =~ .4?y4+.4?y5+.4?y6 #+.1?y1 +.2?y7
f3 =~ .6?y7+.6?y8+.6?y9 #+.1?y6 +.2?y3

f1 ~~ .3?f2+.4?f3
f2 ~~ .5?f3'

mod2 <-'
f1 =~ y1+y2+y3
f2 =~ y4+y5+y6
f3 =~ y7+y8+y9

f1 ~~ f2+f3
f2 ~~ f3'

datsim <- simulateData(mod1,sample.nobs=300,std.lv=T)
head(datsim)

fit <- cfa(mod2,data=datsim,std.lv=T,meanstructure=T)
summary(fit,standardized=T,fit.measures=T)


fitb <- bcfa(mod2,data=datsim,std.lv=T,target="stan")
summary(fitb,standardized=T)
fitMeasures(fitb)

###### run the Stan model
X <- as.matrix(datsim)
P <- ncol(datsim) 
N <- nrow(datsim)
D <- 3

dat <- list(N=N,P=P,D=D,X=X, n_lam=9)
param <- c("b","lam","Rho","sd_p","M","Sd_d",
           "log_lik0","dev","log_lik")# 

##### loop to check for convergence 
ta2 <- Sys.time()
iters <- 5000
keep <- 3000
rhat <- 20
while(rhat > 1.05){
  iters <- iters + 3000
    burn <- iters - keep
  
  OUT_st <- stan(data=dat, pars=param,
                 model_code=cfa_stan_fv,
                 chains=3, iter=iters,warmup=burn,
                 control=list(adapt_delta=.99, max_treedepth=20)) 
  
  ss <- summary(OUT_st)$summary
  print(rhat <- max(ss[,"Rhat"], na.rm=T))
}
tb2 <- Sys.time()
tb2 - ta2

OUT_st


print(OUT_st,digits=3, pars=c("b","lam","Rho","sd_p","M","Sd_d",
                              "log_lik0","dev"))

log_lik1 <- extract_log_lik(OUT_st, merge_chains = FALSE) 
(waic_cfa <- waic(log_lik1))
(loo_cfa <- loo(log_lik1, r_eff=relative_eff(exp(log_lik1))))

I teach a BSEM summer course, havent worked a Multilevel model yet, I will seek to develop a couple of useful examples from this. Its an interesting issue

2 Likes

This is a great example, thanks for posting this. Just trying to follow along here.

I would think that a different approach, in which one specifies the “first” and “remaining” factor scores separately in the parameters block (building on your example):

parameters {
vector<lower=0, upper=30>[3] lam_first;
vector<lower=-30, upper=30>[6] lam_remain;
...
}

and then puts the lambda vector together in the transformed parameters block:

transformed parameters {
vector[n_lam] lambda;
lambda[1] = lam_first[1];
lambda[4] = lam_first[2];
lambda[7] = lam_first[3];
lambda[2:3] = lam_remain[1:2];
lambda[5:6] = lam_remain[3:4];
lambda[8:9] = lam_remain[5:6];
...
}

should also work and make the sign-correction in the generated quantities block unnecessary. Or am I overlooking something?

I try this approach first. It kept giving me multimodal posteriors for the unconstrained lambdas. Trying to figure out why was taking me more time that I wanted to, so I switched to sign correction on the generated quantities block approach. This one worked well easier
I would be happy to find why it wasnt working

The reason is because you could either have loadings 1-J be positive, or loading 1 be positive and 2-J be negative, and it doesn’t affect the likelihood very much. In other words, you could flip the latent direction, have loadings 2-J be negative, and get the same prediction for (J-1)/J items. The only set of data that would suffer is for item 1; in that case, it could basically just push the loading down to zero (as though it ‘wants’ to be negative), and item 1 is basically just an intercept model. The more items you have, the more this can happen.

Imo, the best approach is to just constrain all loadings to be in the /intended/ direction. Or, just unreverse any reverse-scaled indicators, and use all positive loadings. If you’re concerned about method effects, add a method effect to any reversed items.
In my experience, constraining all loadings to be unidirectional (except for cross-loadings, I suppose), and inverting the responses to meet that, will give good Rhats and unimodality every time.

1 Like

Yes, I agree, this is the issue of factor interminancy I describe above. The issue is that constraining a factor loading to be positive fixes the issue in JAGS, but does not work in Stan. In Stan we have found that the best method is to estimate the model without constraints, and sign adjust the parameters in function a chosen factor loading. Like in the example I posted before

What I dont like about this is that you are already deciding that the posterior cannot cover negative values, which in may scenarios could be a constraint too strict

1 Like