Faster CFA without data augmentation

I need to build a CFA model that will be then used as a component of a more complex model. Since the final model fairly big, I need the CFA to be as fast and efficient as possible.

I built two CFA models: one with data augmentation, and another that draws from the marginal (something like what blavaan does). Although the blavaan-type model should run faster, it’s like 2.5 times slower. Is there any way I could speed it up?

Thanks a lot for your help!

The marginal model:

data {
  int<lower=1> N; // n of individuals/observations
  int<lower=1> M; // n of factors/latent variables
  int<lower=1> P; // n of observed indicators
  vector[P] Ind[N]; // matrix[N,P] of indicators
  int<lower=0, upper=1> D_skeleton[P,M]; // which LVs are related to which indicators? 1 if yes, 0 if not
  int Dvec_length; // equal to sum(D_skeleton)
parameters {
  vector<lower=0>[Dvec_length] Dvec; // factor loadings
  vector<lower=0>[P] psi; // variance of indicators
  vector[P] d; // mean of indicators
  cholesky_factor_corr[M] L_Omega; // cholesky factor or the covariance of LVs. (Actually a correlation so the variance of the LVs = 1)
transformed parameters {
  matrix[P,M] D = Dvec_to_D(P, M, D_skeleton, Dvec); // convert to a matrix of factor loadings. See the function below.
  corr_matrix[M] Omega = multiply_lower_tri_self_transpose(L_Omega);
model {
  matrix[P,P] new_cov = quad_form(Omega, D') + diag_matrix(psi);
  Dvec ~ normal(1, 2);
  psi ~ inv_gamma(1,2);
  d ~ normal(0, 10);
  L_Omega ~ lkj_corr_cholesky(1);
  Ind ~ multi_normal(d, new_cov);

The mathematical model:

I_i = \delta + D \nu_i + \eta_i \quad \forall i
\nu_i \sim N_M(0, \Omega) (the latent variables)
\eta_i \sim N_P(0, \Psi) (the measurement errors)

So the reduced form model is:
I_i \sim N_P(\delta, D\Omega D^T + \Psi)

The Dvec_to_D function:

functions {
  // Transform a vector of factor loadings into a matrix with appropriate zero constraints
  // @param P the number of observed indicators 
  // @param M the number of factors/latent variables 
  // @param D_skeleton a matrix with 1s where factor loadings should be estimated, 0s otherwise
  // @param Dvec a vector with factor loadings
  // @returns a [P,M] matrix with zeroes where D_skeleton is equal to zero
  matrix Dvec_to_D(int P,
                   int M,
                   int[ , ] D_skeleton,
                   vector Dvec) {
    matrix[P,M] ret_matrix;

    int counter = 1;
    for (i in 1:P) {
      for (j in 1:M) {
        if (D_skeleton[i,j] == 1) {
          ret_matrix[i,j] = Dvec[counter];
          counter += 1;
        else {
          ret_matrix[i,j] = 0;

    return ret_matrix;
1 Like

It is good to see you got this working. I skimmed the code and didn’t immediately notice a major issue.

It would be helpful to know what values of N, M, and P you are using. If you have something like N=100, M=1, P=3, then I doubt you will see much difference and the data augmentation may even be faster. I think you would see more benefits of marginal as N and M increase.

One other thing is that, instead of raw speed, you might consider sampling efficiency (effective sample size per second).

You might consider the SEM approach of modelling the covariance matrix rather than the individual observations. Given sample size n, sample covariance matrix S, and model-implied covariance matrix \Sigma, the log-likelihood is proportional to \log L = -\frac{1}{2} n \left[ \log \lvert\Sigma\rvert + tr(S \Sigma^{-1}) \right]. See Hoyle 2012 “Handbook of Structural Equation Modeling”, page 21.

Alternatively, you could use the Wishart distribution to model the scatter matrix (Q below), from which the equation above is derived.

    int N; // Number of observations
    int P; // Number of indicators
    matrix[N,P] Y;  // Raw data
transformed data{
    matrix[N,P] Y_center;
    matrix[P,P] Y_scatter;

    for(p in 1:P){
        Y_center[,p] = Y[,p] - mean(Y[,p]);

    Y_scatter = Y_center' * Y_center;
    cov_matrix[P] Sigma;  // Model-implied covariance matrix
    Y_scatter ~ wishart(N, Sigma);

Neither of these model the mean structure, though I suspect it may speed up the estimation of the covariance structure.

Thanks for looking at my code. In the CFA component of my model, I’m expecting something like N = 2000, M = 5, P = 12. I’m not sure why I’m not seeing much advantage in terms of speed. Maybe it’s the matrix multiplication?

My full model is quite complex (see below) and, in a first try, the marginal version didn’t work well either. At least in that case the covariance is pretty convoluted.

The full model is an integrated choice and latent variable model. There are latent variables (factors) that enter a multinomial logit model. More details in this paper.

The choice component (u_{ij} is individual i's utility for alternative j described by K attributes, \beta_i is a vector with random partworths/preference weights to be estimated. The probability that i chooses j given weights and attributes is a logistic regression.)

\begin{align} \boldsymbol{u}_i &= \boldsymbol{\alpha} + \boldsymbol{X}_i \boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_i & J \times 1 \\ &= \boldsymbol{v}_i + \boldsymbol{\varepsilon}_i \\ \boldsymbol{\beta}_i &= \bar{\boldsymbol{\beta}} + \boldsymbol{\Gamma} \boldsymbol{x}_i^* + \boldsymbol{\xi}_i & K \times 1 \\ \varepsilon_{ij} &\sim \text{EV1}(0,1) \\ \boldsymbol{\xi}_i &\sim N(\textbf{0}, \Xi) \\ y_{ij} = 1 &\iff u_{ij} = \max_j \boldsymbol{u}_i \\ \therefore \Pr \left(y_{ij} = 1 \right) &= \frac{\exp{v_{ij}}}{\sum_{l=1}^J\exp{v_{il}}} = \text{MNL} \left( \boldsymbol{v}_i \right) \end{align}

The latent variable (CFA) component (d_i are demographics)

\begin{align} \boldsymbol{x}_i^* &= \boldsymbol{A d}_i + \boldsymbol{\nu}_i & M \times 1 \\ \boldsymbol{I}_i &= \boldsymbol{\delta} + \boldsymbol{D x}_i^* + \boldsymbol{\eta}_i & P \times 1 \\ \boldsymbol{\nu}_i &\sim N_M(\boldsymbol{0}, \boldsymbol{\Omega}) \\ \boldsymbol{\eta}_i &\sim N_P(\boldsymbol{0}, \boldsymbol{\Psi}) \end{align}

After some math, the generative process can be described by:

\begin{align} \boldsymbol{v}_i &\sim N_J\left( \boldsymbol{\alpha} + \boldsymbol{X}_i \bar{\boldsymbol{\beta}} + \boldsymbol{X}_i \boldsymbol{\Gamma} \boldsymbol{A d}_i, \boldsymbol{X}_i \boldsymbol{\Gamma} \boldsymbol{\Omega} \boldsymbol{\Gamma}^T \boldsymbol{X}_i^T + \boldsymbol{\Xi}\right)\\ \boldsymbol{I}_i &\sim N_P \left( \boldsymbol{\delta} + \boldsymbol{D A d}_i, \boldsymbol{D \Omega D^T} + \boldsymbol{\Psi} \right) \end{align}

The data-augmented process is a bit more standard with Gibbs-Sampling, but I’m sure HMC can deal better with this model. I’m just going to keep trying!

Edit: Mixed up one of the equations.

Thanks for your reply, Simon. Is there any way I could retrieve the values of the factor loadings (\Lambda in the book you mentioned)? I’d have only one known value. Sigma, but three unknowns (\Lambda, \Psi, \Theta).

By the way, thanks for pointing me towards Hoyle’s book. I don’t have much background in SEM/CFA and this looks very useful!

You should be able to substitute new_cov (in your code) in for Sigma.

I like the Hoyle book for the history and fundamentals of SEM. It gets pretty technical, but it clarified a lot for me. You might also find Bayesian Structural Equation Modeling (2021) by Depaoli helpful. It is more practical so a good supplement.

1 Like

Just to clarify, are you doing comparisons using the original code that you posted, or using this larger model with binary indicators? I have seen marginal to be advantageous with continuous observed variables, but the binary variables are a different beast.

You can avoid some computation by

matrix[P, P] new_cov = add_diag(multiply_lower_tri_self_transpose(D * L_Omega), psi);

The multi_normal_cholesky has derivatives so may be faster than multi_normal. Either way, each needs a Cholesky decomposition. In the multi_normal code an LDL’ decomposition is taken, which is a bit slower than LL’ but is more numerically stable. It’s worthwhile to test forming L_cov and passing that to multi_normal_cholesky.

matrix[P, P] L_cov = cholesky_decompose(new_cov); 
// you may need to add some jitter to the diagonal if non-PD warnings occur
// cholesky_decompose(add_diag(new_cov, rep_vector(1e-6, P));

Ind ~ multi_normal_cholesky(d, L_cov);

The comparison was with respect to (synthetic) continuous data using the model I posted. I coded the larger model with binary indicators, but it was so slow I didn’t even wait for it to finish.

Thanks! This makes the code run a tiny bit better. I guess there could be a more significant improvement with larger matrices.

Could you also post the data-augmented model that matches the one from your original post? I guess it will be very similar…

Sure, here it is. Dvec_to_D is the function defined in the original post.

data {
  int<lower=1> N; // n of observations
  int<lower=1> M; // n of latent variables
  int<lower=1> P; // n of indicators
  matrix[N, P] Ind; // indicators
  int<lower=0, upper=2> D_skeleton[P,M]; // which LVs are related to which indicators
  int<lower=M> Dvec_length; // equal to sum(D_skeleton)

parameters {
  vector<lower=0>[Dvec_length] Dvec; // vector with factor loadings
  vector<lower=0>[P] psi;
  vector[P] delta;
  matrix[M,N] z; // // individual random effects
  cholesky_factor_corr[M] L_Omega; // cholesky factor of the covariance/correlation matrix 

transformed parameters {
  matrix[P,M] D = Dvec_to_D(P, M, D_skeleton, Dvec);
  matrix[N,M] LV = transpose(L_Omega * z);

model {
  matrix[N,P] Ind_mean = rep_matrix(delta', N) + LV * D';
  Dvec ~ normal(1, 2);
  psi ~ inv_gamma(2, 1);
  delta ~ normal(0, 10);
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(A) ~ normal(0, 10);
  for (i in 1:P) {
    Ind[,i] ~ normal(Ind_mean[,i], psi[i]);

add_diag() is good but I don’t think D * L_Omega is guaranteed to be lower triangular so you need to use tcrossprod() instead of multiply_lower_tri_self_transpose().

matrix[P, P] new_cov = add_diag(tcrossprod(D * L_Omega), psi);

Good catch!


If your objective is to run the model with binary (categorical) indicators. The models should not be written the same way as with continuous inidctaors. As you would be misspecifying the relation between the factor and the indicators, which would be linear with continuos and should be either probit or logistic with categorical.

Some options would be using IRT parameterication to specify the logistic relation, or the SEM method that approximates an underlying contininuous factor that represents the item response factor.

Hi Mauricio.

Thanks for pointing this out. The observed variables for the CFA component of my model are continuous, so it’s OK to use a Normal distribution for the marginal (reduced form) model. The categorical outcomes are captured in another component.

In the full model I wrote above, y_{ij} are categorical and I_{i} are continuous. The reduced form model, then, is I_{I} \sim N(\cdot) and y_{ij} \sim \text{softmax}(\cdot).

I wrote a script (below) to do a basic comparison of the two models using rstan 2.21.7. There are also links to the two models, which are minor modifications of the original models (without the suggestion of @spinkney). I am seeing speed and effective sample size advantages of marginal at smaller N. Speed gets similar at larger N, but effective sample size of marginal is still much larger than that of data augmentation.

The trick from @simonbrauer is worthwhile; blavaan also does that (since the past year or so) and it really helps for large N. If you need the constants of the mvn likelihood for that approach, see the Stan code in this post.

The comparison R script is below. You can play with it by changing the values of N,M, and pper. I’d be interested to hear if your results are different from mine.

tr_cond.stan (1.8 KB)
tr_marg.stan (1.9 KB)


N <- c(100, 2000)
M <- c(3, 9)
pper <- 3 ## indicators per factor

conds <- expand.grid(N = N, M = M, pper = pper)

mmarg <- stan_model('tr_marg.stan')
mcond <- stan_model('tr_cond.stan')

tmpR <- matrix(.4, max(pper), max(pper))
diag(tmpR) <- 1

res <- vector("list", nrow(conds))

savepars <- c("Dvec", "psi", "delta", "L_Omega") ## note: "d" in the marginal model is renamed
                                                 ##       "delta"

for(i in 1:nrow(conds)) {
  N <- conds[i,'N']
  M <- conds[i,'M']
  pper <- conds[i,'pper']

  datmat <- matrix(NA, N, M * pper)
  Dskel <- diag(M) %x% matrix(1, pper, 1)
  for(j in 1:M) {
    datmat[, (pper * (j-1) + 1):(pper * j)] <- rmvnorm(N, sigma = tmpR[1:pper, 1:pper])
  t1 <- system.time(r1 <- sampling(mmarg, data = list(N = N, M = M, P = pper * M, Ind = datmat,
                                                      D_skeleton = Dskel, Dvec_length = sum(Dskel)),
                                   warmup = 500, iter = 1000, pars = savepars))

  t2 <- system.time(r2 <- sampling(mcond, data = list(N = N, M = M, P = pper * M, Ind = datmat,
                                                      D_skeleton = Dskel, Dvec_length = sum(Dskel)),
                                   warmup = 500, iter = 1000, pars = savepars))

  res[[i]] <- list(r1 = summary(r1)[[1]], r2 = summary(r2)[[1]], t1 = t1, t2 = t2)

Thanks for clarifying this. Based on the 2021 article, I was under the impression that the “efficient” model was still at the observation level and used a multivariate normal distribution marginal across latent variables rather than sampling the latent variables for each observation. I understand that the model code is pre-compiled. Is the original Stan syntax available?

And thank you for your work on blavaan. It is a great tool.

1 Like

The Stan model that gets precompiled is on github here. You can see its history by going back to old commits.

Currently, for continuous data, the likelihood evaluation (marginal across latents) is based on the sample covariance matrix. Then we go back to the observation level in generated quantities for the waic/loo computations. For continuous data with missing values, we still use the sample covariance matrix by missing data pattern. I have not been able to get this sort of thing working for ordinal data, so the ordinal data likelihood is still at the observation level and is fairly slow.

1 Like

Thanks so much for doing this. I’m now sold on the marginal version of the CFA.

I ran basically the same script on Stan 2.21.7 using @spinkney’s suggestion and for N <- c(100, 500, 2000, 5000). When N is large, the marginal model tends to run faster. More importantly in my opinion, it had no issues with chains that do not mix (I guess that also helps with runtime?) and a much better effective sample size across the board.

Guess this solves my question! Thanks all for your input.

tr_cond.stan (1.8 KB)
tr_marg.stan (1.9 KB)