Faster CFA without data augmentation

I need to build a CFA model that will then be used as a component of a more complex model. Since the final model is fairly big, I need the CFA to be as fast and efficient as possible.

I built two CFA models: one with data augmentation, and another that marginalizes over the latent variables (something like what blavaan does). Although the blavaan-type model should run faster, it is about 2.5 times slower. Is there any way I could speed it up?

Thanks a lot for your help!

The marginal model:

data {
  int<lower=1> N; // n of individuals/observations
  int<lower=1> M; // n of factors/latent variables
  int<lower=1> P; // n of observed indicators
  vector[P] Ind[N]; // observed indicators: array of N length-P vectors
  int<lower=0, upper=1> D_skeleton[P,M]; // which LVs are related to which indicators? 1 if yes, 0 if not
  int Dvec_length; // equal to sum(D_skeleton)
}
parameters {
  vector<lower=0>[Dvec_length] Dvec; // factor loadings
  vector<lower=0>[P] psi; // variance of indicators
  vector[P] d; // mean of indicators
  cholesky_factor_corr[M] L_Omega; // cholesky factor of the correlation matrix of the LVs (a correlation, so the LV variances = 1)
}
transformed parameters {
  matrix[P,M] D = Dvec_to_D(P, M, D_skeleton, Dvec); // convert to a matrix of factor loadings. See the function below.
  corr_matrix[M] Omega = multiply_lower_tri_self_transpose(L_Omega);
}
model {
  matrix[P,P] new_cov = quad_form(Omega, D') + diag_matrix(psi);
  
  Dvec ~ normal(1, 2);
  psi ~ inv_gamma(1,2);
  d ~ normal(0, 10);
  L_Omega ~ lkj_corr_cholesky(1);
  
  Ind ~ multi_normal(d, new_cov);
}

The mathematical model:

I_i = \delta + D \nu_i + \eta_i \quad \forall i
\nu_i \sim N_M(0, \Omega) (the latent variables)
\eta_i \sim N_P(0, \Psi) (the measurement errors)

So the reduced form model is:
I_i \sim N_P(\delta, D\Omega D^T + \Psi)
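
(This follows because \nu_i and \eta_i are independent, so E(I_i) = \delta and \mathrm{Var}(I_i) = D\,\mathrm{Var}(\nu_i)\,D^T + \mathrm{Var}(\eta_i) = D \Omega D^T + \Psi.)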


The Dvec_to_D function:

functions {
  // Transform a vector of factor loadings into a matrix with appropriate zero constraints
  // @param P the number of observed indicators 
  // @param M the number of factors/latent variables 
  // @param D_skeleton a matrix with 1s where factor loadings should be estimated, 0s otherwise
  // @param Dvec a vector with factor loadings
  // @returns a [P,M] matrix with zeroes where D_skeleton is equal to zero
  matrix Dvec_to_D(int P,
                   int M,
                   int[ , ] D_skeleton,
                   vector Dvec) {
    matrix[P,M] ret_matrix;

    int counter = 1;
    for (i in 1:P) {
      for (j in 1:M) {
        if (D_skeleton[i,j] == 1) {
          ret_matrix[i,j] = Dvec[counter];
          counter += 1;
        }
        else {
          ret_matrix[i,j] = 0;
        }
      }
    }

    return ret_matrix;
  }
}
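
For example (my illustration of the mapping): with P = 4, M = 2, and the skeleton below, Dvec has length 4 and is filled in row by row:

D\_skeleton = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} Dvec_1 & 0 \\ Dvec_2 & 0 \\ 0 & Dvec_3 \\ 0 & Dvec_4 \end{pmatrix}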

It is good to see you got this working. I skimmed the code and didn’t immediately notice a major issue.

It would be helpful to know what values of N, M, and P you are using. If you have something like N = 100, M = 1, P = 3, then I doubt you will see much difference, and data augmentation may even be faster. I think you would see more benefit from the marginal approach as N and M increase.

One other thing is that, instead of raw speed, you might consider sampling efficiency (effective sample size per second).

You might consider the SEM approach of modelling the covariance matrix rather than the individual observations. Given sample size n, sample covariance matrix S, and model-implied covariance matrix \Sigma, the log-likelihood is proportional to \log L = -\frac{1}{2} n \left[ \log \lvert\Sigma\rvert + tr(S \Sigma^{-1}) \right]. See Hoyle 2012 “Handbook of Structural Equation Modeling”, page 21.
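
With your new_cov and a sample covariance matrix S (both would need to be available in the model block; S could be computed in transformed data), the likelihood increment could look something like this sketch, up to an additive constant:

target += -0.5 * N * (log_determinant(new_cov) + trace(mdivide_left_spd(new_cov, S)));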

Alternatively, you could use the Wishart distribution to model the scatter matrix (Y_scatter in the code below), from which the equation above is derived.

data{
    int N; // Number of observations
    int P; // Number of indicators
    matrix[N,P] Y;  // Raw data
}
transformed data{
    matrix[N,P] Y_center;
    matrix[P,P] Y_scatter;

    for(p in 1:P){
        Y_center[,p] = Y[,p] - mean(Y[,p]);
    }

    Y_scatter = Y_center' * Y_center;
}
parameters{
    cov_matrix[P] Sigma;  // Model-implied covariance matrix
}
model{
    Y_scatter ~ wishart(N, Sigma);
}

Neither of these models the mean structure, though I suspect working with the covariance structure alone may speed up estimation.

Thanks for looking at my code. In the CFA component of my model, I’m expecting something like N = 2000, M = 5, P = 12. I’m not sure why I’m not seeing much advantage in terms of speed. Maybe it’s the matrix multiplication?

My full model is quite complex (see below), and in a first try the marginal version didn’t work well either, though in that case the covariance is pretty convoluted.


The full model is an integrated choice and latent variable model. There are latent variables (factors) that enter a multinomial logit model. More details in this paper.

The choice component (u_{ij} is individual i's utility for alternative j, which is described by K attributes; \beta_i is a vector of random partworths/preference weights to be estimated; the probability that i chooses j, given weights and attributes, is multinomial logit):

\begin{align} \boldsymbol{u}_i &= \boldsymbol{\alpha} + \boldsymbol{X}_i \boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_i & J \times 1 \\ &= \boldsymbol{v}_i + \boldsymbol{\varepsilon}_i \\ \boldsymbol{\beta}_i &= \bar{\boldsymbol{\beta}} + \boldsymbol{\Gamma} \boldsymbol{x}_i^* + \boldsymbol{\xi}_i & K \times 1 \\ \varepsilon_{ij} &\sim \text{EV1}(0,1) \\ \boldsymbol{\xi}_i &\sim N(\textbf{0}, \boldsymbol{\Xi}) \\ y_{ij} = 1 &\iff u_{ij} = \max_{l} u_{il} \\ \therefore \Pr \left(y_{ij} = 1 \right) &= \frac{\exp(v_{ij})}{\sum_{l=1}^J \exp(v_{il})} = \text{MNL} \left( \boldsymbol{v}_i \right) \end{align}

The latent variable (CFA) component (d_i are demographics)

\begin{align} \boldsymbol{x}_i^* &= \boldsymbol{A d}_i + \boldsymbol{\nu}_i & M \times 1 \\ \boldsymbol{I}_i &= \boldsymbol{\delta} + \boldsymbol{D x}_i^* + \boldsymbol{\eta}_i & P \times 1 \\ \boldsymbol{\nu}_i &\sim N_M(\boldsymbol{0}, \boldsymbol{\Omega}) \\ \boldsymbol{\eta}_i &\sim N_P(\boldsymbol{0}, \boldsymbol{\Psi}) \end{align}

After some math, the generative process can be described by:

\begin{align} \boldsymbol{v}_i &\sim N_J\left( \boldsymbol{\alpha} + \boldsymbol{X}_i \bar{\boldsymbol{\beta}} + \boldsymbol{X}_i \boldsymbol{\Gamma} \boldsymbol{A d}_i, \boldsymbol{X}_i \boldsymbol{\Gamma} \boldsymbol{\Omega} \boldsymbol{\Gamma}^T \boldsymbol{X}_i^T + \boldsymbol{\Xi}\right)\\ \boldsymbol{I}_i &\sim N_P \left( \boldsymbol{\delta} + \boldsymbol{D A d}_i, \boldsymbol{D \Omega D^T} + \boldsymbol{\Psi} \right) \end{align}
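
For the choice component, the per-individual covariance of \boldsymbol{v}_i could be built in Stan with something like the helper below (an untested sketch; utility_cov is just an illustrative name):

functions {
  // X_i * Gamma * Omega * Gamma' * X_i' + Xi  (the J x J covariance of v_i)
  matrix utility_cov(matrix X_i, matrix Gamma, matrix Omega, matrix Xi) {
    return quad_form_sym(quad_form_sym(Omega, Gamma'), X_i') + Xi;
  }
}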

Data augmentation is the more standard approach with Gibbs sampling, but I’m sure HMC can deal better with this model. I’m just going to keep trying!

Edit: Mixed up one of the equations.

Thanks for your reply, Simon. Is there any way I could retrieve the values of the factor loadings (\Lambda in the book you mentioned)? I’d have only one known quantity, \Sigma, but three unknowns (\Lambda, \Psi, \Theta).

By the way, thanks for pointing me towards Hoyle’s book. I don’t have much background in SEM/CFA and this looks very useful!

You should be able to substitute new_cov (in your code) in for Sigma.
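
That is, drop the cov_matrix[P] Sigma parameter and build the implied covariance from your CFA parameters instead, e.g. (untested sketch using the construction from your original post):

matrix[P, P] new_cov = quad_form(Omega, D') + diag_matrix(psi);
Y_scatter ~ wishart(N, new_cov);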

I like the Hoyle book for the history and fundamentals of SEM. It gets pretty technical, but it clarified a lot for me. You might also find Bayesian Structural Equation Modeling (2021) by Depaoli helpful. It is more practical, so it makes a good supplement.


Just to clarify, are you doing comparisons using the original code that you posted, or using this larger model with binary indicators? I have seen marginal to be advantageous with continuous observed variables, but the binary variables are a different beast.

You can avoid some computation by

matrix[P, P] new_cov = add_diag(multiply_lower_tri_self_transpose(D * L_Omega), psi);

multi_normal_cholesky has analytic derivatives, so it may be faster than multi_normal. Either way, each needs a Cholesky decomposition. In the multi_normal code an LDL’ decomposition is taken, which is a bit slower than LL’ but more numerically stable. It’s worthwhile to test forming L_cov and passing that to multi_normal_cholesky.

matrix[P, P] L_cov = cholesky_decompose(new_cov); 
// you may need to add some jitter to the diagonal if non-PD warnings occur
// cholesky_decompose(add_diag(new_cov, rep_vector(1e-6, P)));

Ind ~ multi_normal_cholesky(d, L_cov);

The comparison was with respect to (synthetic) continuous data using the model I posted. I coded the larger model with binary indicators, but it was so slow I didn’t even wait for it to finish.

Thanks! This makes the code run slightly faster. I guess the improvement could be more significant with larger matrices.

Could you also post the data-augmented model that matches the one from your original post? I guess it will be very similar…

Sure, here it is. Dvec_to_D is the function defined in the original post.

data {
  int<lower=1> N; // n of observations
  int<lower=1> M; // n of latent variables
  int<lower=1> P; // n of indicators
  matrix[N, P] Ind; // indicators
  int<lower=0, upper=1> D_skeleton[P,M]; // which LVs are related to which indicators
  int<lower=M> Dvec_length; // equal to sum(D_skeleton)
}

parameters {
  vector<lower=0>[Dvec_length] Dvec; // vector with factor loadings
  vector<lower=0>[P] psi;
  vector[P] delta;
  
  matrix[M,N] z; // individual random effects
  cholesky_factor_corr[M] L_Omega; // cholesky factor of the covariance/correlation matrix 
}

transformed parameters {
  matrix[P,M] D = Dvec_to_D(P, M, D_skeleton, Dvec);
  matrix[N,M] LV = transpose(L_Omega * z);
}

model {
  matrix[N,P] Ind_mean = rep_matrix(delta', N) + LV * D';
  
  Dvec ~ normal(1, 2);
  psi ~ inv_gamma(2, 1);
  delta ~ normal(0, 10);
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(4);
  
  for (i in 1:P) {
    Ind[,i] ~ normal(Ind_mean[,i], psi[i]);
  }
}

add_diag() is good but I don’t think D * L_Omega is guaranteed to be lower triangular so you need to use tcrossprod() instead of multiply_lower_tri_self_transpose().

matrix[P, P] new_cov = add_diag(tcrossprod(D * L_Omega), psi);
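
Combined with the earlier multi_normal_cholesky suggestion, the start of the model block could then look like this (untested sketch):

matrix[P, P] new_cov = add_diag(tcrossprod(D * L_Omega), psi);
matrix[P, P] L_cov = cholesky_decompose(new_cov);
// ... priors as before ...
Ind ~ multi_normal_cholesky(d, L_cov);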

Good catch!

Tomas

If your objective is to run the model with binary (categorical) indicators, the model should not be written the same way as with continuous indicators: you would be misspecifying the relation between the factor and the indicators, which is linear with continuous indicators but should be either probit or logistic with categorical ones.

Some options would be using an IRT parameterization to specify the logistic relation, or the SEM approach that assumes an underlying continuous latent response for each categorical item.

Hi Mauricio.

Thanks for pointing this out. The observed variables for the CFA component of my model are continuous, so it’s OK to use a Normal distribution for the marginal (reduced form) model. The categorical outcomes are captured in another component.

In the full model I wrote above, y_{ij} are categorical and I_{i} are continuous. The reduced form model, then, is I_{i} \sim N(\cdot) and y_{ij} \sim \text{softmax}(\cdot).

I wrote a script (below) to do a basic comparison of the two models using rstan 2.21.7. There are also links to the two models, which are minor modifications of the original models (without the suggestion of @spinkney). I am seeing speed and effective sample size advantages of marginal at smaller N. Speed gets similar at larger N, but effective sample size of marginal is still much larger than that of data augmentation.

The trick from @simonbrauer is worthwhile; blavaan also does that (for the past year or so), and it really helps for large N. If you need the constants of the MVN likelihood for that approach, see the Stan code in this post.

The comparison R script is below. You can play with it by changing the values of N, M, and pper. I’d be interested to hear whether your results differ from mine.

tr_cond.stan (1.8 KB)
tr_marg.stan (1.9 KB)

library(rstan)
library(mvtnorm)

N <- c(100, 2000)
M <- c(3, 9)
pper <- 3 ## indicators per factor

conds <- expand.grid(N = N, M = M, pper = pper)

mmarg <- stan_model('tr_marg.stan')
mcond <- stan_model('tr_cond.stan')

tmpR <- matrix(.4, max(pper), max(pper))
diag(tmpR) <- 1

res <- vector("list", nrow(conds))

savepars <- c("Dvec", "psi", "delta", "L_Omega") ## note: "d" in the marginal model is renamed
                                                 ##       "delta"

for(i in 1:nrow(conds)) {
  N <- conds[i,'N']
  M <- conds[i,'M']
  pper <- conds[i,'pper']

  datmat <- matrix(NA, N, M * pper)
  Dskel <- diag(M) %x% matrix(1, pper, 1)
  
  for(j in 1:M) {
    datmat[, (pper * (j-1) + 1):(pper * j)] <- rmvnorm(N, sigma = tmpR[1:pper, 1:pper])
  }
  
  t1 <- system.time(r1 <- sampling(mmarg, data = list(N = N, M = M, P = pper * M, Ind = datmat,
                                                      D_skeleton = Dskel, Dvec_length = sum(Dskel)),
                                   warmup = 500, iter = 1000, pars = savepars))

  t2 <- system.time(r2 <- sampling(mcond, data = list(N = N, M = M, P = pper * M, Ind = datmat,
                                                      D_skeleton = Dskel, Dvec_length = sum(Dskel)),
                                   warmup = 500, iter = 1000, pars = savepars))

  res[[i]] <- list(r1 = summary(r1)[[1]], r2 = summary(r2)[[1]], t1 = t1, t2 = t2)
}

Thanks for clarifying this. Based on the 2021 article, I was under the impression that the “efficient” model was still at the observation level and used a multivariate normal distribution marginal across latent variables rather than sampling the latent variables for each observation. I understand that the model code is pre-compiled. Is the original Stan syntax available?

And thank you for your work on blavaan. It is a great tool.


The Stan model that gets precompiled is on github here. You can see its history by going back to old commits.

Currently, for continuous data, the likelihood evaluation (marginal across latents) is based on the sample covariance matrix. Then we go back to the observation level in generated quantities for the WAIC/LOO computations. For continuous data with missing values, we still use the sample covariance matrix, computed separately for each missing-data pattern. I have not been able to get this sort of thing working for ordinal data, so the ordinal data likelihood is still at the observation level and is fairly slow.


Thanks so much for doing this. I’m now sold on the marginal version of the CFA.

I ran basically the same script with rstan 2.21.7, using @spinkney’s suggestion, for N <- c(100, 500, 2000, 5000). When N is large, the marginal model tends to run faster. More importantly, in my opinion, it had no issues with chains failing to mix (I guess that also helps with runtime?) and a much better effective sample size across the board.

Guess this solves my question! Thanks all for your input.


tr_cond.stan (1.8 KB)
tr_marg.stan (1.9 KB)
