CFA model cannot recover simulated values

I am using cmdstanr to fit a confirmatory factor analysis (CFA) model, but cannot recover the true values from my simulation.
Here is the model:
X = WZ + \epsilon,
where X is a 10 × 100 data matrix, W is a 10 × 4 loading matrix, and Z is a 4 × 100 latent factor matrix.
This is a CFA model, so W has a known structure. To avoid identifiability issues, I made it very sparse, with each latent variable targeting a distinct set of observed variables, e.g. (rows are factors, columns are observed variables):
1,0,0,0,0,0,0,0,0,0,
0,1,1,0,0,0,0,0,0,0,
0,0,0,1,1,1,0,0,0,0,
0,0,0,0,0,0,1,1,1,1
Also, I assume all elements of Z are non-negative, to avoid the sign-flipping issue.
I then simulate the non-zero loadings of W, as well as Z and \epsilon. To keep things simple, all simulations use normal distributions.
I fit the model in two ways: with the conditional likelihood, X|W,Z,\tau \sim MVN(WZ, diag(\tau)), and with the marginal likelihood, X|W,\tau \sim MVN(0, WW^T + diag(\tau)).
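For reference, the marginal form follows by integrating Z out of the conditional model, assuming an untruncated Z whose columns are Z_n \sim N(0, I): for each column X_n of X,

E[X_n] = W E[Z_n] = 0,
Cov(X_n) = W Cov(Z_n) W^T + Cov(\epsilon_n) = WW^T + diag(\tau).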
The simulation R code and Stan code are attached below.

## dimensions
n_TFs = 4 
n_TGs = 10
n_samples = 100

## prior structure
prior = matrix(data = c(1,0,0,0,0,0,0,0,0,0,
                        0,1,1,0,0,0,0,0,0,0,
                        0,0,0,1,1,1,0,0,0,0,
                        0,0,0,0,0,0,1,1,1,1),
               byrow = T,nrow = 4,ncol = 10)
rownames(prior) = c("TF1","TF2","TF3","TF4")
colnames(prior) = c("TG1","TG2","TG3","TG4","TG5","TG6","TG7","TG8","TG9","TG10")
prior

## simulate W ~ N(0,1)
W = prior
set.seed(1)
W[W==1] = rnorm(n=sum(prior),mean=0,sd=1)
W

## simulate Z~N+(0,1)
library(tmvtnorm)
set.seed(1)
Z = rtmvnorm(n_samples, mean=rep(0,n_TFs), sigma=diag(rep(1,n_TFs)), lower=rep(0,n_TFs))
Z = t(Z)

## simulate eps ~ N(0, 0.1) (variance 0.1)
set.seed(1)
library(MASS)
eps=MASS::mvrnorm(n_samples,mu=rep(0,n_TGs),Sigma = diag(rep(0.1,n_TGs)))

## X
X = t(W)%*%Z + t(eps)

## fit model using the conditional likelihood (the marginal model is fit the same way with its own Stan file)
require(cmdstanr)
P = prior
n_genes = dim(X)[1]
n_samples = dim(X)[2]
n_TFs = dim(P)[1]
P = as.vector(t(P)) 
P_zero = which(P==0)
P_ones = which(P!=0)
n_zero = length(P_zero)
n_ones = length(P_ones)
n_all = length(P)
data_to_model = list(n_genes = n_genes, n_samples = n_samples, n_TFs = n_TFs,
                     X = X, P = P, 
                     P_zero = P_zero, P_ones = P_ones,
                     n_zero = n_zero,n_ones = n_ones,
                     n_all = n_all)

#1. create the Stan program
mod <- cmdstan_model("ex_conditional.stan")

#2. run MCMC
fit <- mod$sample(data = data_to_model, chains = 1, seed = 1, max_treedepth = 10,
                  iter_warmup = 500, iter_sampling = 1000, adapt_delta = 0.99)

#3. check model using PSIS-LOO
require(loo)
loocv = loo(fit$draws("log_lik", format = "draws_array"),
            r_eff = relative_eff(exp(fit$draws("log_lik", format = "draws_array"))),
            moment_match = TRUE)
print(loocv)
  
#4. find parameter
W_sample = fit$draws("W",format = "draws_matrix") ## each row is one posterior draw of the vectorized W
one_ind = which(colSums(W_sample==0)<nrow(W_sample)) ## indices of columns that are not all zero

## point summary of W
W_pos = colMeans(W_sample[,one_ind]) ## average W
W_pos2 = rep(0,ncol(W_sample))
W_pos2[one_ind] = W_pos
W_pos = matrix(W_pos2,nrow = n_genes,ncol = n_TFs) ## convert to matrix genes*TFs

## point summary of Z
Z_sample = fit$draws("Z",format = "draws_matrix")
Z_pos = colMeans(Z_sample) ## average Z
Z_pos = matrix(Z_pos,nrow = n_TFs,ncol = n_samples) ## convert to matrix TFs*samples

## compare with true value
t(W)
W_pos

Z[,1:6]
Z_pos[,1:6]

The conditional-likelihood Stan model is below:

data {
  int<lower=0> n_genes;                       // Number of genes
  int<lower=0> n_samples;                     // Number of samples
  int<lower=0> n_TFs;                         // Number of TFs
  int<lower=0> n_zero;                        // Number of zero elements in P
  int<lower=0> n_ones;                        // Number of non-zero elements in P
  int<lower=0> n_all;                         // Total number of elements in P
  matrix[n_genes,n_samples] X;                // Gene expression matrix X
  vector[n_all] P;                            // Prior connection probability
  array[n_zero] int P_zero;                   // Indices of zero-probability edges
  array[n_ones] int P_ones;                   // Indices of non-zero-probability edges
}

transformed data {
  vector[n_genes*n_samples] X_vec = to_vector(X);   // vectorized X (column-major)
}

parameters {
  matrix<lower=0>[n_TFs,n_samples] Z;         // TF activity matrix Z
  vector<lower=0>[n_genes] tau;               // Variances of genes
  vector[n_ones] beta1;                       // Non-zero edge weights of the regulatory network
}

transformed parameters {
  matrix[n_genes, n_TFs] W;                   // Regulatory network W
  vector[n_all] W_vec;                        // Vectorized W
  W_vec[P_zero] = rep_vector(0, n_zero);
  W_vec[P_ones] = beta1;
  W = to_matrix(W_vec, n_genes, n_TFs);       // filled by column
  matrix[n_genes,n_samples] mu = W * Z;       // Mean of gene expression X
  vector[n_genes*n_samples] X_mu = to_vector(mu);
  vector[n_genes*n_samples] sigma = to_vector(rep_matrix(sqrt(tau), n_samples));
}

model {
  // priors
  tau ~ inv_gamma(1, 1);
  beta1 ~ std_normal();
  to_vector(Z) ~ normal(0, 1);

  // likelihood
  X_vec ~ normal(X_mu, sigma);
}

generated quantities {
  // pointwise log likelihood (leave one element out)
  vector[n_genes*n_samples] log_lik;
  for (i in 1:(n_genes*n_samples)) {
    log_lik[i] = normal_lpdf(X_vec[i] | X_mu[i], sigma[i]);
  }
}

The marginal-likelihood Stan model is below:

data {
  int<lower=0> n_genes;                       // Number of genes
  int<lower=0> n_samples;                     // Number of samples
  int<lower=0> n_TFs;                         // Number of TFs
  int<lower=0> n_zero;                        // Number of zero elements in P
  int<lower=0> n_ones;                        // Number of non-zero elements in P
  int<lower=0> n_all;                         // Total number of elements in P
  matrix[n_genes,n_samples] X;                // Gene expression matrix X
  vector[n_all] P;                            // Prior connection probability
  array[n_zero] int P_zero;                   // Indices of zero-probability edges
  array[n_ones] int P_ones;                   // Indices of non-zero-probability edges
}

parameters {
  vector<lower=0>[n_genes] tau;               // Variances of genes
  vector[n_ones] beta1;                       // Non-zero edge weights of the regulatory network
}

transformed parameters {
  matrix[n_genes, n_TFs] W;                   // Regulatory network W
  vector[n_all] W_vec;                        // Vectorized W
  W_vec[P_zero] = rep_vector(0, n_zero);      // zero edge weights
  W_vec[P_ones] = beta1;
  W = to_matrix(W_vec, n_genes, n_TFs);       // filled by column
  matrix[n_genes,n_genes] X_VCOV = tcrossprod(W) + diag_matrix(tau);  // WW' + diag(tau)
}
  
  
model {
  // priors
  tau ~ inv_gamma(1, 1);
  beta1 ~ std_normal();

  // likelihood
  for (i in 1:n_samples) {
    X[,i] ~ multi_normal(rep_vector(0, n_genes), X_VCOV);
  }
}
  
generated quantities {
  // leave-one-column-out log likelihood
  vector[n_samples] log_lik;
  for (i in 1:n_samples) {
    log_lik[i] = multi_normal_lpdf(X[,i] | rep_vector(0, n_genes), X_VCOV);
  }

  // posterior draws of Z; NOTE: this is not truncated, so it ignores the
  // non-negativity constraint on Z and is not correct!
  matrix[n_TFs,n_samples] Z;
  // conditional covariance: (W' * diag(tau)^-1 * W + I)^-1
  matrix[n_TFs,n_TFs] Z_VCOV = inverse_spd(quad_form_sym(diag_matrix(inv(tau)), W)
                                           + identity_matrix(n_TFs));

  for (i in 1:n_samples) {
    vector[n_TFs] Z_mu = Z_VCOV * W' * diag_matrix(inv(tau)) * X[,i];
    Z[,i] = multi_normal_rng(Z_mu, Z_VCOV);
  }
}


I have a few questions…

  1. My expectation is to recover the simulated values of W and Z, i.e., the posterior P(W_{ij}|X) should be centered at the simulated value of W_{ij}, and P(Z_{jn}|X) at the true value of Z_{jn}. My first question: is this expectation correct? If it is, I find that W and its posterior summary are usually close (I wish they were closer, since this is a very small dataset), but Z and Z_pos are very different from each other. I have very high-dimensional gene expression data to fit later, so I want to have realistic expectations of how accurate this can be.

  2. I used PSIS-LOO for model checking. I noticed that the Pareto k values always look good (all k < 0.5) for the marginal-likelihood model (leave one column out) but are problematic (9.7% bad and 0.7% very bad) for the conditional-likelihood model (leave one element out). Is there an explanation for this? Can I improve the fit of the conditional-likelihood model?

  3. In the marginal-likelihood model the latent variable Z is integrated out, so posterior samples of Z have to be generated in Stan’s “generated quantities” block. My problem is that the distribution of Z is a truncated (non-negative) MVN. I found an answer here, but I am still confused about how to combine that Stan file with my model. Should I wrap it as one big function inside my model?

  4. My conditional-likelihood model runs much faster than the marginal-likelihood one, because the conditional likelihood reduces to univariate normals. But the authors of the “blavaan” package report that the marginal likelihood runs faster because the posterior has far fewer dimensions. I agree with their reasoning, and the Pareto k diagnostics say the marginal-likelihood model fits better. So, for my simple CFA model, how should I speed up the marginal-likelihood Stan model?

  5. I always get the warning “Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue: Chain 1 Exception: normal_lpdf: Scale parameter[2] is 0, but must be positive!” I assume this is OK if it only occurs once, but I still wonder: is there a way to avoid this warning?

I am very grateful for any answers and suggestions!!
Chen

Chen, constraining Z to be non-negative is not a traditional identifiability constraint, and it makes your life difficult by leading to a truncated MVN. Instead, one would typically constrain a single loading per column (of W) to be positive, as sketched below. I suspect the non-negative Z could cause problems with parameter recovery, because that constraint does not allow any of the posterior distributions to overlap with 0. Some related discussion is here.
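For example, a minimal sketch in Stan (P_first and P_rest are hypothetical new data inputs: P_first gives, for each factor, the position in W_vec of the one loading you anchor to be positive, and P_rest holds the remaining non-zero positions):

parameters {
  vector<lower=0>[n_TFs] beta_pos;       // one positive loading per factor (sign anchor)
  vector[n_ones - n_TFs] beta_free;      // remaining non-zero loadings, unconstrained
}
transformed parameters {
  vector[n_all] W_vec = rep_vector(0, n_all);
  W_vec[P_first] = beta_pos;             // anchor loadings
  W_vec[P_rest] = beta_free;             // all other non-zero loadings
  matrix[n_genes, n_TFs] W = to_matrix(W_vec, n_genes, n_TFs);
}

With this in place, Z itself no longer needs the <lower=0> bound for identification (though you may still want it for substantive reasons).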

About bad Pareto k and the conditional likelihood, I often see the same thing and do not know of a good solution. I think the problem is that the number of latent variables increases with the sample size.

About the speed of marginal vs. conditional: it partly depends on the specific model, the sample size, and the number of latent variables, and you may not see a big difference with lower sample sizes and few latent variables. To speed up the marginal model, it helps to write the multivariate normal lpdf in terms of the sample mean vector and covariance matrix, defined as a custom function. Then you compute a single marginal density across all the observations, instead of one per observation. This is what happens in newer versions of blavaan.

About the warning message, I would guess that it is related to the initial values and could be fixed by defining your own, for example as sketched below.
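A rough sketch with cmdstanr’s init argument (the parameter names match your conditional model; the values are just plausible starting points away from zero):

init_fun <- function() {
  list(tau   = rep(1, n_TGs),                         # keep the scale parameter away from 0
       beta1 = rnorm(n_ones, 0, 0.1),
       Z     = matrix(abs(rnorm(n_TFs * n_samples, 0, 0.1)),
                      n_TFs, n_samples))              # respects Z's <lower=0> constraint
}
fit <- mod$sample(data = data_to_model, chains = 1, seed = 1, init = init_fun,
                  iter_warmup = 500, iter_sampling = 1000, adapt_delta = 0.99)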

Hi Ed, thanks for the information! I am sorry I forgot to mention that my constraint on Z is also based on domain knowledge. I work with biological data, where Z is interpreted as a protein activity matrix, so it has to be non-negative. I guess this is a bit like Bayesian NMF. Do you know of any work that has successfully applied Stan to Bayesian NMF? I wonder how they deal with so many non-negativity constraints.

I am glad you have seen the same thing! But if PSIS-LOO is not great for the conditional likelihood, what other model checking do you usually use?

This sounds very interesting. I am not sure I understand how to do it yet, but I will look at the blavaan Stan code.

I don’t know of a Bayesian NMF implementation in Stan.

For PSIS-LOO and bad Pareto k, I typically avoid the conditional likelihood and use marginal instead. If you wanted to use conditional likelihood, I think the recommendation would be to set up your own cross-validation. Some more details are here (especially see the FAQ).

For parameterizing the multivariate normal via the mean and covariance matrix, I think the code below will do it. Mu and Sigma are the multivariate normal parameters; xbar and S are the observed mean and covariance matrix; N is the number of observations; p is the dimension of the multivariate normal:

out = -0.5 * N * ( sum( inverse_spd(Sigma) .* (S + (xbar - Mu) * (xbar - Mu)') ) + log_determinant(Sigma) + p * log(2 * pi()) );
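Wrapped as a Stan function (my sketch, not blavaan’s exact code; xbar and S are computed once in transformed data, so the model block evaluates a single density per gradient evaluation):

functions {
  // MVN log density of N observations via sufficient statistics:
  // xbar is the sample mean, S the (divide-by-N) sample covariance.
  real mvn_suffstat_lpdf(matrix S, vector xbar, vector Mu, matrix Sigma,
                         int N, int p) {
    return -0.5 * N * (sum(inverse_spd(Sigma) .* (S + (xbar - Mu) * (xbar - Mu)'))
                       + log_determinant(Sigma) + p * log(2 * pi()));
  }
}
transformed data {
  vector[n_genes] xbar;
  matrix[n_genes, n_genes] S;
  for (g in 1:n_genes) xbar[g] = mean(X[g]);
  S = tcrossprod(X - rep_matrix(xbar, n_samples)) / n_samples;
}
model {
  // replaces the loop over multi_normal(...) in your marginal model
  target += mvn_suffstat_lpdf(S | xbar, rep_vector(0, n_genes), X_VCOV,
                              n_samples, n_genes);
}

Because xbar and S are data, the per-iteration cost involves only p × p operations on Sigma, regardless of n_samples.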

Hi Ed, many thanks for your answers! A quick follow-up question: how did you generate the posterior of the latent variables with the marginal likelihood? I don’t see it in the “generated quantities” block of blavaan’s Stan code. I want to check my formulas/code, as I am new to Bayesian CFA. Could you share some references or your derivation? Thank you!

You are right that this doesn’t appear in generated quantities. At one point the Stan file was causing “out of memory” errors when CRAN tried to compile it, so I moved the latent-variable computation to R to make the Stan file smaller. But you can see the Stan procedure in old versions of the file, such as here.

The idea is that the latent variables and observed variables are jointly multivariate normal; the specific distribution appears in the LISREL book. From this, you obtain the conditional distribution of the latent variables given the observed variables using standard properties of the multivariate normal, and then generate values from that conditional distribution (which is again multivariate normal).
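To spell out the derivation in your notation (my sketch, assuming an untruncated Z_n \sim N(0, I)): each column pair is jointly

(Z_n, X_n) \sim MVN( 0, [ I , W^T ; W , WW^T + diag(\tau) ] ),

so the usual MVN conditioning formula gives

Z_n | X_n = x \sim MVN( W^T (WW^T + diag(\tau))^{-1} x , I - W^T (WW^T + diag(\tau))^{-1} W ),

which, by the Woodbury identity, is exactly what your generated quantities block computes:

Cov(Z_n | x) = (W^T diag(\tau)^{-1} W + I)^{-1},
E[Z_n | x] = Cov(Z_n | x) W^T diag(\tau)^{-1} x.

If you need to respect the non-negativity of Z, one option (a sketch using tmvtnorm, which your simulation code already loads; W_draw and tau_draw are one posterior draw of W and tau) is to draw from this conditional distribution in R with a lower bound of 0:

## truncated draw of Z[,i] | X[,i] for one posterior draw
## (W_draw: n_genes x n_TFs, tau_draw: length n_genes)
Z_vcov <- solve(t(W_draw) %*% diag(1 / tau_draw) %*% W_draw + diag(n_TFs))
Z_mu   <- Z_vcov %*% t(W_draw) %*% diag(1 / tau_draw) %*% X[, i]
Z_i    <- tmvtnorm::rtmvnorm(1, mean = as.vector(Z_mu), sigma = Z_vcov,
                             lower = rep(0, n_TFs))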
