I am using cmdstanr to fit a confirmatory factor analysis (CFA) model, but cannot recover the true values from my simulation.
Here is the model:
X = WZ + \epsilon .
where X is a 10*100 dimensional data matrix. W is a 10*4 loading matrix, and Z is a 4*100 latent factor matrix.
This is a CFA model, so W has a known structure. To avoid the identifiability issue, I made it very sparse and each LV specifically targets a different set of observed variables. e.g.,
1,0,0,0,0,0,0,0,0,0,
0,1,1,0,0,0,0,0,0,0,
0,0,0,1,1,1,0,0,0,0,
0,0,0,0,0,0,1,1,1,1
Also, I assume all the elements in Z are nonnegative, to avoid the sign flipping issue.
Then I simulate data on those nonzero loadings in W and also simulate Z and \epsilon. To make it simple, all the simulation uses normal distribution.
I used two ways to fit the model, one is conditional likelihood, XW,Z,\tau \sim MVN(WZ, diag(\tau)), the other is marginal likelihood, XW,\tau \sim MVN(0,WW^T + diag(\tau)).
Simulation R code and stan code are attached below.
## dimensions
n_TFs = 4
n_TGs = 10
n_samples = 100
## prior structure
prior = matrix(data = c(1,0,0,0,0,0,0,0,0,0,
0,1,1,0,0,0,0,0,0,0,
0,0,0,1,1,1,0,0,0,0,
0,0,0,0,0,0,1,1,1,1),
byrow = T,nrow = 4,ncol = 10)
rownames(prior) = c("TF1","TF2","TF3","TF4")
colnames(prior) = c("TG1","TG2","TG3","TG4","TG5","TG6","TG7","TG8","TG9","TG10")
prior
## simulate W ~ N(0,1)
W = prior
set.seed(1)
W[W==1] = rnorm(n=sum(prior),mean=0,sd=1)
W
## simulate Z~N+(0,1)
library(tmvtnorm)
set.seed(1)
Z = rtmvnorm(n_samples, mean=rep(0,n_TFs), sigma=diag(rep(1,n_TFs)), lower=rep(0,n_TFs))
Z = t(Z)
## simulate eps~N(0,0.1)
set.seed(1)
library(MASS)
eps=MASS::mvrnorm(n_samples,mu=rep(0,n_TGs),Sigma = diag(rep(0.1,n_TGs)))
## X
X = t(W)%*%Z + t(eps)
## fit model using marginal likelihood
require(cmdstanr)
P = prior
n_genes = dim(X)[1]
n_samples = dim(X)[2]
n_TFs = dim(P)[1]
P = as.vector(t(P))
P_zero = which(P==0)
P_ones = which(P!=0)
n_zero = length(P_zero)
n_ones = length(P_ones)
n_all = length(P)
data_to_model = list(n_genes = n_TGs, n_samples = n_samples, n_TFs = n_TFs,
X = X, P = P,
P_zero = P_zero, P_ones = P_ones,
n_zero = n_zero,n_ones = n_ones,
n_all = n_all)
#1. creat stan programme file
mod < cmdstan_model("ex_conditional.stan")
#2. run MCMC
fit < mod$sample(data = data_to_model,chains=1,seed = seed,max_treedepth=10,
iter_warmup = 500,iter_sampling=1000,adapt_delta=0.99)
#3. check model using PSISLOO
require(loo)
loocv = loo(fit$draws("log_lik",format = "draws_array"),
r_eff=relative_eff(fit$draws("log_lik",format = "draws_array")),
moment_match=TRUE)
print(loocv)
#4. find parameter
W_sample = fit$draws("W",format = "draws_matrix") ## matrix, each row is a sample of vectorized matrix
one_ind = which(colSums(W_sample==0)<nrow(W_sample)) ## index for allzero columns
## point summary of W
W_pos = colMeans(W_sample[,one_ind]) ## average W
W_pos2 = rep(0,ncol(W_sample))
W_pos2[one_ind] = W_pos
W_pos = matrix(W_pos2,nrow = n_genes,ncol = n_TFs) ## convert to matrix genes*TFs
## point summary of Z
Z_sample = fit$draws("Z",format = "draws_matrix")
Z_pos = colMeans(Z_sample) ## average Z
Z_pos = matrix(Z_pos,nrow = n_TFs,ncol = n_samples) ## convert to matrix TFs*samples
## compare with true value
t(W)
W_pos
Z[,1:6]
Z_pos[,1:6]
Stan conditional likelihood model is below:
data {
int<lower=0> n_genes; // Number of genes
int<lower=0> n_samples; // Number of samples
int<lower=0> n_TFs; // Number of TFs
int<lower=0> n_zero; // length of zero elements in P
int<lower=0> n_ones; // length of nonzero elements in P
int<lower=0> n_all; // length of all elements in P
matrix[n_genes,n_samples] X; // Gene expression matrix X
vector[n_all] P; // Prior connection probability
array[n_zero] int P_zero; // index of zero probablity edges
array[n_ones] int P_ones; // index of nonzero prob edges
}
transformed data {
vector[n_genes*n_samples] X_vec; // gene expression X
X_vec = to_vector(X);
}
parameters {
matrix<lower=0>[n_TFs,n_samples] Z; // TF activity matrix Z
vector<lower=0>[n_genes] tau; // variances of genes
vector[n_ones] beta1; // Regulatory network nonzero edge weight
}
transformed parameters {
matrix[n_genes, n_TFs] W; // Regulatory netwrok W
vector[n_all] W_vec; // Regulatory vector W_vec
W_vec[P_zero]=rep_vector(0,n_zero);
W_vec[P_ones]=beta1;
W = to_matrix(W_vec,n_genes,n_TFs); // by column
matrix[n_genes,n_samples] mu = W*Z; // mu for gene expression X
vector[n_genes*n_samples] X_mu = to_vector(mu);
vector[n_genes*n_samples] sigma = to_vector(rep_matrix(sqrt(tau),n_samples));
}
model {
// priors
tau ~ inv_gamma(1,1);
beta1 ~ std_normal();
to_vector(Z) ~ normal(0,1);
// likelihood
X_vec ~ normal(X_mu, sigma);
}
generated quantities {
// leave one element out
vector[n_genes*n_samples] log_lik;
for (i in 1:n_genes*n_samples){
log_lik[i] = normal_lpdf(X_vec[i]X_mu[i],sigma[i]);
}
}
Stan marginal likelihood model is blow:
data {
int<lower=0> n_genes; // Number of genes
int<lower=0> n_samples; // Number of samples
int<lower=0> n_TFs; // Number of TFs
int<lower=0> n_zero; // length of zero elements in P
int<lower=0> n_ones; // length of nonzero elements in P
int<lower=0> n_all; // length of all elements in P
matrix[n_genes,n_samples] X; // Gene expression matrix X
vector[n_all] P; // Prior connection probability
int P_zero[n_zero] ; // index of zero probablity edges
int P_ones[n_ones] ; // index of nonzero prob edges
}
parameters {
vector<lower=0>[n_genes] tau; // variance
vector[n_ones] beta1; // Regulatory network nonzero edge weight
}
transformed parameters {
matrix[n_genes, n_TFs] W; // Regulatory netwrok W
vector[n_all] W_vec; // Regulatory vector W_vec
W_vec[P_zero]=rep_vector(0,n_zero); // zero edge weights
W_vec[P_ones]=beta1;
W = to_matrix(W_vec,n_genes,n_TFs); // by column
matrix[n_genes,n_genes] X_VCOV = tcrossprod(W) + diag_matrix(tau); // WW'+tau*I
}
model {
// priors
tau ~ inv_gamma(1,1);
beta1 ~ std_normal();
// likelihood
for (i in 1:n_samples){
X[,i] ~ multi_normal(rep_vector(0,n_genes),X_VCOV);
}
}
generated quantities {
// leave one column out likelihood
vector[n_samples] log_lik;
for(i in 1:n_samples){
log_lik[i] = multi_normal_lpdf(X[,i]  rep_vector(0,n_genes), X_VCOV);
}
// estimate Z, this is not truncated, not correct!
matrix[n_TFs,n_samples] Z;
// (W'*tau^1*W + I)^1
matrix[n_TFs,n_TFs]Z_VCOV = inverse_spd(quad_form_sym(diag_matrix(inv(tau)),W) + identity_matrix(n_TFs));
for(i in 1:n_samples){
vector[n_TFs]Z_mu = Z_VCOV*W'*diag_matrix(inv(tau))*(X[,i]rep_vector(0,n_genes)) ;
Z[,i] = multi_normal_rng(Z_mu,Z_VCOV);
}
}
//
I have a few questions…

My expectation is to recover the simulated values of W and Z, e.g., the posterior distribution of P(W_{ij}X) should be centered at the simulated value of W_{ij}, and P(Z_{jn}X) also centers at the true value of Z_{jn} My first question, is this expectation correct? If this is correct, I found that W and W_{poserior} are usually close (I wish it could be closer because this is a very small dataset), but Z and Z_{pos} are way different from each other. I have a very high dimensional gene expression data to fit later, so I what to have some correct expectations of how accurate this could be.

I used PSISLOO for model checking. But I noticed that Pareto K always looks good (all K<0.5) for the marginal likelihood model (leave one column out) but is problematic (9.7% bad and 0.7% very bad) with the conditional likelihood model (leave one element out). I wonder if there is any explanation for this. Can I improve modeling fit for the conditional likelihood?

In the marginal likelihood model, latent variable Z is integrated out, so we need to generate posterior samples of Z in Stan’s “generated quantities” block. My problem is my distribution for Z is a truncated MVN (nonnegative). I found an answer from here. But I am still confused how to add this stan file to my stan model. Should I wrap it as a big function in my model?

My conditional likelihood model runs much faster than marginal likelihood because conditional likelihood implies univariate likelihood. But in “blavaan” package, the authors claim marginal likelihood is running faster because the posterior density has a much less dimension. I agree with their idea and I found that the Pareto K says marginal likelihood model has a better fit. So I wonder in my simple CFA model, how should I accelerate my marginal likelihood stan model?

I always found a warning “Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: normal_lpdf: Scale parameter[2] is 0, but must be positive!”
I assume this is OK if it only occurs once, but still, I wonder if there is a way to avoid this warning?
I am very grateful for any answers and suggestions!!
Chen