Help with factor analysis (latent variable model)

Not sure at all if this is the reason why you are encountering the problems you flag, but is it possible that the conventional Poisson model is not the most appropriate specification here? Could there be, for example, excess zeroes and non-linearities in the data that need to be accommodated?

Sadykova, D., Scott, B.E., De Dominicis, M., Wakelin, S.L., Sadykov, A., & Wolf, J. (2017). Bayesian joint models with INLA exploring marine mobile predator-prey and competitor species habitat overlap. Ecology and Evolution, 7(14), 5212-5226. doi: 10.1002/ece3.3081

Thank you for your answer!

You are perfectly right, and a negative binomial might even be sufficient.
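For example, a minimal sketch of how the Poisson likelihood line could be swapped for a negative binomial (phi is a new, made-up overdispersion parameter that would need to be declared and given a prior; this is just an illustration, not the model as posted):

parameters {
  real<lower=0> phi;                      // negative-binomial overdispersion (new)
}
model {
  phi ~ gamma(2, 0.1);                    // weakly informative, purely as an example
  for (i in 1:N)
    Y[i, ] ~ neg_binomial_2(mu[i], phi);  // replaces Y[i,] ~ poisson(mu[i]);
}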

However, I notice something: I have only one hyperparameter shared by all the factor loadings. Would it not be better for each factor's loadings to have their own variance?

Example: loadings[factor[d]] ~ normal(0, sd[d])

I'm still wondering how to code that efficiently, given that my non-zero, non-diagonal loadings are stored in a vector…

Nevermind, ignore this, I’m super wrong.

The eigenvalues of L would be positive, but not necessarily LL^T.

Check L =
[1, 0;
-5, 1]

Oh you’re right, I did not understand that! Thanks! Is there a way to constrain LL^T to be positive semi-definite??

Lol I don’t think I am. No paper no pencil, trying to think it out in my head, blech, mistakes were made.

You could do that, but it would seem to require a lot of background knowledge that may not be available.

meaning the loading-specific prior variances.
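For reference, a minimal sketch of one way loading-specific scales could be indexed (it relies on the D, P, M, and L_lower definitions from the model posted later in the thread; col_idx and sd_load are made-up names, and this is a sketch rather than a tested implementation):

transformed data {
  int col_idx[M];           // factor (column) index of each free lower-triangular loading
  {
    int m;
    m = 0;
    for (j in 1:D) {
      for (i in (j+1):P) {
        m = m + 1;
        col_idx[m] = j;     // same filling order as the loop that builds lambda
      }
    }
  }
}
parameters {
  vector<lower=0>[D] sd_load;   // one prior scale per factor's loadings
  vector[M] L_lower;
}
model {
  sd_load ~ cauchy(0, 2.5);                // half-Cauchy given the <lower=0> constraint
  L_lower ~ normal(0, sd_load[col_idx]);   // each loading gets its own factor's scale
}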

I thought about that because of papers from Ovaskainen's team (2016, 2017), which implement two very interesting features:

  • Sparse infinite Bayesian factor models. This approach cannot be directly implemented in Stan, but could potentially be approximated by shrinkage or marginalization, I think, even if I do not understand the subtleties of these subjects (see the rough sketch after this list).
  • Determining an underlying structure for the factors (spatial in the 2016 paper) or the loadings (dependent on environmental variables in the 2017 paper).

I would love to be able to implement that in the Stan language and exploit its powerful sampler.
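As for the shrinkage idea above, here is a rough sketch (my own guess, not the papers' actual implementation) of a truncated, column-wise multiplicative shrinkage prior in the spirit of Bhattacharya & Dunson's sparse infinite factor model; delta and tau are made-up names, and the final comment indicates where it would plug into a loading prior:

parameters {
  vector<lower=0>[D] delta;       // multiplicative shrinkage increments, one per factor
}
transformed parameters {
  vector<lower=0>[D] tau;         // cumulative precision of each factor's loadings
  tau[1] = delta[1];
  for (d in 2:D) tau[d] = tau[d-1] * delta[d];
}
model {
  delta[1] ~ gamma(2, 1);
  for (d in 2:D) delta[d] ~ gamma(3, 1);  // shape > 1 pushes later factors toward stronger shrinkage
  // a free loading in column d would then get something like:
  //   lambda[i, d] ~ normal(0, inv_sqrt(tau[d]));
}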

R tells me you are probably right :D

A <- matrix(c(1,0,-5,1), ncol = 2, byrow = T)
A
     [,1] [,2]
[1,]    1    0
[2,]   -5    1
eigen(A)
eigen() decomposition
$values
[1] 1 1

$vectors
     [,1]         [,2]
[1,]    0 4.440892e-17
[2,]    1 1.000000e+00

A %*%t(A)
     [,1] [,2]
[1,]    1   -5
[2,]   -5   26
eigen(A%*%t(A))
eigen() decomposition
$values
[1] 26.96291202  0.03708798

$vectors
           [,1]       [,2]
[1,] -0.1891075 -0.9819564
[2,]  0.9819564 -0.1891075
data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  int<lower=1> M;
  M = D*(P-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1;
  } //Sd of factors = 1
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; // Intercept per species
  vector[M] L_lower; //Lower diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Cholesky factor of the correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Sd of the row intercepts
  real<lower=0> sigma_b0; //Sd of the species intercepts
  real<lower=0> mu_lower;
  real<lower=0> sigma_lower;
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of loadings

  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  matrix[N,P] temp; //intermediate predictor
  matrix<lower=0>[N,P] mu; // predicted values

  // Correlation matrix of factors
  Ld = diag_pre_multiply(FS_sd, Rho);
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;

    // Constraints to allow identifiability of loadings
    for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
    for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
    for(j in 1:D) {
      for(i in (j+1):P) {
        idx2 = idx2+1;
        lambda[i,j] = L_lower[idx2];
      }
    }
  }

  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);

}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercept effects hyperprior
  mu_lower ~ gamma(2,0.1);//Mean of lower loadings
  sigma_lower ~ gamma(2,0.1); //Variance of lower loadings
  eta ~ gamma(2,0.1); //Parameter of the cholesky prior for FS correlation structure

  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  L_diag ~ normal(0,0.1); //Prior for diagonal elements
  L_lower ~ normal(mu_lower, sigma_lower); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho

  for(i in 1:N){
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  //cov_matrix[P] cov_lambda;

  //cov_lambda = lambda * lambda';
}

I modified your code minimally, changing only what I think is absolutely necessary to get rid of the BFMI warning. Nevertheless, I believe that your hyperpriors need refinement.


Thank you very much!

You essentially moved the fixed factor parameters into the transformed data block to avoid sampling them, didn't you?

While reading your new version, I just realized that I had left a strictly positive hyperprior on mu_lower, left over from another version in which I constrained the loadings to be positive: a great amount of trouble could have come from this misspecification!

Yes. I changed this too:

Something to discuss:

 for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
 for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
 for(j in 1:D) {
   for(i in (j+1):P) {
     idx2 = idx2+1;
     lambda[i,j] = L_lower[idx2];
   }
 }

Isn't the above very similar to the Bartlett decomposition of the Wishart distribution? In that decomposition, a Wishart draw factors as W = L A A' L', where A is lower triangular with squared diagonal elements that are chi-square distributed and below-diagonal elements that are standard normal, and L is the Cholesky factor of the scale matrix; check that matrix A here, with L unit diagonal (thus redundant), or L can be used as the scale matrix.

If so, we do not need mu_lower and sigma_lower.

Last, I would change the sigmas from gamma to Cauchy:
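Presumably something along these lines (just a guess at what was meant; with the <lower=0> declarations these act as half-Cauchy priors):

  sigma_a ~ cauchy(0, 2.5);   // instead of gamma(2, 0.1)
  sigma_b0 ~ cauchy(0, 2.5);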

OK, the Bartlett decomposition really looks like what is implemented in the JAGS code from boral (normal prior on the lower diagonal, zeros on the upper). However, they do not sample the diagonal from a chi-squared distribution, but from a truncated normal…

Thank you very much for these insights, I would not have guessed that by myself!

## JAGS model written for boral version 1.5 on 2018-04-20 09:43:59 ##

 model {
	 ## Data Level ## 
	 for(i in 1:n) {
		 for(j in 1:p) { eta[i,j] <- inprod(lv.coefs[j,2:(num.lv+1)],lvs[i,]) + row.coefs.ID1[row.ids[i,1]] }
		 for(j in 1:p) { y[i,j] ~ dpois(exp(lv.coefs[j,1] + eta[i,j])) }

		 }

	 ## Latent variables ## 
	 for(i in 1:n) { for(k in 1:num.lv) { lvs[i,k] ~ dnorm(0,1) } } 

	 ## Process level and priors ##
	 for(j in 1:p) { lv.coefs[j,1] ~ dnorm(0,0.1) } ## Separate species intercepts

	 for(i in 1:n.ID[1]) { row.coefs.ID1[i] ~ dnorm(0, pow(row.sigma.ID1,-2)) } 
	 row.sigma.ID1 ~ dunif(0,30)

	 for(i in 1:(num.lv-1)) { for(j in (i+2):(num.lv+1)) { lv.coefs[i,j] <- 0 } } ## Constraints to 0 on upper diagonal
	 for(i in 1:num.lv) { lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,) } ## Sign constraints on diagonal elements
	 for(i in 2:num.lv) { for(j in 2:i) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## Free lower diagonals
	 for(i in (num.lv+1):p) { for(j in 2:(num.lv+1)) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## All other elements

	 }

That's OK in Stan too. You can sample from a truncated normal as well. It's fine.
You may also consider using an LKJ correlation distribution instead of a Bartlett decomposition of the Wishart.

lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,)

be aware that this would be normal(0, sqrt(1/0.1)) in Stan, since JAGS parameterizes dnorm by precision; don't forget to use constraints.
Also, I suspect the gammas correspond to the sqrt of an inv_gamma in Stan.
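For instance, a literal translation of that JAGS line might look like this (a sketch; diag_loading is a placeholder name, and sqrt(1/0.1) ≈ 3.16 because JAGS's dnorm takes a precision):

parameters {
  real<lower=0> diag_loading;               // I(0,) in JAGS becomes a lower bound in Stan
}
model {
  diag_loading ~ normal(0, sqrt(1 / 0.1));  // dnorm(0, 0.1)I(0,) in JAGS
}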

Oh, I didn't know about these differences between JAGS and Stan…
I have literally implemented the Bartlett decomposition in my code, and it works very well: no divergences, and the Rhats and effective sample sizes look great for all loadings! The observed covariance matrix seems to be approximated well. I will quickly put together a case study with simulated data!

data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  int<lower=1> M;
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  M = D*(P-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1; //Sd of factors = 1
  }
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; // Intercept per species
  vector[M] L_lower; //Lower diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Cholesky factor of the correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Sd of the row intercepts
  real<lower=0> sigma_b0; //Sd of the species intercepts
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of loadings
  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  matrix[N,P] temp; //intermediate predictor
  matrix<lower=0>[N,P] mu; // predicted values
  
  // Correlation matrix of factors
  Ld = diag_pre_multiply(FS_sd, Rho);
  
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;
    
    // Constraints to allow identifiability of loadings
    for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
    for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
    for(j in 1:D) {
      for(i in (j+1):P) {
        idx2 = idx2+1;
        lambda[i,j] = L_lower[idx2];
      }
    }
  }
  
  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);
  
}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercept effects hyperprior
  eta ~ gamma(2,0.1); //Parameter of the cholesky prior for FS correlation structure
  
  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  for(d in 1:D) L_diag[d] ~ chi_square(M-d);
  L_lower ~ normal(0,1); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho
  
  for(i in 1:N){	
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  matrix[P,P] cov_lambda;
  
  cov_lambda = multiply_lower_tri_self_transpose(lambda);
}

If L is square and lower triangular, then as long as it has positive diagonal entries, L * L' will be positive definite. Otherwise, you can use a non-square L for lower ranks. I don't know if there's a way to force it to be semi-definite of unknown rank.

If someone wants to code up an efficient Wishart and inverse Wishart for Cholesky factors of covariance matrices, that’d be great. We have the correlation-based LKJ prior for Cholesky factors of correlation matrices, but that’s it for efficient priors.

Cool!

The efficient Wishart for Cholesky factors of covariance matrices is already in the Stan manual, p. 350.
The inverse Wishart is easy to calculate; to sample, one has to use the transpose of the inverse.

Since A^-1 = (L L^T)^-1 = (L^-1)^T L^-1 = U^-1 (U^-1)^T, where U = L^T.

I didn’t realize that was in there. I’m guessing Ben added it at some point.

Just out of curiosity, why (in the context of your data) do you compute the above?

Thanks

Hi!
This operation just creates a covariance matrix with the estimated correlations between factors but the diagonal fixed to one. This is the Φ of the classical factor analysis formulation. But I just realized that directly using a constrained correlation matrix should work, without this unnecessary operation!
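For what it's worth, a minimal sketch of that simplification: since FS_sd is fixed to 1, Ld = diag_pre_multiply(FS_sd, Rho) is just Rho, so the factor scores can use the Cholesky factor of the correlation matrix directly.

  // With unit factor variances, Ld is no longer needed:
  for (i in 1:N)
    FS[i, ] ~ multi_normal_cholesky(FS_mu, Rho);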