Help with factor analysis (latent variable model)

ldeschamps · April 19, 2018, 2:49pm

Hello!

I am exploring latent variable model, as described in ecological papers (from Warton, Ovaskainen, Blanchet et al.). The principal aim is to model residual correlations among species abundances after controlling for environmental variables.

After reading the google group discussion and Rick Farouni’s blog post, I tried to reproduce a simple example from the boral package, which realize exactly that by samping posteriors with jags.

EDIT : I just remembered this discussion on discourse, and I realized I have done exactly the same thing. However, I thought I followed Rick’s recommendation, but it seems I encounter the same problems…

I tried different formulations, but I always encounter BFMI warning, and I can’t reproduce boral results.

Here is the stan code

data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  int<lower=1> M;
  M = D*(P-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; // Intercept per species
  vector[M] L_lower; //Lower diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Variance of the row intercepts
  real<lower=0> sigma_b0; //Variance of the species intercepts
  real<lower=0> mu_lower;
  real<lower=0> sigma_lower;
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of laodings
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  matrix[N,P] temp; //intermediate predictor
  matrix<lower=0>[N,P] mu; // predicted values
  
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1;} //Sd of factors = 1
  
  // Correlation matrix of factors
  Ld = diag_matrix(FS_sd) * Rho;
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;
    
    // Constraints to allow identifiability of loadings
  	 for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
  	 for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
  	 for(j in 1:D) {
  	   for(i in (j+1):P) {
  	     idx2 = idx2+1;
  	     lambda[i,j] = L_lower[idx2];
  	   }
  	 }
  }
  
  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);
  
}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercept effects hyperprior
  mu_lower ~ gamma(2,0.1);//Mean of lower loadings
  sigma_lower ~ gamma(2,0.1); //Variance of lower loadings
  eta ~ gamma(2,0.1); //Parameter of the cholesky prior for FS correlation structure
  
  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  L_diag ~ normal(0,0.1); //Prior for diagonal elements
  L_lower ~ normal(mu_lower, sigma_lower); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho
  
  for(i in 1:N){	
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  //cov_matrix[P] cov_lambda;
  
  //cov_lambda = lambda * lambda´
}

And a R code producing the reproducible example.

rm(list = ls())

library(rstan)
library(boral) ## require jags
library(mvabund)

# We will use the spider dataset available in mvabund, pitfall trap counts of 12 hunting spider species at 28 sites. First load it from mvabund:
data(spider)

# Cherry-pick a few variables
spid.abund = spider$abund[,c(1:3,7,8,12)]

## Prepare data for stan model
Y <- spid.abund
N <- nrow(Y)
P <- ncol(Y)
D <- 3

StanDat <- list(Y = Y, N = N, P = P, D = D)

## Fit the model with stan
lvm_stan <- stan("Simple_latent_matrix.stan", data = StanDat,
                 iter = 2000, chain = 3, cores = 3)
## Fit the model with boral
lvm_boral <- boral(y = Y, num.lv = D, family = "poisson", row.eff = "random")

Any idea?

Thank you!
Lucas

ldeschamps · April 19, 2018, 3:11pm

First element of solution : I avoided BFMI problems and divergent transitions by remplacing the estimated mean of factors by 0.

L_lower ~ normal(0, sigma_lower); //Regularizing prior for lower loadings

I will check if I this code can recover parameters from a simulated data set.

ldeschamps · April 19, 2018, 3:34pm

However, I don’t understand why cov_matrix is not positive-definite (error if I define cov_lambda as a covariance matrix) : I made lambda having 0 on the upper-diagonal elements, and have positive constrained diagonal.

generated quantities{
  matrix[P,P] cov_lambda;
  
  cov_lambda = multiply_lower_tri_self_transpose(lambda);
}

Cam · April 19, 2018, 4:36pm

Not sure at all if this is the reason why you are encountering the problems you flag, but is it possible that the conventional Poisson model is not the most appropriate specfication here? Could there be, for example, excess zeroes and non-linearities in the data that need to be accommodated?

Sadykova, D., Scott, B.E., Dominicis, M.D., Wakelin, S.L., Sadykov, A., & Wolf, J. (2017). Bayesian joint models with INLA exploring marine mobile predator-prey and competitor species habitat overlap. Ecology and Evolution, 7(14), 212-5226. doi: 10.1002/ece3.3081

ldeschamps · April 19, 2018, 4:52pm

Thank you for your answer!

You are perfectly right, and a negative-binomial could be even sufficient.

However, I notice something : I have only one hyperparameter for every loadings of factor, would I not be better to have each factor’s loading with a unique variance?

Exemple : loadings[factor[d]] ~ normal(0, sd[d])

I’m yet wandering how to code that efficiently, given that my non-zero non diagonal loadings are stored in a vector…

bbbales2 · April 19, 2018, 5:07pm

Nevermind, ignore this, I’m super wrong.

The eigenvalues of L would be positive, but not necessarily LL^T.

Check L =
[1, 0;
-5, 1]

ldeschamps · April 19, 2018, 5:15pm

Oh you’re right, I did not understand that! Thanks! Is there a way to constrain LL^T to be positive semi-definite??

bbbales2 · April 19, 2018, 5:17pm

Lol I don’t think I am. No paper no pencil, trying to think it out in my head, blech, mistakes were made.

Cam · April 19, 2018, 6:15pm

You could do that, but it would seem to require a lot of background knowledge that may not be available.

Cam · April 19, 2018, 6:16pm

meaning the loading-specific prior variances.

ldeschamps · April 19, 2018, 7:42pm

I thought about that because of papers of Ovaskainen team (2016, 2017), which implement two very interesting features :

Sparse infinite bayesian factor models. This approach cannot be directly implemented in stan, but could potentially be approached by shrinkage or marginalization, I think, even if I do not undestand the subtility of these subjects
Determine underlying structure of factors (spatial in 2016 paper) or loadings (dependant of environmental variables in 2017 paper).

I would love to be able to implement that in stan language and exploit the powerful sampler it represents.

ldeschamps · April 19, 2018, 7:44pm

R tells me you are probably right :D

A <- matrix(c(1,0,-5,1), ncol = 2, byrow = T)
A
     [,1] [,2]
[1,]    1    0
[2,]   -5    1
eigen(A)
eigen() decomposition
$values
[1] 1 1

$vectors
     [,1]         [,2]
[1,]    0 4.440892e-17
[2,]    1 1.000000e+00

A %*%t(A)
     [,1] [,2]
[1,]    1   -5
[2,]   -5   26
eigen(A%*%t(A))
eigen() decomposition
$values
[1] 26.96291202  0.03708798

$vectors
           [,1]       [,2]
[1,] -0.1891075 -0.9819564
[2,]  0.9819564 -0.1891075

andre.pfeuffer · April 20, 2018, 8:16am

data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  int<lower=1> M;
  M = D*(P-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1;
  } //Sd of factors = 1
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; // Intercept per species
  vector[M] L_lower; //Lower diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Variance of the row intercepts
  real<lower=0> sigma_b0; //Variance of the species intercepts
  real<lower=0> mu_lower;
  real<lower=0> sigma_lower;
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of laodings

  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  matrix[N,P] temp; //intermediate predictor
  matrix<lower=0>[N,P] mu; // predicted values

  // Correlation matrix of factors
  Ld = diag_pre_multiply(FS_sd, Rho);
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;

    // Constraints to allow identifiability of loadings
  	 for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
  	 for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
  	 for(j in 1:D) {
  	   for(i in (j+1):P) {
  	     idx2 = idx2+1;
  	     lambda[i,j] = L_lower[idx2];
  	   }
  	 }
  }

  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);

}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercept effects hyperprior
  mu_lower ~ gamma(2,0.1);//Mean of lower loadings
  sigma_lower ~ gamma(2,0.1); //Variance of lower loadings
  eta ~ gamma(2,0.1); //Parameter of the cholesky prior for FS correlation structure

  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  L_diag ~ normal(0,0.1); //Prior for diagonal elements
  L_lower ~ normal(mu_lower, sigma_lower); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho

  for(i in 1:N){
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  //cov_matrix[P] cov_lambda;

  //cov_lambda = lambda * lambda´
}

I modified your code minimally to what I think is absolutely necessary to get rid of the BFMI
warning. Nevertheless I believe that your hyper-priors need refinement.

ldeschamps · April 20, 2018, 12:48pm

Thank you very much!

You essentially moved the fixed parameters of factors in the tranformed data block to avoid sampling, isn’t it?

While reading your new version, I just realized that I let a strictly positive hyperprior for mu_lower, coming from another version in which I constrained loadings to be positive : a great amount of trouble could have come from this misspecification!

andre.pfeuffer · April 20, 2018, 12:58pm

Yes. I changed this two:

Something to discuss:

 for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
 for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
 for(j in 1:D) {
   for(i in (j+1):P) {
     idx2 = idx2+1;
     lambda[i,j] = L_lower[idx2];
   }
 }

The above isn’t this just very similar to the Barlett Decomposition of the Wishart Distribution,
check matrix A with L is unit diagonal, thus redundant, or can be used as scale matrix.

If so, we do not need mu_lower and sigma_lower.

Last, I would change the sigma’s from gamma to cauchy:

ldeschamps · April 20, 2018, 1:51pm

Ok, the Barlett decomposition really looks like what is implemented in the jags code from boral (normal prior for the lower diag, zero on the upper). However, they do not sample the diagonal from a chi-squared distribution, but from a truncated normal…

Thank you very much for these insights, I would not have guess that by myself!

## JAGS model written for boral version 1.5 on 2018-04-20 09:43:59 ##

 model {
	 ## Data Level ## 
	 for(i in 1:n) {
		 for(j in 1:p) { eta[i,j] <- inprod(lv.coefs[j,2:(num.lv+1)],lvs[i,]) + row.coefs.ID1[row.ids[i,1]] }
		 for(j in 1:p) { y[i,j] ~ dpois(exp(lv.coefs[j,1] + eta[i,j])) }

		 }

	 ## Latent variables ## 
	 for(i in 1:n) { for(k in 1:num.lv) { lvs[i,k] ~ dnorm(0,1) } } 

	 ## Process level and priors ##
	 for(j in 1:p) { lv.coefs[j,1] ~ dnorm(0,0.1) } ## Separate species intercepts

	 for(i in 1:n.ID[1]) { row.coefs.ID1[i] ~ dnorm(0, pow(row.sigma.ID1,-2)) } 
	 row.sigma.ID1 ~ dunif(0,30)

	 for(i in 1:(num.lv-1)) { for(j in (i+2):(num.lv+1)) { lv.coefs[i,j] <- 0 } } ## Constraints to 0 on upper diagonal
	 for(i in 1:num.lv) { lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,) } ## Sign constraints on diagonal elements
	 for(i in 2:num.lv) { for(j in 2:i) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## Free lower diagonals
	 for(i in (num.lv+1):p) { for(j in 2:(num.lv+1)) { lv.coefs[i,j] ~ dnorm(0,0.1) } } ## All other elements

	 }

andre.pfeuffer · April 20, 2018, 2:10pm

That’s ok in Stan too. You can sample from a truncated normal as well. It’s fine.
You may also consider to use a lkj correlation distribution instead of a Barlett Decomp. of Wishart.

lv.coefs[i,i+1] ~ dnorm(0,0.1)I(0,)

be aware that this would be normal(0, sqrt(1/0.1)); in Stan, don’t forget to use constraints.
Also the gamma’s I suspect to be sqrt of inv_gamma in Stan.

ldeschamps · April 20, 2018, 2:19pm

Oh, I didn’t know about these differences between jags and stan…
I have litteraly implemented the Barlett decomposition in my code, and it works very well : no divergence and rhats and effective sample sizes looks greats for all loadings! The observed covariance matrix seems to be greatly approximated. I will make quickly a case study with simulated data!

data{
  int N; // sample size
  int P; // number of species
  int D; // number of dimensions
  int<lower=0> Y[N,P]; // data matrix of order [N,P]
}
transformed data{
  int<lower=1> M;
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  M = D*(P-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1; //Sd of factors = 1
  }
}
parameters{
  //Parameters
  vector[N] alpha; //Row intercepts
  row_vector[P] b0; // Intercept per species
  vector[M] L_lower; //Lower diagonal loadings
  vector<lower=0>[D] L_diag; //Positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Correlation matrix between factors
  //Hyperparameters
  real<lower=0> sigma_a; //Variance of the row intercepts
  real<lower=0> sigma_b0; //Variance of the species intercepts
  real<lower=0> eta; //Parameter of LKJ prior for Rho
}
transformed parameters{
  cholesky_factor_cov[P,D] lambda; //Final matrix of laodings
  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  matrix[N,P] temp; //intermediate predictor
  matrix<lower=0>[N,P] mu; // predicted values
  
  // Correlation matrix of factors
  Ld = diag_pre_multiply(FS_sd, Rho);
  
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;
    
    // Constraints to allow identifiability of loadings
  	 for(i in 1:(D-1)) { for(j in (i+1):(D)){ lambda[i,j] = 0; } } //0 on upper diagonal
  	 for(i in 1:D) lambda[i,i] = L_diag[i]; //Positive values on diagonal
  	 for(j in 1:D) {
  	   for(i in (j+1):P) {
  	     idx2 = idx2+1;
  	     lambda[i,j] = L_lower[idx2];
  	   }
  	 }
  }
  
  // Predictor
  temp = FS * lambda';
  for(n in 1:N) mu[n] = exp(alpha[n] + b0 + temp[n]);
  
}
model{
  // Hyperpriors
  sigma_a ~ gamma(2,0.1); //Row intercepts hyperprior
  sigma_b0 ~ gamma(2,0.1); //Species intercept effects hyperprior
  eta ~ gamma(2,0.1); //Parameter of the cholesky prior for FS correlation structure
  
  // Priors
  alpha ~ normal(0, sigma_a); //Regularizing prior for row intercepts
  b0 ~ normal(0, sigma_b0); //Regularizing prior for species intercepts
  for(d in 1:D) L_diag[d] ~ chi_square(M-d);
  L_lower ~ normal(0,1); //Regularizing prior for lower loadings
  Rho ~ lkj_corr_cholesky(eta); //Regularizing prior for Rho
  
  for(i in 1:N){	
    Y[i,] ~ poisson(mu[i]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  matrix[P,P] cov_lambda;
  
  cov_lambda = multiply_lower_tri_self_transpose(lambda);
}

Bob_Carpenter · May 1, 2018, 3:08am

If L is square and lower diagonal, then as long as it has positive diagonals, L * L' will be positive definite. Otherwise, you can use a non-square L for lower ranks. I don’t know if there’s a way to force it to be semi-definite of unknown rank.

If someone wants to code up an efficient Wishart and inverse Wishart for Cholesky factors of covariance matrices, that’d be great. We have the correlation-based LKJ prior for Cholesky factors of correlation matrices, but that’s it for efficient priors.

Cool!

andre.pfeuffer · May 1, 2018, 3:37am

The efficient Wishart for Cholesky factors of covariance matrices is already in the Stan manual pg. 350.
The inverse W., is easy to calculate, to sample one has to use the transposed the Inverse.

Since A^-1 = (L^-1)^T L^-1 = U^-1 (U^-1)^T.

Topic		Replies	Views
Sparse infinite factor priors Modeling	14	1430	June 12, 2018
Convergence problems with factor loadings in multivariante autoregressive model Modeling rstan , fitting-issues , specification , ecology	11	1604	July 31, 2020
Constraining Latent Factor model / baysian probabalisic matrix factorization to remove multimodality Modeling techniques , fitting-issues , specification	27	5416	March 17, 2020
Non-convergence of latent variable model Modeling	28	6206	February 13, 2020
Underestimating correlation coefficients with LKJ prior Modeling	24	6685	September 15, 2020

Help with factor analysis (latent variable model)

Related topics