Hi all,
I’ve been trying to implement a generalised linear latent variable model (GLLVM; see Warton et al. 2015) to analyse multivariate data corresponding to species communities. The model is as follows, for species j at site i:

g(\mu_{ij}) = a_i + \beta_{0j} + \textbf{x}_i'\beta_j + u_{ij}

where

u_{ij} = \textbf{z}_i'\lambda_j

Here g() is a link function, a_i is a per-site intercept, \beta_{0j} is a species-specific intercept, \textbf{x}_i' is a vector of site-level covariates, and \beta_j is a vector of species-specific coefficients for these covariates. Finally, u_{ij} is the product of the latent variables \textbf{z}_i and their species-specific factor loadings \lambda_j. The goal of the model is twofold: the latent variables \textbf{z}_i can be used to construct a constrained or unconstrained ordination, and the residual correlation between species can be reconstructed by calculating \Sigma = \Lambda\Lambda^T.
Following this post by @rfarouni (Bayesian Factor Analysis), I have a working implementation of the model (leaving out the species-specific responses \textbf{x}_i'\beta_j for simplicity), which constrains all upper-triangular elements of the factor loadings to zero and the diagonal elements to be positive. My understanding is that these constraints aid identifiability by reducing the number of possible latent variable solutions (e.g. different orderings/signs of the factor loadings) to one. The post also recommends setting the init values for each chain close to one another, so I set all initial values to 0 using init = 0. This approach appears to work on a dataset simulated from the priors, finding a single solution, but when applied to a real dataset (the spiders dataset from the R package mvabund), different chains converge on different solutions for the latent variables even with these constraints:
Despite finding different solutions for the values of \textbf{z}_i and \lambda_j, the linear predictors produced by each chain are largely consistent (note that the model fits the observed data terribly, likely due to the lack of environmental predictors):
I’m new to the world of factor analysis (and no mathematician), but my understanding is that there are in principle infinitely many solutions for the latent variables and their loadings. From the perspective of constructing the species residual covariance matrix, the multiple solutions pose little issue as far as I can tell, as long as the solutions for \Lambda\Lambda^T are similar enough; but for use as an ordination technique this is obviously problematic! Is it possible to improve identifiability/reduce the number of possible latent variable solutions further, or is this likely to be as good as it gets? If so, the only remedy I can think of for consistent ordination would be to run many more parallel chains and select a subset that converge on the same solution for further analysis, but this seems rather arbitrary and, for the more complex models I’d like to build, inefficient and wasteful of wall time.
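To make the non-identifiability above concrete, here is a quick sketch (in Python/NumPy for brevity; the loadings matrix is made up) showing that any orthogonal transform Q of the latent space leaves \Lambda\Lambda^T unchanged, which is exactly why different chains can disagree on \Lambda while agreeing on the residual covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

S, D = 12, 2                        # species, latent dimensions (arbitrary)
Lambda = rng.normal(size=(S, D))    # an arbitrary loadings matrix

# A rotation of the latent space by angle theta (any orthogonal Q works,
# including reflections):
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# (Lambda @ Q) @ (Lambda @ Q).T = Lambda @ Q @ Q.T @ Lambda.T = Lambda @ Lambda.T
Sigma1 = Lambda @ Lambda.T
Sigma2 = (Lambda @ Q) @ (Lambda @ Q).T

print(np.allclose(Sigma1, Sigma2))  # True
```

The lower-triangular/positive-diagonal constraint is meant to rule these transforms out by picking one representative from each equivalence class, but the posterior can still have multiple well-separated modes that different chains fall into.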
Model code is as follows:
data {
  int<lower=1> N; // Number of samples
  int<lower=1> S; // Number of species
  int<lower=1> D; // Number of latent dimensions
  array[N, S] int Y; // Species matrix
}
transformed data {
  // Number of non-zero lower-triangular factor loadings;
  // ensures identifiability of the model
  int<lower=1> M = D * (S - D) + D * (D - 1) / 2;
}
parameters {
  // Site intercepts
  vector[N] a;
  real a_bar;
  real<lower=0> sigma_a;
  // Species intercepts
  vector[S] b0;
  real<lower=0> sigma_b0;
  // Factor parameters
  vector[M] L_lower;         // lower triangle of species loadings
  vector<lower=0>[D] L_diag; // diagonal of species loadings, constrained positive
  matrix[N, D] LV;           // per-site latent variables
  // NegBin dispersion
  real<lower=0> kappa;
}
transformed parameters {
  matrix[S, D] L;
  // Assign parameters to the loadings matrix L,
  // with constraints to allow identifiability:
  {
    int idx2 = 0; // index into the lower-triangular loadings
    // Zeros above the diagonal
    for (i in 1:(D - 1)) {
      for (j in (i + 1):D) {
        L[i, j] = 0;
      }
    }
    // Positive values on the diagonal
    for (i in 1:D) {
      L[i, i] = L_diag[i];
    }
    // Free loadings below the diagonal, filled column by column
    for (j in 1:D) {
      for (i in (j + 1):S) {
        idx2 += 1;
        L[i, j] = L_lower[idx2];
      }
    }
  }
}
model {
  // Factor priors
  to_vector(LV) ~ std_normal();
  L_lower ~ std_normal();
  L_diag ~ std_normal();
  // Random effect priors
  a ~ normal(0, 1);
  b0 ~ normal(0, 1);
  a_bar ~ normal(0, 1);
  sigma_a ~ exponential(1);
  sigma_b0 ~ exponential(1);
  // NegBin dispersion prior
  kappa ~ exponential(1);
  // Likelihood
  for (i in 1:N) {
    for (j in 1:S) {
      real mu = exp(a_bar + a[i] * sigma_a + b0[j] * sigma_b0
                    + dot_product(L[j], LV[i]));
      Y[i, j] ~ neg_binomial_2(mu, kappa);
    }
  }
}
generated quantities {
  array[N] vector[S] linpred;
  matrix[S, S] COV = multiply_lower_tri_self_transpose(L);
  for (i in 1:N) {
    for (j in 1:S) {
      linpred[i, j] = a_bar + a[i] * sigma_a + b0[j] * sigma_b0
                      + dot_product(L[j], LV[i]);
    }
  }
}
Function to simulate dataset:
generate_sim_data <- function(N, S, D) {
  a_bar <- rnorm(1, 0, 1)
  sigma_a <- rexp(1, 1)
  sigma_b0 <- rexp(1, 1)
  a <- rnorm(N, mean = 0, sd = 1)
  b0 <- rnorm(S, mean = 0, sd = 1)
  kappa <- rexp(1, 1)
  # Loadings matrix, constrained as in the Stan model
  M <- D * (S - D) + D * (D - 1) / 2
  L_l <- rnorm(M, 0, 1)
  L_d <- abs(rnorm(D, 0, 1))
  L <- matrix(nrow = S, ncol = D)
  if (D > 1) { # guard: in R, 1:(D - 1) counts down when D = 1
    for (i in 1:(D - 1)) { for (j in (i + 1):D) { L[i, j] <- 0 } }
  }
  for (i in 1:D) L[i, i] <- L_d[i]
  idx2 <- 0
  for (j in 1:D) {
    for (i in (j + 1):S) {
      idx2 <- idx2 + 1
      L[i, j] <- L_l[idx2]
    }
  }
  LV <- matrix(rnorm(N * D, 0, 1), nrow = N, ncol = D)
  Y <- matrix(nrow = N, ncol = S)
  for (i in 1:N) {
    for (j in 1:S) {
      # b0[j] (not b0) and sum() instead of the undefined dot()
      mu_ij <- a_bar + a[i] * sigma_a + b0[j] * sigma_b0 + sum(LV[i, ] * L[j, ])
      Y[i, j] <- rethinking::rgampois(1, mu = exp(mu_ij), scale = kappa)
    }
  }
  corr <- cov2cor(L %*% t(L))
  pars <- list(a_bar = a_bar,
               sigma_a = sigma_a,
               sigma_b0 = sigma_b0,
               kappa = kappa,
               a = a, b0 = b0,
               L = L,
               LV = LV, Y = Y)
  return(list(Y = Y, corr = corr, pars = pars))
}
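For what it's worth, when comparing loadings across chains I've found it useful to check whether two solutions differ only by an orthogonal transform. Here is a sketch (Python/NumPy, with two made-up loadings matrices standing in for per-chain posterior means) of orthogonal Procrustes alignment, which finds the rotation/reflection Q minimising ||L2 Q − L1||_F via an SVD; if the aligned loadings agree, the chains disagree only about the representative, not about \Lambda\Lambda^T:

```python
import numpy as np

rng = np.random.default_rng(3)
S, D = 12, 2

# Two hypothetical per-chain loadings that differ only by an orthogonal
# transform of the latent space (here a reflection):
L1 = rng.normal(size=(S, D))
theta = 1.1
Q0 = np.array([[np.cos(theta),  np.sin(theta)],
               [np.sin(theta), -np.cos(theta)]])
L2 = L1 @ Q0

# Orthogonal Procrustes: Q = U V', where U S V' is the SVD of L2' L1,
# minimises ||L2 @ Q - L1|| over orthogonal Q.
U, _, Vt = np.linalg.svd(L2.T @ L1)
Q = U @ Vt
aligned = L2 @ Q

print(np.allclose(aligned, L1))  # True
```

In practice the per-chain solutions won't be exactly orthogonal transforms of one another (the constraints and Monte Carlo noise break that), so the residual ||L2 Q − L1|| after alignment is the quantity to inspect.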
Simulated dataset used above is here: sim_data.csv (1.0 KB)