I’ve finally found time to follow up an idea suggested to me by @jsocolar to incorporate biological trait information into a GLLVM/exploratory factor analysis model.
To briefly state the problem, I have data on the abundance of 30 fungal species on approximately 600 seedlings. These fungi have to compete for space on their hosts, so I’m interested in exploring the residual correlations among pairs of fungal species (I have no measured environmental covariates). I hypothesise that some component of this residual correlation is a function of fungal traits - each fungal species is assigned to one of 6 categorical traits.
It’s simple enough to find this residual correlation matrix among species - construct a simple factor analysis model, with a factor loadings matrix \textbf{L} and latent variable matrix \textbf{Z}. We can get the residual correlations among species by calculating \Sigma = \textbf{L}\textbf{L}'.
To incorporate trait information into this model, @jsocolar suggested replacing each factor loading with a trait-specific component (which I’ve called \delta here), and a species-specific component (which I’ve called \gamma). We could then compute either partial correlation matrices (e.g. \Sigma_{\textrm{trait}} = \boldsymbol{\delta}\boldsymbol{\delta}') or the whole component.
The problem comes in due to identifiability concerns - as I learned here, these models are only identifiable as far as \Sigma = \textbf{L}\textbf{L}', as there are an infinite number of possible rotations of the factor loadings. To fix this, in Bayesian models it’s typical to force the upper triangle of the factor loading matrix to be zero, and sometimes the diagonal itself to be strictly positive - this constrains the factor loadings to either a single mode or a set of sign-flipped modes, which are easy to deal with in the generated quantities.
However, by constructing the factor loadings as a sum of two components, I don’t think it’s possible to apply this constraint, which means Stan has severe problems exploring the highly multimodal posterior - with the code below, I get warnings that all samples have reached the maximum treedepth, and (with the inclusion of an offset in the model) a lot of divergences, and BFMI warnings. Increasing the maximum treedepth or adapt_delta does very little, as one might expect.
Does anyone have any thoughts on how to parameterise this model that might help with identifiability? Or (perhaps more realistically), can anyone suggest a different modelling framework that would allow me to explore the contribution of shared traits versus species identity to correlations in species abundance?
edit: I should add that I have no interest in a direct interpretation of any of the factor loadings - only on the correlation/covariance matrices they imply.
edit 2: thinking as I go, one option might be to split the trait correlations away from the latent variable altogether - estimate a hierarchical intercept with correlations between traits, and include only a species-specific factor loading with identfiable constraints. This would mean I could reduce the number of parameters by avoiding directly estimating a 30x30 species correlation matrix, and only have to directly estimate the smaller 6x6 trait correlation matrix. Would this approach give me a similar final output?
data {
int<lower=1> N; //Number of seedlings
int<lower=1> S; //Number of species
int<lower=1> T; //Number of traits
int<lower=1> D; //Number of latent dimensions
array[N * S] int Y; //Abundance
array[N * S] int Seedling; //Seedling id
array[N * S] int Species; //Species identity
array[N * S] int Trait; //Trait
}
parameters {
// Species factor loadings
matrix[S, D] Gamma;
// Trait factor loadings
matrix[T, D] Delta;
// Latent variables
matrix[D, N] LV;
}
model {
// Factor priors
to_vector(LV) ~ std_normal();
// Loading priors
to_vector(Gamma) ~ normal(0, 3);
to_vector(Delta) ~ normal(0, 3);
// Model
vector[N * S] mu;
for (i in 1:(N * S)) {
mu[i] = (Gamma[Species[i], ] + Delta[Trait[i], ]) * LV[, Seedling[i]];
Y[i] ~ poisson_log(mu[i]);
}
}