Hi all,
I am working on a generalised linear latent variable model to predict the joint abundance of different species at different sites. The model uses a factor analytic approach to allow the predicted abundances of each species to co-vary, through the inclusion of D site-specific uncorrelated latent factors, and a corresponding loading matrix \Lambda, which describes the response of each species to each latent factor.
The model is as follows:
Where y_{ij} is the count of species j at site y, \alpha_i is a site-specific varying intercept, \beta_{0j} is a species-specific varying intercept, \beta_j is a vector of varying species-specific coefficients for site-level covariates x_i, and \lambda_j is the response of species j to the site-specific latent factor x_i.
I have written the model in Stan as follows, currently leaving out the covariates for an intercept-only model to aid development while I build my understanding:
data {
int<lower=1> N; //Number of samples
int<lower=1> S; //Number of species
int<lower=1> D; //Number of latent dimensions
array[N, S] int Y; //Species matrix
}
transformed data{
// Number of non-zero lower triangular factor loadings
// Ensures identifiability of the model - no rotation of factors
int<lower=1> M;
M = D * (S - D) + D * (D - 1) / 2;
}
parameters {
// Site intercepts
real a_bar;
real<lower=0> sigma_a;
vector[N] a;
// Species intercepts
real<lower=0> sigma_b0;
vector[S] b0;
// Factor parameters
vector[M] L_lower; // lower triangle of species loadings
vector[D] L_diag; // Diagonal of species loadings
real<lower=0> sigma_L; // variance of species loadings
// Latent variables
matrix[D, N] LV; // Per-site latent variable
// NegBin parameters
real<lower=0> kappa;
}
transformed parameters {
matrix[S, D] Lambda;
// Assign parameters to L matrix:
{
int idx2; // Index for the lower diagonal loadings
idx2 = 0;
// Constraints to allow identifiability of loadings
for (i in 1:(D-1)) { for (j in (i+1):(D)){ Lambda[i,j] = 0; } } // 0 on upper diagonal
for (i in 1:D) Lambda[i,i] = L_diag[i]; // Positive values on diagonal
for (j in 1:D) {
for (i in (j+1):S) {
idx2 = idx2+1;
Lambda[i,j] = L_lower[idx2];
}
}
}
}
model {
// Factor priors
to_vector(LV) ~ std_normal();
L_lower ~ std_normal();
L_diag ~ std_normal();
// Random effect priors
a ~ std_normal();
b0 ~ std_normal();
a_bar ~ std_normal();
sigma_a ~ exponential(1);
sigma_b0 ~ exponential(1);
sigma_L ~ exponential(1);
kappa ~ exponential(1);
array[N] vector[S] mu;
for (i in 1:N) {
mu[i,] = exp(a_bar + a[i] * sigma_a + b0 * sigma_b0 + (Lambda * sigma_L) * LV[,i]);
Y[i,] ~ neg_binomial_2(mu[i, ], kappa);
}
}
generated quantities {
// Calculate implied covariance matrix among species
matrix[S, S] COV;
COV = multiply_lower_tri_self_transpose(Lambda);
}
I am working with the spiders dataset from the mvabund package in R for testing, as it is conveniently small, and have also built a function to simulate data from the priors for additional testing (see Identifiability of GLLVM (factor analytic model) for code). The spiders dataset consists of counts of 12 species of spiders at 28 sites, with each site represented only once in the data, for a 28 x 12 matrix.
The model compiles and runs (though without any covariates it obviously doesn’t make very good predictions), and I am trying to figure out how to compare this simple model with more complex models once I build them, using cross-validation or LOO.
The goal behind the model would be to predict the probable range of abundance for each species at some new, unmeasured site. This suggests to me that I will have to marginalise out both the site-specific intercept \alpha_i as well as the D site-specific latent factors z_i, as otherwise prediction can only happen conditional on the estimates of these variables for specific sites. My intuition on doing that would be to generate a set of uncorrelated vectors of draws from a normal distribution, one per variable, and compute the log probability of each observation for all combinations of these vectors for each set of MCMC draws, which sounds computationally expensive!
My questions are:
- Would my method of estimating the marginal distribution work, and is there an easier way to approximate it?
- Assuming it is possible to estimate, what is the best method to perform cross-validation? Will PSIS-LOO work here, or will I have to look into K-fold methods? Ideally the solution would not involve re-fitting the model with left-out data as the dataset we are building will have possibly thousands of species across about 1,000 observations, which I suspect will be too time-consuming to run re-fits on.