Constructing factor loadings as a sum of two parts

I’ve finally found time to follow up an idea suggested to me by @jsocolar to incorporate biological trait information into a GLLVM/exploratory factor analysis model.

To briefly state the problem, I have data on the abundance of 30 fungal species on approximately 600 seedlings. These fungi have to compete for space on their hosts, so I’m interested in exploring the residual correlations among pairs of fungal species (I have no measured environmental covariates). I hypothesise that some component of this residual correlation is a function of fungal traits - each fungal species is assigned to one of 6 categorical traits.

It’s simple enough to find this residual correlation matrix among species - construct a simple factor analysis model, with a factor loadings matrix \textbf{L} and latent variable matrix \textbf{Z}. We can get the residual correlations among species by calculating \Sigma = \textbf{L}\textbf{L}'.

\begin{align} y_{ij} &\sim \textrm{Poisson}(\lambda_{ij}) \\ log(\lambda_{ij}) &= \textbf{l}_j\textbf{z}_i \end{align}

To incorporate trait information into this model, @jsocolar suggested replacing each factor loading with a trait-specific component (which I’ve called \delta here), and a species-specific component (which I’ve called \gamma). We could then compute either partial correlation matrices (e.g. \Sigma_{\textrm{trait}} = \boldsymbol{\delta}\boldsymbol{\delta}') or the whole component.

y_{ij} \sim \textrm{Poisson}(\lambda_{ijk}) \\ log(\lambda_{ij}) = (\boldsymbol{\gamma}_{j} + \boldsymbol{\delta}_{\textrm{trait}[j]}) \textbf{z}_i

The problem comes in due to identifiability concerns - as I learned here, these models are only identifiable as far as \Sigma = \textbf{L}\textbf{L}', as there are an infinite number of possible rotations of the factor loadings. To fix this, in Bayesian models it’s typical to force the upper triangle of the factor loading matrix to be zero, and sometimes the diagonal itself to be strictly positive - this constrains the factor loadings to either a single mode or a set of sign-flipped modes, which are easy to deal with in the generated quantities.

However, by constructing the factor loadings as a sum of two components, I don’t think it’s possible to apply this constraint, which means Stan has severe problems exploring the highly multimodal posterior - with the code below, I get warnings that all samples have reached the maximum treedepth, and (with the inclusion of an offset in the model) a lot of divergences, and BFMI warnings. Increasing the maximum treedepth or adapt_delta does very little, as one might expect.

Does anyone have any thoughts on how to parameterise this model that might help with identifiability? Or (perhaps more realistically), can anyone suggest a different modelling framework that would allow me to explore the contribution of shared traits versus species identity to correlations in species abundance?

edit: I should add that I have no interest in a direct interpretation of any of the factor loadings - only on the correlation/covariance matrices they imply.

edit 2: thinking as I go, one option might be to split the trait correlations away from the latent variable altogether - estimate a hierarchical intercept with correlations between traits, and include only a species-specific factor loading with identfiable constraints. This would mean I could reduce the number of parameters by avoiding directly estimating a 30x30 species correlation matrix, and only have to directly estimate the smaller 6x6 trait correlation matrix. Would this approach give me a similar final output?

data {
  int<lower=1> N; //Number of seedlings
  int<lower=1> S; //Number of species
  int<lower=1> T; //Number of traits
  int<lower=1> D; //Number of latent dimensions
  
  array[N * S] int Y; //Abundance
  array[N * S] int Seedling; //Seedling id
  array[N * S] int Species; //Species identity
  array[N * S] int Trait; //Trait
}
parameters {
  // Species factor loadings
  matrix[S, D] Gamma;
  
  // Trait factor loadings
  matrix[T, D] Delta;
  
  // Latent variables
  matrix[D, N] LV; 
}
model {
  // Factor priors
  to_vector(LV) ~ std_normal();

  // Loading priors
    to_vector(Gamma) ~ normal(0, 3);
    to_vector(Delta) ~ normal(0, 3);

  // Model
  vector[N * S] mu;
  for (i in 1:(N * S)) {
      mu[i] = (Gamma[Species[i], ] + Delta[Trait[i], ]) * LV[, Seedling[i]];
      Y[i] ~ poisson_log(mu[i]);
  }
}
1 Like

Hello!

Your post just made a lot thoughts coming back to the surface of my mind ahah.

I thought a lot about how to integrate traits to inform loadings. Some ecological theories would link distant between traits to species interactions (and possibly covariance?). I thus thought about making the latent \lambda to be distributed multinormally, with the covariance matrix depending upon some covariance function of distances among species in the trait space, for example :

log(\lambda_{ij}) \sim MultiNormal(\mu, \Sigma)\\ \Sigma_{jj'} = exp(-(\frac{d_{jj'}}{v})^2)

I made some successfull simulations, but I realized the problems I was interested in needed a lot more data points than I ever had in my projects!

However, I guess this approach is not that much adapted to discrete traits… At least not if you wish to make inference about traits.

Another idea poping : what about using a factor loading by categories? It is kind of a “Functional group” approach, it could be pertinent to observe how much the structure of the covariance matrix is degraded by summarizing species by their functional traits…

Stan is a wonderfull tool. I dare suggesting the gjam R package by Clark, which implement an efficient dimensionality reduction of the residual covariance matrix for species abundance modelling. Drawback : it only samples one chain, so I am sure the posterior is well explored, because of my habits as a stan user…

Hope that wandering thought might help!
All the best,
Lucas

EDIT : added log in the equation.

1 Like

Is there a reason that it wouldn’t work to apply the constraint on just the \gamma component of the factor loadings? It seems like this would fix the latent variables Z_i to some particular mode. Once the latent variables are fixed, the model becomes y ~ Z1 * traits + Z2 * traits + (Z1 + Z2 | species) which shouldn’t have any problems with estimation. So all you need is a constraint on something, whether from the trait-based part or the residual part, to identify which mode you are in in the space of how the latent variables get configured.

Edited to add:
Tangentially, note that I’ve kept traits out of the random effect of species above. Make sure that your construction does this too. Unless you’ve measured within-species trait variation, then there is no such thing as a variable effect of trait across species; the effect of trait is intrinsically interspecific.

1 Like

Hi both,

Yes, this could be an option, or something similar. A thought I had would be to calculate species co-occurrence probabilities using the approach in the cooccur package, which estimates the probability of co-occurrence using a hypergeometric distribution, and then model those resulting probabilities as a function of trait. However the approach is limited as it only takes presence/absence data, whereas I have a wealth of abundance data for these species - I’m not sure if these can be weighted accordingly.

As an aside, it might be more correct to term these traits as functional groups, but generally in mycology mycorrhizal fungi are classed as a single functional group, which is why I chose trait instead.

Thanks for the recommendation - I haven’t seen this before, it looks quite interesting and potentially applicable to some of the other things I’m working on!

When I first tried the model I had constraints on both components, and was having a hard time fitting anything, and seeing some multimodal effects. I’ve rewritten it to put constraints on just the \gamma components (code below), and get a lot of divergences and max treedepth warnings:

Warning: 555 of 4000 (14.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

Warning: 3426 of 4000 (86.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.

Digging in to the trace plots, there’s definitely still multimodal (?) behaviour going on, though it’s constrained to a degree - however, some of the chains are wandering around a lot:

Gamma:

gamma_trace.pmg

Delta:

delta

Latent variables:

lv

I don’t think I’ve done this, but I’m not sure I’ve specified the random effect correctly in the code. As I’m only assigning parameters to the lower triangle of the species factor matrix, I’m modelling that as a single vector with a Gaussian prior with a hierarchical sd, and then munging that vector into a matrix. I’m not sure if this leads to correct inferences for the random effect.

Code here:

data {
  int<lower=1> N; //Number of samples
  int<lower=1> S; //Number of species
  int<lower=1> T; //Number of traits
  int<lower=1> D; //Number of latent dimensions
  
  array[N * S] int Y; //Abundance
  array[N * S] int Seedling; //Seedling id
  array[N * S] int Species; //Species identity
  array[N * S] int Trait; //Trait matrix
  array[N * S] int Offset; //Offset
}
transformed data{
  // Number of non-zero lower triangular factor loadings
  // Ensures identifiability of the model - no rotation of factors
  int<lower=1> M_s;
  M_s = D * (S - D) + D * (D - 1) / 2 + D;
}
parameters {
  // Species factor loadings
  real<lower=0> sigma_gamma; // variance of species loadings
  vector<multiplier=sigma_gamma>[M_s] L_gamma; // lower triangle of species loadings

  // Trait factor loadings
  matrix[T, D] Delta;
  
  // Latent variables
  matrix[D, N] LV; // Per-site latent variable
}
transformed parameters {
  // Construct factor loading matrices
  matrix[S, D] Gamma;

  // Constraints to allow identifiability of loadings
  for (i in 1:(D-1)) {
    for (j in (i+1):(D)){
      Gamma[i, j] = 0;
    }
  }
  {
    int index;
    index = 0;
    for (j in 1:D) {
      for (i in j:S) {
        index = index + 1;
        Gamma[i, j] = L_gamma[index];
      }
    }
  }
}
model {
  // Factor priors
  to_vector(LV) ~ std_normal();

  // Random effect priors
  to_vector(L_gamma) ~ normal(0, sigma_gamma);
  to_vector(Delta) ~ normal(0, 3);

  vector[N * S] mu;
  for (i in 1:(N * S)) {
      mu[i] = (Gamma[Species[i], ] + Delta[Trait[i], ]) * LV[, Seedling[i]];
      Y[i] ~ poisson_log(mu[i]);
  }
}
generated quantities {
  // Calculate covariance matrices
  matrix[S, S] Cov_gamma;
  matrix[T, T] Cov_delta;
  
  Cov_gamma = multiply_lower_tri_self_transpose(Gamma);
  Cov_delta = multiply_lower_tri_self_transpose(Delta);
}
1 Like

So I haven’t dug deeply but from what I can see your implementation looks fine. It’s neat that your trait is categorical, which gives it an interesting symmetry with the species (each observation corresponds to exactly one species and exactly one trait; you would need to express the trait part in a different way if you had overlapping categorical traits or if you had continuous traits).

My guess here is that the effects structure might be too weakly identified. You have 30 species and 6 trait categories, which suggests that perhaps some trait categories contain very few species. Note that in the limit where a category contains only one species, it will be completely impossible to attribute the loading to the species versus the trait. All that’s identified in this case is the total amount of loading. Again, we can see what would happen if we fixed the latent factor Z and modeled y ~ Z + (1 + Z | species) + (1 + Z | trait) in a standard GLMM framework. For the trait that includes one species, the intercept and coefficient for each latent factor Z associated with that species would be completely non-identified from the intercept and coefficients associated with that trait, which would mess with your posterior. Now, on top of that (or maybe I should say underneath that), we have the latent variable structure…

The other thing I notice here is that parameters like Delta[2,2] and Gamma[4,2] look like they are completely not identified here. Maybe the true posterior, if you could compute it, actually does identifiy these parameters, but my intuitive guess is that it doesn’t.

1 Like

It’s interesting to me because my natural inclination is to think of traits as categories - as someone with a background in plant ecology, I tend to think in terms of broad functional groups (e.g. herb, forb, shrub, etc.), but even going beyond that, I think a lot of traits can be usefully defined in categorical terms (e.g. leaf shape, presence/absence of basal meristem, presence of wood). I’ve found some continuous traits (e.g. leaf area) to be poorly descriptive, especially when they elide useful and obvious distinctions between species.

I think another natural extension of this approach here would be to include factor loadings for different taxonomic ranks - e.g. fitting extra loadings at the genus and family level. In the absence of trait information, this could allow inference about the extent to which species correlations are a function of homology over species-specific deviations.

This is a very good point, which I should have caught when fitting the model above. I have data from 18 sites, and the final goal is to make inferences about interactions within each site. Species presence/absence is strongly site-dependent (there are lots of zeroes!). At the moment I’ve not included site in the model, so I just subset the data for one site and fit the model - but I should probably have filtered out species that aren’t present at all.

I looked into the data for the first site, and (of the 5 species present), only one trait contained more than one species. I would imagine this pattern holds more generally across all the sites (there’s 6 traits, and on average 5.6 species per site). Fitting the model to just this subset data also gave bad results, though the traces do look a bit better, probably due to the lack of zeroes.

My feeling now is that it would be impossible to fit this model without stronger assumptions on the trait factor loadings, such that they are informed by data from all sites. One option would be to assume a single set of trait factors across all sites, while allowing the species factor loadings to vary by site. The degree to which one could accept that result is the degree to which a single trait correlation matrix would be biologically plausible, though - it strikes me as a possibility, but I can imagine that environmental factors play a strong role in determining that outcome. I think it might also have implications on interpretations of the latent variables themselves, and what kind of latent unmeasured environmental covariates they can represent. The other option might be to model the trait factor loadings for each site hierarchically, allowing information to be pooled across sites - but this might be harder to fit, and I’m not sure if this avoids the identification problem.

1 Like

One other thought - would it help to include some kind of occupancy component in the model? (we can assume perfect detection here, my time at the microscope deserves it!)

If we don’t include an occupancy component, then if a species is not present at a site, but it shares a trait with a species that is present, it will get a non-zero prediction for abundance. We could modify the model as such:

y_{ijk} \sim \textrm{Poisson}(O_{jk}\lambda_{ijk})

Where O_{jk} is 1 if species j is present at site k and 0 otherwise (passed as data). Would it also make sense to fix the species-specific factor loadings to 0 in the case that a species isn’t present? (My intuition says that otherwise, in the above model there will be no information to inform the parameter at all, so the sampler might have a hard time there).

The model you’re suggesting is effectively a hurdle model with a truncated Poisson… with the hurdle component not modeled but just fixed to a completely overfitted estimate. FWIW, this doesn’t strike me as a good idea, partly because overfitting them feels like cheating but mostly because I think that the zeros are likely to be informative in the abundance-based process that you are trying to understand.

1 Like

That makes sense, I think.

Do you think this is a problem in general? I guess in reality, one would ignore/avoid making model predictions for cases where species aren’t expected to be present and nonsense would result.

I don’t think it’s a problem in general. We do expect these species to be present at these sites with some non-zero probabiliy; we observe that they are absent. To me this question has a little bit the same flavor as asking whether it’s a problem to model the abundance as Poisson when we observe that the true abundance is fixed to some particular value.

A problem would arise if the process that governs presence is substantially different from the process that governs abundance conditional on presence. This can certainly happen in ecology, e.g. if dispersal limitation governs presence and ecological factors govern conditional abundance. In such a situation, if you are interested only in ecological process and you are willing to assume that all of the zeros result from dispersal limitation and none of them are related to ecological process, then you might throw out all of the zeros and try to understand ecological process only across the species/site combinations where you’ve had the opportunity to observe the outcome of ecological process.

So this is the part of the problem where domain-specific knowledge comes in, and what you describe here is what I was considering when I suggested including the occupancy component. To be a little more clear, the response variable here is not really ‘abundance’, but the number of times each seedling has been colonised by each fungal species. Each fungal species is likely only present at each site as one to a handful of ‘individuals’, and we can (somewhat) reasonably assume they’re distributed homogeneously within each site - thus, abundance and presence here are fairly well decoupled, with colonisation intensity depending on competitive interactions and the specifics of host-fungal interactions. That said, it may also be reasonable to think that the competitive process I’m trying to model could also result in zeros for a species at a whole site - so I’ll take some more time to mull this over before committing one way or another.


In the meantime, I’m having some trouble getting the model to fit, and I’m not sure where the problems lie or if they’re fixable. I’ve implemented the trait loadings as a hierarchical effect across sites with a non-zero mean, as I felt that there would be problems of interpretation for the latent varaibles having both ‘global’ and ‘local’ factor loadings. The model fits with no divergences, but every iteration hits a maximum treedepth warning:

Warning: 4000 of 4000 (100.0%) transitions hit the maximum treedepth limit of 10.

My understanding is that treedepth warnings are less important than divergences, and indicate that the sampler has not been able to explore the posterior efficiently, meaning they can often be ignored if the other diagnostic factors look good - here, there are lots of Rhats > 1.1, and ESS is incredibly low. Looking at the trace plots, they’re a lot worse than they were for the model with a single site, with chains exploring completely different modes (these chains are for the same site as the ones in the single site model):

Is this at all fixable, or am I maybe in the situation where the posterior I’m interested in just isn’t identifiable with this dataset?

data {
  // Indexes
  int<lower=1> N; //Number of samples
  int<lower=1> S; //Number of species
  int<lower=1> T; //Number of traits
  int<lower=1> G; //Number of grids
  int<lower=1> D; //Number of latent dimensions
  
  // Array data
  array[N * S] int Y; //Abundance
  array[N * S] int Seedling; //Seedling id
  array[N * S] int Species; //Species identity
  array[N * S] int Grid; // Grid identity
  array[N * S] int Trait; //Trait matrix
  array[N * S] int Offset; //Offset
}
transformed data{
  // Number of non-zero lower triangular factor loadings
  // Ensures identifiability of the model - no rotation of factors
  int<lower=1> M;
  M = D * (S - D) + D * (D - 1) / 2 + D;
}
parameters {
  // Species factor loadings
  matrix[M, G] z_l_gamma;
  vector<lower=0>[M] sigma_l_gamma;
  cholesky_factor_corr[M] L_Rho_l_gamma;

  // Trait factor loadings
  matrix[D*T, G] z_delta;
  vector<lower=0>[D*T] sigma_delta;
  cholesky_factor_corr[D*T] L_Rho_delta;
  matrix[T, D] mu_delta;
  
  // Latent variables
  matrix[D, N] LV; // Per-site latent variable
}
transformed parameters {
  matrix[G, M] l_gamma;
  l_gamma = (diag_pre_multiply(sigma_l_gamma, L_Rho_l_gamma) * z_l_gamma)';
  
  matrix[G, D * T] delta;
  delta = (diag_pre_multiply(sigma_delta, L_Rho_delta) * z_delta)';
  
  // Construct factor loading matrices
  array[G] matrix[S, D] Gamma;
  array[G] matrix[T, D] Delta;

  // Constraints to allow identifiability of loadings
  for(g in 1:G){
    for (i in 1:(D-1)) {
      for (j in (i+1):(D)){
        Gamma[g, i, j] = 0;
      }
    }
    {
      int index;
      index = 0;
      for (j in 1:D) {
        for (i in j:S) {
          index = index + 1;
          Gamma[g, i, j] = l_gamma[g, index];
        }
      }
    }
    
    Delta[g] = to_matrix(delta[g, ], T, D);
  }
}
model {
  // Factor priors
  to_vector(LV) ~ std_normal();

  // Species factor loading priors
  to_vector(z_l_gamma) ~ normal(0, 1);
  sigma_l_gamma ~ exponential(1);
  L_Rho_l_gamma ~ lkj_corr_cholesky(2);
  
  // Trait factor loading priors
  to_vector(z_delta) ~ normal(0, 1);
  sigma_delta ~ exponential(1);
  L_Rho_delta ~ lkj_corr_cholesky(2);
  to_vector(mu_delta) ~ normal(0, 2);
  
  vector[N * S] mu;
  for (i in 1:(N * S)) {
      mu[i] = log(Offset[i]) + 
              (mu_delta[Trait[i], ] + Delta[Grid[i], Trait[i], ] + 
                Gamma[Grid[i], Species[i], ]) * LV[, Seedling[i]];
      Y[i] ~ poisson_log(mu[i]);
  }
}