Generalized linear latent variable model in Stan


#1

Hello everyone,

This is my first post on the Stan discussion forum! A big thank you to the developers and to the community for all the excellent documentation that is out there.

I’m trying to reproduce a model from Warton et al 2015. This approach uses latent variables to approximate the correlation among many responses – in this case, between multiple species. However, although my model sort of runs, the answer doesn’t look at all like simulated data.

As far as I understand it, the model looks like this:

30

for the i sites and j species. The key part is the Uij, which are supposed to estimate the correlation among species across all sites. g() is the link function (e.g. for count data).

I wasn’t sure how much detail to include in this question, so I will include the Stan program below. I have a completely reproducible example, including R code to simulate data and fit the model, which is all in a github repository – the entire story is in this file. You can see a rendered version of this file on this website.

data{
    int<lower=1> N;
    int<lower=1> N_site_id;
    int<lower=1> N_spp_id;
    int obs_abd[N];
    int spp_id[N];
    int site_id[N];
    int<lower=2> D;              // number of latent dimensions 
}
transformed data {
  int<lower=1> M;
  // We calculate the number of below-zero loadings 
  // the second term is the number in the "square part" below diag
  // the first term is the "non square part" below that
  M  = D * (N_spp_id - D) + D * (D-1)/2;  // number of non-zero loadings
}
parameters{
    real inter;
    vector[N_site_id] site_intercept;
    real<lower=0> sitevar;
    vector[N_spp_id] spp_intercept;
    real<lower=0> sppvar;
    // adding this part: latent variables? is that you?
    matrix[N_site_id, D] latent_vars;
    // matrix[2, N_spp_id] spp_loadings;
    // blog parameters
    vector[M] spp_load_L_t;   // lower diagonal elements of L = spp_loadings
    vector<lower=0>[D] spp_load_L_d;   // diagonal elements of L = spp_loadings
   // vector<lower=0>[N_spp_id] psi;         // vector of variances
    // hyperparameters
    //real<lower=0>   mu_psi;
    //real<lower=0>  sigma_psi;
    // real   mu_lt;
    // real<lower=0>  sigma_lt;
}
transformed parameters{
  cholesky_factor_cov[N_spp_id,D] spp_loadings;  //lower triangular factor loadings Matrix 
  {
    int idx2;
    idx2 = 1;
    for(i in 1:N_spp_id){
      for(j in (i+1):D){
        spp_loadings[i,j] = 0; //constrain the upper triangular elements to zero 
      }
    }
    for (j in 1:D) {
      // add the diagonal elements
      spp_loadings[j,j] = spp_load_L_d[j];
      for (k in (j+1):N_spp_id) {
        // add the lower triangular elements
        spp_loadings[k,j] = spp_load_L_t[idx2];
        idx2 = idx2 + 1;
      } 
    }
  } 
}
model{
    vector[N] lamb;
    // the hyperpriors 
    // mu_psi ~ cauchy(0, 1);
    // sigma_psi ~ cauchy(0,1);
    // mu_lt ~ cauchy(0, 1);
    // sigma_lt ~ cauchy(0,1);
    // the priors 
    // note that spp_load_L_d is constrained with lower=0
    spp_load_L_d ~ cauchy(0,3);
    spp_load_L_t ~ cauchy(0, 4);
    // psi ~ cauchy(mu_psi,sigma_psi);
    sppvar ~ cauchy( 0 , 4 );
    spp_intercept ~ normal( 0 , sppvar );
    sitevar ~ cauchy( 0 , 4 );
    site_intercept ~ normal( 0 , sitevar );
    inter ~ normal( 0 , 5 );
    // latent variable priors??
    for (j in 1:N_site_id){
      for(k in 1:D ){
        latent_vars[j, k] ~ normal(0, 1);
      }
    }
    // the liklihood
    // add the row of 
    for ( i in 1:N ) {
        lamb[i] = inter + site_intercept[site_id[i]] + spp_intercept[spp_id[i]] + latent_vars[site_id[i],] * spp_loadings[spp_id[i],]';
    }
    obs_abd ~ poisson_log( lamb );
}
generated quantities{
    vector[N] lamb;
    matrix[N_spp_id,N_spp_id] Q;   //Covariance mat
    real dev;
    dev = 0;
    for ( i in 1:N ) {
        lamb[i] = inter + site_intercept[site_id[i]] + spp_intercept[spp_id[i]] + latent_vars[site_id[i],] * spp_loadings[spp_id[i],]';
    }
    dev = dev + (-2)*poisson_log_lpmf( obs_abd | lamb );
    //add the correlation matrix in here! Q = L*L'
    Q = spp_loadings * spp_loadings'; //+ diag_matrix(psi);
    // OK this gives me an error when I declare that Q needs to be positive definite, by using cov_matrix[N_spp_id]
}

To illustrate the problem, here is a quick plot of a DCA with the original simulated data on the left and one produced from latent variables on the right, which i think are supposed to resemble each other.


#2

Howdy howdy, I had a quick look at this.

spp_intercept ~ normal( 0 , sppvar );

The second argument to normal for a Stan model is a standard deviation, but that’d only mess up your labeling.

matrix[N_site_id, D] latent_vars;

These are the zs, right? Aren’t there just N_sites of them?

The matrix uij is a function of z and lambda. Given those are both parameters, it’s probably going to be painful to sample that (from the paper “Estimation is difficult because neither the latent variables zi nor the factor loadings lj are known”). How did they get around this?

It’s probably worth your time to generate stuff from the approximate model exactly. This’ll help you isolate issues with the model (if you know exactly how data was generated – hopefully you can get it to fit). This latent variable vs. factor loadings thing seems like something worth understanding.

If that works out, then you can move to working with this model when it’s an approximation.


#3

Estimation is difficult because neither the latent variables zi nor the factor loadings lj are known”). How did they get around this?

You need to do two things:

  • constrain the sign of one species on each latent variable.
  • If you have P species, the nth latent variable should only have loadings for P - (n-1) species, others will =0

This is a pretty great run-down of issues and solutions for this kind of analysis:
https://rfarouni.github.io/assets/projects/Identification/Identification.html


#4

You’re probably also going to need a non-centered parameterization for your hierarchical parameters like site_intercept. The Cauchy priors can also cause computational problems as we’re finding more and more because with little data, they’ll cause long tail excursions which don’t really affect posterior expectations, so aren’t really doing anything for you.

For efficiency, you want to recode lamb as a vectorized matrix multiply. It’s both more efficient and less memory intensive.

Your Q is only guaranteed to be positive definite if spp_loadings is lower triangular with positive values on the diagonal. We also have a multiply_lower_tri_self_transpose() function for just this purpose.