Sparse infinite factor priors


#1

I would like to implement sparse infinite factor priors (ala Bhattacharya & Dunson 2011) on the loading matrix in a latent factor model. Sparse infinite factor priors allow for introducing infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. I’m very new to Stan, but implemented this in Jags as follows

For j=1,...,J rows and q=1,...,Q latent factors for loading matrix L

for ( j in 1:J ) { for ( q in 1:Q ) { L[j,q] ~ dnorm( 0, phi[j,q]*tau[q] ) } }
for ( j in 1:J ) { for ( q in 1:Q ) { phi[j,q] ~ dgamma( 3/2, 3/2 ) } }
for ( q in 1:Q ) { tau[q] = prod( delta[1:q] ) }
for ( l in 1:Q ) { delta[l] ~ dgamma( 50, 1 ) }

(rate and shape parameters taken from Ovaskainen et al. 2015)

My poor attempt to do the same in Stan

// Toy model
data {
int<lower=1> N; // number of observations
int<lower=1> J; // dimension of observed data (i.e., number of cols in y_ij)
int<lower=1> I; // number of sample units (i.e., number of rows in y_ij)
int counts[N]; // counts in y_ij
int<lower=1> Q; // number of latent factors
int <lower=1,upper=I> row_id[N]; // row indices for observations
int <lower=1,upper=J> col_id[N]; // column indices for observations
}

parameters {
    matrix[I,Q] LF;
    matrix[J,Q] L;
    matrix[J,Q] phi;
    vector[Q] delta;
}

transformed parameters {
    real log_lambda[N];
    vector[Q] tau;
    for ( q in 1:Q ) tau[q] = prod(delta[1:q]);
    
    //linpred
    for ( n in 1:N ) log_lambda[n] = LF[row_id[n],]*L[col_id[n],]';
}

model {
   for( i in 1:I ) for ( q in 1:Q ) LF[i,q] ~ normal(0,1);  
   for ( j in 1:J ) for ( q in 1:Q ) L[j,q] ~ normal( 0, phi[j,q]*tau[q] );
   for ( j in 1:J ) for ( q in 1:Q ) phi[j,q] ~ gamma( 3/2, 3/2 );
   for ( l in 1:Q ) delta[l] ~ gamma( 50, 1 );

   counts ~ poisson_log( log_lambda );
}

This produces 100 error msgs:

Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: normal_lpdf: Scale parameter is -0.00111817, but must be > 0!
Initialization between (-0.1, 0.1) failed after 100 attempts. 
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

Data-generation model in R

J=20
I=100
Q=5
log_lambda <- y <- matrix(0,nrow=I,ncol=J)
beta0 <- runif(J,-1,1)
alpha0 <- runif(I,-1,1)

#simulating sparse factor priors
LF <- matrix(0,nrow=I,ncol=Q)
for(k in 1:2) {LF[,k] <- rmvnorm(1, mean = rep(0,I))}
L <- matrix(0,nrow=J,ncol=Q)
for(k in 1:2) {L[,k] <- rmvnorm(1, mean = rep(0,J))} 

for(i in 1:I) {
  for(j in 1:J) {
    for(q in 1:Q) {
       log_lambda[i,j] <- exp(LF[i,q]%*%t(L[j,q))
       y[i,j] <- rpois(1,log_lambda[i,j])
 }}}

#prepare data for stan
yy <- reshape2::melt(y)
colnames(yy) <- c("rows","cols","counts")
row_id <- yy$rows
col_id <- yy$cols
counts <- yy$counts

#run stan
stan_d <- list(I = I, J = J, N = N, Q = Q, counts = counts, row_id = row_id, col_id = col_id)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
m_mod <- stan_model("model.stan")
m_fit <- sampling(object = m_mod, data = stan_d, chains = 1, iter = 2000, 
                  pars = c(""log_lambda","L","LV","phi","tau","delta"), init_r = .1)

Any advice is highly appreciated.


#2

Looks like the scale parameter for one of the normal distributions is initialized to be negative. How do things look after declaring the support of parameters that must be positive by specifying a lower bound? e.g.,

parameters {
  ...
  matrix<lower = 0>[J, Q] phi;
  vector<lower = 0>[Q] delta;
  ...
}
transformed parameters {
  ...
  vector<lower = 0>[Q] tau;
   ...

Other things that you could improve (though unrelated to that initialization error) relate to vectorization. For instance:

model {
  to_vector(LF) ~ normal(0, 1);

  for (q in 1:Q)
    L[, q] ~ normal(0, phi[, q] * tau[q]);

  to_vector(phi) ~ gamma(3/2, 3/2);
  delta ~ gamma(50, 1);

  counts ~ poisson_log(log_lambda);
}


#3

Many thanks @mbjoseph! That seems to do the trick!


#4

One more thing that I just noticed - also unrelated to the initialization error - Stan parameterizes the normal distribution in terms of its mean and standard deviation, unlike BUGS which uses the precision - so you’ll also need to convert the precision (phi[j,q]*tau[q]) to a standard deviation.


#5

@mbjoseph Yes, that’s right! So for Stan it should be for (q in 1:Q) L[,q] ~ normal(0, 1/(phi[,q] * tau[q])); as the original formulation by Bhattacharya & Dunson is, really, in terms of precision. That may explain why I don’t get expected results.


#6

Actually, it should be

for (q in 1:Q) L[,q] ~ normal(0, 1/sqrt((phi[,q] * tau[q])))


#8

Hello!

I tried last week to fit such a model (with some other parameters, like a hierarchical intercept per row), but keeping the usual constraints of factor analysis (0 on upper diag of loadings and positive diagonal of loading matrix). If I understand well, Bhattacharya & Dunson argue that identifiability is not necessary for most applications of factor models, like covariance estimation. Ohma, the problem with this approach is that we loose a lot of diagnostic power, which is important for very complicated/diffuse models like these ones.

Fitting the following model, I had a lot of divergent transitions and BFMI warnings. Moreover, the shrinkage were unsufficient : I simulated 3 non-zero loadings, but the model estimated at least 7 non-zero loading. However, the covariance were not too badly approximated.

We have to keep in mind a major difference between Gibbs sampler proposed by the authors and stan. The sampler described in the paper remove columns falling under a determined threshold, and reimplement them randomly to ensure their “zeroness”. This principle cannot be implemented in stan, and I guess we have to find an efficient trick to replace it.

I abandonned this approach and started wondering about implementing horseshoe-like priors for the loadings, but I don’t know what is the better way to increase the shrinkage intensity with column indices. I thought about sampling the parameter of the local shrinkage cauchy distribution from another cauchy distribution with mu = 1/d, d being the column index, but I doubt such a rude method could work. @avehtari could certainly have some ideas!

data{
  int N; // sample size
  int S; // number of species
  int D; // number of factors
  int<lower=0> Y[N,S]; // data matrix of order [N,S]
}
transformed data{
  int<lower=1> M;
  vector[D] FS_mu; // factor means
  vector<lower=0>[D] FS_sd; // factor standard deviations
  M = D*(S-D)+ D*(D-1)/2;  // number of lower-triangular, non-zero loadings
  for (m in 1:D) {
    FS_mu[m] = 0; //Mean of factors = 0
    FS_sd[m] = 1; //Sd of factors = 1
  }
}
parameters{
  //Parameters
  real alpha; //Global intercept
  vector[N] d0_raw; //Uncentered row intercepts
  vector[M] L_lower; //Centered lower diagonal loadings
  vector<lower=0>[D] L_diag; //Centered positive diagonal elements of loadings
  matrix[N,D] FS; //Factor scores, matrix of order [N,D]
  cholesky_factor_corr[D] Rho; //Correlation matrix between factors
  real<lower=2> a1; //Hyperprior for first delta
  real<lower=3> a2; //Hyperprior for other deltas
  //Hyperparameters
  real<lower=0, upper=pi()/2> sigma_d_unif; //Uncentered sd of the row intercepts
  //Shrinkage parameters
  vector<lower=0>[M+D] phi_jh; //Local shrinkage parameter
  vector<lower=0>[D] delta_j; //Elements of the factor specific shrinkage priors
}
transformed parameters{
  // Final parameters
  vector[N] d0; //Final row intercepts
  cholesky_factor_cov[S,D] L; //Final matrix of laodings
  matrix[D,D] Ld; // cholesky decomposition of the covariance matrix between factors
  // Final hyperparameters
  real sigma_d; //Final sd of the row intercepts
  // Final shrinkage priors
  vector<lower=0>[D] tau_j; //Final factor specific shrinkage prior
  //Predictors
  matrix[N,S] Ups; //intermediate predictor
  matrix<lower=0>[N,S] Mu; //predictor
  
  // Compute the final hyperparameters
  sigma_d = 0 + 1 * tan(sigma_d_unif); //sigma_d ~ Halfcauchy(0,2.5)
  
  //Compute the final parameters
  d0 = 0 + sigma_d * d0_raw; //do ~ Normal(0, sigma_d)
  
  //Compute final global shrinkage prior
  for(d in 1:D) tau_j[d] = prod(delta_j[1:d]); //Compute the product of gamma distributed deltas
  
  // Correlation matrix of factors
  Ld = diag_pre_multiply(FS_sd, Rho); //Fs_sd fixed to 1, Rho estimated
  
  {
    int idx2; //Index for the lower diagonal loadings
    idx2 = 0;
    
    // Constraints to allow identifiability of loadings
  	 for(i in 1:(D-1)) { for(j in (i+1):(D)){ L[i,j] = 0; } } //0 on upper diagonal
  	 for(i in 1:D) L[i,i] = L_diag[i]; //Positive values on diagonal
  	 for(j in 1:D) {
  	   for(i in (j+1):S) {
  	     idx2 = idx2+1;
  	     L[i,j] = L_lower[idx2]; //Insert lower diagonal values in loadings matrix
  	   }
  	 }
  }
  
  // Predictors
  Ups = FS * L';
  for(n in 1:N) Mu[n] = exp(alpha + d0[n] + Ups[n]);
  
}
model{
  // Uncentered hyperpriors : the sampling will be automaticaly done as if they were defined on an uniform distribution between 0 or -pi/2 and pi/2 (see constraints)
  
  //Shrinkage priors
  {
    int idx3;
    idx3 = 0;
    for(d in 1:D){
      // Shrinkage prior for lower diagonal loadings
      for(m in (idx3+1):(d*(S-d)+ d*(d-1)/2)){
        L_lower[m] ~ normal(0, (1/phi_jh[m])*(1/tau_j[d]));
        idx3 = (d*(S-d)+ d*(d-1)/2);
     }
     // Shrinkage prior for diagonal loadings
     target += (D - d) * log(L_diag[d]) - .5 * L_diag[d] ^ 2 /
     ( (1/phi_jh[M+d]) * (1/tau_j[d]) );
    }
  }

  //Shrinkage hyperpriors
  phi_jh ~ gamma(3/2, 3/2);
  delta_j[1] ~ gamma(3,1);
  delta_j[2:D] ~ gamma(4,1);
  a1 ~ gamma(2,1);
  a2 ~ gamma(2,1);

  // Priors
  alpha ~ student_t(3,0,5); //Weakly informative prior for global intercept
  d0_raw ~ normal(0,1); //Uncentered regularizing prior for row deviations
  Rho ~ lkj_corr_cholesky(1); //Uninformative prior for Rho
  
  //Compute the likelihood
  for(i in 1:N){	
    Y[i,] ~ poisson(Mu[i,]);
    FS[i,] ~ multi_normal_cholesky(FS_mu, Ld);
  }
}
generated quantities{
  matrix[S,S] cov_L;
  matrix[S,S] cor_L;
  matrix[N,S] Y_pred;
  matrix[N,S] log_lik1;
  vector[N*S] log_lik;
  
  cov_L = multiply_lower_tri_self_transpose(L); //Create the covariance matrix
  
  // Compute the correlation matrix from de covariance matrix
  for(i in 1:S){
    for(j in 1:S){
      cor_L[i,j] = cov_L[i,j]/sqrt(cov_L[i,i]*cov_L[j,j]);
    }
  }
  
  //Compute the likelihood and predictions for each observation
  for(n in 1:N){
    for(s in 1:S){
      log_lik1[n,s] = poisson_lpmf(Y[n,s] | Mu[n,s]);
      Y_pred[n,s] = poisson_rng(Mu[n,s]);
    }
  }
  log_lik = to_vector(log_lik1); //Tranform in a vector usable by loo package
}

The data generating process and a detailed explanation of a similar code can be found here. The code is a little bit complicated because lower diag loadings are stored in a vector, and we have to select the ones corresponding to each columns (loop trick in the model part). For loadings, I tried to use the prior proposed by Leung & Drton (2014). This prior should “maintain invariance property under reordering variables”

I would be happy to continue the discussion, especially if we are able to find a reliable solution!

Lucas


#9

which is what happens when there are an infinite number of combinations of parameters that yield the same covariance matrix.

That paper is pre-Stan but in order to fit a model in Stan, you need to have a reasonably well-behaved posterior log-density kernel function.

Think about a generative process for the phenomenon of interest. If there is more than one generative model that yields the same prior predictive distribution, then you haven’t thought about the generative process for the phenomenon of interest long enough.


#10

I’m not sure of my understanding, but when I’m fitting this model, I assume that the covariance between poisson distributed abundances are the result of a limited number of unmeasured latent variables determining abundances of each species (loadings), the number of latent variables being unknown. I imagine the challenge is to define a data generating process leading to a unique posterior predictive distribution by identifying the effective number of unobserved variables.

I just realized I made a mistake in the code, the model part should be as follow, with finer constraints on a1 and a2 :

  //Shrinkage hyperpriors
  phi_jh ~ gamma(3/2, 3/2);
  delta_j[1] ~ gamma(a1,1);
  delta_j[2:D] ~ gamma(a2,1);
  a1 ~ gamma(2,1);
  a2 ~ gamma(2,1);

#11

simulated 3 non-zero loadings, but the model estimated at least 7 non-zero loading.

This sounds weird. How many factors did you define from the beginning? The thing is that you in practice still need to define the number of factors.

You shouldn’t have any corner constraints on the loadings—your constraints are the sparse priors.


#12

Oh @bgoodri, I just saw on your personal page that you are working exactly on that subject, I will read some of your papers ;)


#13

I haven’t worked on that stuff in a long time. And when I was working on it, Stan didn’t exist. But I did formulate an optimization problem with a unique optimum (that corresponded under some conditions to a — not necessarily unique — solution that minimized the number of factors). There is a similar principle here: If you are going to contemplate an excessive number of factors, you have to marginalize out a bunch of stuff in order to obtain a posterior distribution that Stan can sample from.


#14

I found a related article who seems to be interesting. They propose a spike-and-slabe prior to determine the remaining cross-loadings after setting the constraints for identification.


#15

You can’t really do a full spike-and-slab in Stan because it’s combinatorially intractable. But you can get close with things like the horseshoe prior.


#16

Thanks! I have horseshoe prior in mind for a while, but I have not thought long enough to determine how to set up a regularized horseshoe, especially if the slab have to increase with column indices. I imagine a too restrictive prior on non-zero loadings might mess up the estimation.

@jpiironen & @avehtari paper demonstrated that regularized horseshoe prior should be globally better, in term of estimation efficiency, than a sclassical horseshoe, but with more demanding tuning. However, given the inherent complexity of factor analysis, I am not advanced enough to propose a good tuning a priori. I might be better to start with a classical horseshoe and to explore the trouble it will surely cause!