Constraining a latent factor model / Bayesian probabilistic matrix factorization to remove multimodality

Hi!

I am still working with the data from my previous question:

In summary, my data is a vector Y of Bernoulli variables linked to a parameter matrix theta such that P(Y[v] = 1) = Phi(theta[jj[v],ll[v]] - theta[jj[v],rr[v]]), where ll and rr index the item on the left or the right respectively and jj indexes the respondent. In my previous question I was trying to infer a univariate normal distribution for each column of theta.

Now I am making things much harder for myself by attempting to cluster items and respondents using low-rank matrix factorization. I am aware of the non-identifiability/multimodality problem, and I am a novice when it comes to solving it, but I am trying anyway. I have been studying Rick Farouni's great posts on identification and latent factor analysis, and Michael Betancourt's post on identifiability in mixture models.

I first tried ordering the priors on the latent factors, but it was obvious that this was not sufficient. Next, by analogy to Farouni's approach, I am constraining a square block of each factor matrix to be lower triangular with a positive diagonal and initializing the chains from the MLE estimate, but different chains are still heading to different modes and NUTS is having trouble.

Is there anything else worth trying, or is the model too pathological?
If it's too pathological then I might give Farouni's LF model a shake.
Thanks for your help!

Edit: This morning I'm thinking about the attempt below. My intuition is that 1. constraining both U and W doesn't provide much over just constraining W, and 2. I likely need some constraints on the priors.

My code is below, and an R script to simulate data and fit the model is attached: simulate_LF.R (2.7 KB)

data{
// number of latent factors
int<lower=1> R;

// number of respondents
int<lower=R-1> J;

// number of rows / responses
int<lower=1> V;

 // number of items
 int<lower=R-1> K;

int<lower=1, upper=K> ll[V];

int<lower=1, upper=K> rr[V];

int<lower=1, upper=J> jj[V]; 

// recorded votes 1 / 0
int<lower=0,upper=1>  y_vec[V];


// hyperprior for the distribution of lambda_U and lambda_V
// Salakhutdinov & Mnih use df=V0=R, sigma=I
 cov_matrix[R] W0;

// hyperprior for the mean of lambda_U and lambda_V
// Salakhutdinov & Mnih use 0
vector<lower=0>[R] mu0;
}

transformed data {
   real<lower=0> V0;
   cov_matrix[R] I;
   vector[R] zero_vec;

   int L_entries;
   int L_lengths[R+1];

   zero_vec = rep_vector(0,R);
   V0 = R;
   L_entries = R*(R-1)/2;
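   // L_lengths[r] is the cumulative count of strictly off-diagonal entries in
   // rows/columns 1..r-1, so L_lengths[r]+1 : L_lengths[r+1] picks out the R-r
   // free entries belonging to row/column r of a triangular matrix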

   for(r in 1:R){
     I[:,r] = rep_vector(0,R);
     I[r,r] = 1;
     L_lengths[r] = (r-1)*R-(r-1)*r/2;
   }
   L_lengths[R+1] = L_entries;
}

 parameters{
   vector<lower=0>[R] c_U;
   vector<lower=0>[R] c_W;

  vector[L_entries] z_U;
  vector[L_entries] z_W;

  // priors for the factor means are ordered to force identifiability

  vector[R] mu_U_raw;
  vector[R] mu_W_raw;
  row_vector[R] U_raw[J-R];
  row_vector[L_entries] U_raw_lower_tri;
  vector<lower=0>[R] U_diag;

  // try constraining W to have a triangle
  vector[R] W_raw[K-R];
  vector[L_entries] W_raw_lower_tri;
  vector<lower=0>[R] W_diag;

  matrix[J,K] theta_raw; 
}

transformed parameters{
  matrix[R,R] L_U;
  matrix[R,R] L_W;

  matrix[R,R] W_sqr;
  matrix[R,R] U_sqr;

  row_vector[R] U[J];
  vector[R] W[K];

  vector[R] mu_U;
  vector[R] mu_W;
  matrix[J,K] theta;

  // fill in the L matrices
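  // (note: U_sqr is filled as an upper-triangular block with positive diagonal,
  //  since U enters as row vectors; W_sqr as a lower-triangular block)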
  for (r in 1:R){

    U_sqr[r,1:r-1] = rep_row_vector(0,r-1);
    U_sqr[r,r] = U_diag[r];
    U_sqr[r,r+1:R] = U_raw_lower_tri[L_lengths[r] + 1: L_lengths[r+1]];

    W_sqr[1:r-1,r] = rep_vector(0,r-1);
    W_sqr[r,r] = W_diag[r];
    W_sqr[r+1:R,r] = W_raw_lower_tri[L_lengths[r] + 1: L_lengths[r+1]];

    L_U[1:r-1,r] = rep_vector(0,r-1);
    L_U[r,r] = sqrt(c_U[r]);
    L_U[r+1:R,r] = z_U[L_lengths[r] + 1: L_lengths[r+1]];

    L_W[1:r-1,r] = rep_vector(0,r-1); 
    L_W[r,r] = sqrt(c_W[r]);
    L_W[r+1:R,r] = z_W[L_lengths[r] + 1: L_lengths[r+1]];
  }

  mu_U = mu0 + L_U * mu_U_raw;
  mu_W = mu0 + L_W * mu_W_raw;

  for (k in 1:K-R)  W[k] = mu_W + L_W*W_raw[k];

  for (j in 1:J-R) U[j] = mu_U' + U_raw[j]*L_U;


  for (r in 1:R){
    W[K-R+r] = mu_W + L_W*W_sqr[1:R,r];
    U[J-R+r] = mu_U' + U_sqr[r,1:R]*L_U;
  }

  for (j in 1:J) {
     for (k in 1:K) theta[j,k] = U[j] * W[k] + theta_raw[j,k];
  }
}


model{
  // TODO: consider using LKJ priors (Stan-recommended) instead of normal-Wishart
  vector [V] p;

  // z and c are the prior on the lower entries of the cholesky matrix
  z_U ~ normal(0,1);
  z_W ~ normal(0,1);

 for (r in 1:R){
    c_U[r] ~ chi_square(V0 + 2 - r + 1);
    c_W[r] ~ chi_square(V0 + 2 - r + 1);
  }

  //mu_X_raw is the prior on mu_X
  mu_U_raw ~ normal(0,1);
  mu_W_raw ~ normal(0,1);

  for(j in 1:J-R) U_raw[j] ~ normal(0,1);
  for(k in 1:K-R)  W_raw[k] ~ normal(0,1);

  W_raw_lower_tri ~ normal(0,1);
  W_diag ~ normal(0,1);

  U_raw_lower_tri ~ normal(0,1);
  U_diag ~ normal(0,1);

  for(j in 1:J) theta_raw[j] ~ normal(0, 1);

  for (v in 1:V)  p[v] = Phi_approx(theta[jj[v],ll[v]] - theta[jj[v],rr[v]]);
  y_vec ~ bernoulli(p);

}

What I have been doing is using no constraints and post-hoc orthogonal Procrustes transformations on the samples. The lack of constraints means you don't have biases in your prior, but you don't get valid MCMC convergence measures.

https://www.ifw-members.ifw-kiel.de/publications/bayesian-analysis-of-dynamic-factor-models-an-ex-post-approach-towards-the-rotation-problem/KWP1902.pdf

wopt.r (871 Bytes)


This constraint ought to be sufficient up to a sign flip (scaling by -1) of the factors and a corresponding sign flip of the loadings. On one hand, this is not likely to be a big deal. On the other, you can get rid of it by constraining one loading to be positive for each factor.

Thanks! This seems like something I would do in the future. For now I'm content with the ordering bias. I can fit multiple chains on different orderings to get a sense of how bad the bias is.

Hmmm. Maybe I have another problem then.

I'll try adding a lower=0 constraint to a column of the loading matrix.

Thank you!

That is overly restrictive and inadequate; you only need one positively-constrained loading per factor.
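For concreteness, one way this could look in Stan (a minimal sketch with illustrative names: the r-th item's loading on factor r is declared positive, everything else is left free, and there are no zero constraints):

parameters {
  vector<lower=0>[R] lambda_pos;   // one positively-constrained loading per factor
  vector[K*R - R] lambda_rest;     // all remaining loadings, unconstrained
}

transformed parameters {
  matrix[K, R] Lambda;             // K x R loading matrix
  {
    int idx;
    idx = 1;
    for (r in 1:R) {
      for (k in 1:K) {
        if (k == r) {
          Lambda[k, r] = lambda_pos[r];
        } else {
          Lambda[k, r] = lambda_rest[idx];
          idx = idx + 1;
        }
      }
    }
  }
}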

Oops, sorry for getting hasty. So the diagonal elements of the loading matrix need to be positive (one per factor). I guess I'm confused, since when I said "each factor matrix" I meant to refer to both the factor matrix and the loading matrix.

To clarify, the difference between what I'm trying to do and Farouni's tutorial is that I am not factoring out the factor matrix. In my mind this makes the difference between 'matrix factorization' and 'latent factor analysis.'

I am now working with what seems to be a straightforward extension of Farouni's model to estimate probit responses.
Using simulated data I am able to recover the true mean, but not the loading matrix. I have divergent transitions and low E-BFMI warnings. Any help is greatly appreciated!

Edit: I've been staring at my pairs plots, looking for funnels. Are there any other strategies I could use?

LF.stan (1.9 KB)
simulate_LF.R (2.9 KB)

Not a direct solution to your problem, but related:
I found the implementation of the Leung & Drton (2006) approach to Bayesian factor analysis by @mbjoseph, here, to work better.

The associated Google Groups discussion of latent factor models is here.

@Guido_Biele
Thanks for sharing. The article was quite helpful, but the model does not seem to be working for me. I think I now understand my problem better, and why what I am trying to do may be very difficult.

In standard factor analysis one sets
y ~ multi_normal_cholesky(zero, L_Sigma);

where L_Sigma is the Cholesky factor of the covariance matrix implied by the constrained loading matrix, and y is the data.
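In Farouni's setup that covariance would be built along the lines of (a sketch; here Lambda is the constrained K x R loading matrix and psi the vector of uniquenesses):

L_Sigma = cholesky_decompose(tcrossprod(Lambda) + diag_matrix(psi));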

In my case I am trying to put latent factors inside a Bayesian Thurstone-Mosteller model, so:

theta ~ multi_normal_cholesky(zero, L_Q);
for (v in 1:V)  p[v] = Phi_approx(theta[jj[v],ll[v]] - theta[jj[v],rr[v]]);
 y ~ bernoulli(p);

Since theta depends on Y, Leung & Drton's assumption that theta is ball-normal is violated.

So I'm back to the drawing board.

I found a forthcoming article (cited in @vsaase's article) that uses parameter expansions to guarantee the model is identified under the likelihood. Constraining priors (like Leung & Drton's trick) are not required for identifiability. Even better, there is no need to fix elements of the factors to 0, so the model is order independent.

Their specification is a bit hairy and they have not yet provided code, so I'm going to take my time and think things through before I take a stab at implementing it. I wish I knew enough to prove that it will work in my setup.

Do you think that might work or is the Thurstone-Mosteller setup causing more problems than I imagine?

I don't know what you mean by the Thurstone-Mosteller setup. But have you tried using non-centered parameterizations for the hierarchical components of the model, like L_t?
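For reference, a minimal sketch of a non-centered parameterization for a hierarchical vector-valued parameter (illustrative names, not taken from the model above):

parameters {
  vector[R] mu_f;              // hierarchical location
  vector<lower=0>[R] tau_f;    // hierarchical scale
  vector[R] eta[K];            // standardized innovations
}

transformed parameters {
  vector[R] b[K];
  for (k in 1:K)
    b[k] = mu_f + tau_f .* eta[k];   // implies b[k] ~ normal(mu_f, tau_f)
}

model {
  mu_f ~ normal(0, 1);
  tau_f ~ normal(0, 1);              // half-normal via the lower=0 constraint
  for (k in 1:K)
    eta[k] ~ normal(0, 1);           // the prior goes on the standardized scale
}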

Hi @Bob_Carpenter,

Thank you for the response! Using non-centered parameterizations seems to work for the Leung model.
I still have some divergent transitions, but increasing adapt_delta seems to get rid of them.

You must tire of reminding everyone to do the Matt trick! I promise that I am learning.

I ran into a different problem with the Chan model I linked to above.

transformed parameters{
 matrix[R,R] V_F_inv;
 matrix[R,R] V_L_inv;
 matrix[R,R] V_F_chol;
 matrix[R,R] V_L_chol;
 matrix[J,K] theta;
 matrix[J,R] F;
 matrix[R,K] L;  

V_F_inv = I_r + tcrossprod(L);
V_L_inv = crossprod(F);

V_F_chol = cholesky_decompose(V_F_inv);
V_L_chol = cholesky_decompose(V_L_inv);

for (j in 1:J) F[j,1:R] = mdivide_right_tri_low(F_raw[j],V_F_chol);
for (k in 1:K) L[1:R,k] = mdivide_left_tri_low(V_L_chol,L_raw[k]);

theta =  F * L + theta_raw*diag_pre_multiply(sigma,Lcorr);
}

The problem is that L depends on F and F depends on L, so I cannot sample either variable. Is there any way to solve this kind of mutual dependence?

Have you seen section 4 of this paper? https://economics.sas.upenn.edu/system/files/event_papers/Rockova%20and%20George%202014%20Factor%20Analysis.pdf

Thank you for the response! This seems like a very cool paper that may be applicable to my project. It is a bit advanced for me to grasp quickly, so it would help me if you could explain more about why you think it is relevant to my problem. Are you suggesting that I could take an EM-inspired approach and alternate sampling for L and F from one iteration to the next? If I were implementing my own Gibbs sampler that is what I would do, but I am worried that hacking Stan to do this might break Stan's diagnostics. It seems like Stan assumes every parameter is sampled every iteration. Could I put everything in the model block?

I was referring more to the prior they use to induce sparsity. Basically, putting a sparsity-inducing prior on the factor loadings.

They use spike-and-slab, but that won't work in Stan because the log density needs to be differentiable. You could consider a Cauchy prior on the factor loadings with a variable scale parameter. This will have the effect of automatically "rotating" the factors towards an interpretable solution.
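A rough sketch of what that could look like (illustrative names; one estimated scale per factor, with an assumed half-normal hyperprior on the scales):

parameters {
  matrix[K, R] Lambda;              // unconstrained factor loadings
  vector<lower=0>[R] lambda_scale;  // estimated scale per factor
}

model {
  lambda_scale ~ normal(0, 1);      // half-normal hyperprior (an assumption here)
  for (r in 1:R)
    col(Lambda, r) ~ cauchy(0, lambda_scale[r]);
}

The heavy Cauchy tails let a few loadings stay large while the estimated scales shrink the rest towards zero.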

It'd be nice if we did it automatically, or could at least provide some helpful automatic hints. Until then, it's really our problem as much as the users'; we can't expect most users to read hundreds of pages of doc or follow our Discourse discussions.

If you can't write down the joint density, you can't code it in Stan.

If I_r is a diagonal matrix, this is very inefficient. You want a loop so as not to get all the zeros involved.
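For example, something like (a sketch):

V_F_inv = tcrossprod(L);
for (r in 1:R)
  V_F_inv[r, r] = V_F_inv[r, r] + 1;   // add the identity without building I_r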

This won't work because F and L aren't defined before you use them. What were you thinking or hoping this would do?

You can't really induce sparsity in a Bayesian posterior, because sparsity is trying to push you to a measure-zero maximum likelihood solution. At least that's how something like an L1-regularized regression works.

Understood. I guess I have to write my own sampler if I want to use the Chan model.

I knew that this doesn't work.

I guess what I would like is a way to tell Stan to take alternating samples. In this example, odd iterations could sample L, leaving F fixed (on the first iteration F starts at an initial value), and even iterations could sample F, leaving L fixed. I don't know enough about HMC to know whether something like this is strictly unworkable or merely requires major changes.

It seems like quite a few people are working on "sparsity-inducing priors" (e.g. https://arxiv.org/pdf/1610.05559.pdf).

In any case, I now have a workable solution using the Leung et al. model. I would like to try sparsity-inducing priors, but I'm still working out whether that would violate Leung et al.'s assumptions that provide order invariance.
For now I'm using post-hoc varimax rotation for interpretability.

Thanks everybody!

Sorry, I should have been more precise in my language. I meant a sparsity-encouraging prior, perhaps, i.e. a prior that encourages many coefficients to be small whilst a few coefficients are large. Agreed that it won't push the coefficients to be exactly zero.

@groceryheist
Sorry for my late reply! I couldn't access more than the abstract of the forthcoming article, but I am also not sure my math would have been strong enough to provide a satisfying answer.

Here's a copy: invarient_inverence_latent_factor.pdf (447.5 KB)

Most of the hard math goes into explaining why their innovation works. The trick itself is not that complicated, but based on the discussion above about the interdependency of L and F, it may not be possible to implement in Stan.