Incorrectly bimodal posterior

Hi,

I am fitting a factor analysis model with a 1-dimensional latent variable z. Specifically, the observable vector x=(x1, x2, x3, x4) is generated as follows: x = Az + epsilon, with z ~ N(0, 1), epsilon ~ N(0, D). The parameters are the vector A=(A1, A2, A3, A4) and the diagonal covariance matrix D. I restrict A1 and A2 to positive values for the sake of identification.

I am pretty sure this model is identifiable, and have checked that it fulfills the identification criteria here. But I am running into computational issues where the constrained parameters A1 and A2 have a mode at 0 as well as at their true value, and the unconstrained A values have peaks at both their true value and the negation of their true value. It appears that some chains are stuck in a space where all of the values are negated, and they continue trying to move closer to the -A1 and -A2 values but get stuck at 0 because of the lower bound I’ve set.

To fix the issue, I have tried adding a strictly positive lower bound on A1 and A2 (e.g. A1 > 0.5), as well as using a gamma(4, 2) prior on A1 and A2, to push the model away from the region where these values are 0. This has helped, but some of my runs still get stuck in the bad mode. Are there any approaches involving different sampling settings, priors, or initialization that I could try here?

Thank you!

Can you show your Stan model?

While various constraints can be used to avoid some of the multimodality that sign/label switching induces in factor models, the likelihood is often still genuinely multimodal, with many local modes. More data and/or tighter priors can sometimes help, but latent factor models (and mixture models more generally) are just really difficult to estimate in a fully Bayesian setting. The multimodality is also present in maximum likelihood approaches, but there you might get away with it more easily because you are only looking for a single, hopefully global, mode (although the uncertainty estimates are then typically still off, since you are ignoring the other modes).


Here is my Stan model:

data {
    int<lower=1> N;         // number of observations
    int<lower=1> x_n;           // dimension of observations (size of x)
    int x_n_constrained;        // number of x dimensions that are constrained
    matrix[N, x_n] X;       // observed features (symptoms)
}

parameters {
    vector[N] z;                // latent variable value for each observation
    vector<lower=0.5>[x_n_constrained] A_constrained;   // loadings constrained to be positive (A1, A2)
    vector[x_n - x_n_constrained] A_unconstrained;       // remaining, unconstrained loadings

    real mu_z;                  // mean of the latent variable
    real<lower=0> sigma_z;      // standard deviation of the latent variable
    vector<lower=0>[x_n] sigma_eps;   // residual standard deviations
}

model {
    A_constrained ~ gamma(4, 2);
    A_unconstrained ~ normal(0, 1);
    
    mu_z ~ normal(2, 1);
    sigma_z ~ normal(1, 1) T[0,];
    sigma_eps ~ normal(0, 0.5) T[0,];

    z ~ normal(mu_z, sigma_z);
    
    for (n in 1:N){
        for (i in 1:x_n) {
            if (i <= x_n_constrained) {
                X[n, i] ~ normal(A_constrained[i] * z[n], sigma_eps[i]);
            } else {
                X[n, i] ~ normal(A_unconstrained[i - x_n_constrained] * z[n], sigma_eps[i]);
            }
        }
    }
}

Welcome to the Stan discourse.

There are a number of little tweaks here and there. But three things stand out to me.

First, you’re not estimating separate means for each indicator. If the indicators have very different means, then the factor structure might be battling the mean structure.

Second, why not try a more conventional constraint, like setting the first loading to 1? You can always transform the loading matrix after the fact.

Third, you’re setting up a situation where you can have a likelihood of infinity. Imagine your latent variable Z perfectly aligns with one of the observed variables X and completely ignores the others. Then the indicator standard deviation will move towards zero for that particular variable and cause an estimation problem. You could constrain sigma_eps to be larger than some arbitrarily small positive number, which would be relatively simple.
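
One simple way to implement that lower bound on sigma_eps is in the parameters block (the 0.001 floor here is an arbitrary illustrative choice):

    vector<lower=0.001>[x_n] sigma_eps;   // residual SDs bounded away from zero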

Thank you for these suggestions! To each of the points you mention:

  1. I wanted to clarify that although the A values (within the constrained and unconstrained categories) are drawn from the same prior distribution, that is an assumption of the data-generating process, and I still learn a separate estimate for each individual A value. Does this make sense?

  2. Could you explain this suggestion a bit more? Why would we expect it to help? And is there a way to use the approach if I am fitting the model on real data where I do not know what the true value of the loading matrix should be?

  3. I don’t actually think this would lead to a likelihood of infinity. In particular, I believe the factor analysis likelihood can be written purely as a function of the parameters (as is done here), if I am understanding it correctly?

If you haven’t already, you might take a look at a general introduction to factor analysis. I get the sense that you are bumping up against some foundational challenges (constraints; mean vs. covariance structure; rotations) that will only compound as you build more complex models. As you’ve already discovered, they get even more complex in a Bayesian/MCMC setting. You might take a look at Kline’s book, which is not exactly an easy read but is at least short and to the point. UCLA also has a comprehensive introduction.

Per point 1: It sounds like the lack of mean estimates shouldn’t be an issue in this case, though I don’t fully follow your clarification (no need to go into detail; I’m just saying I’m not quite parsing it). But applied to other problems, you will want to include parameters for the indicator means (with appropriate constraints). For example, if you were to add 100 to one of your variables, you would run into problems that are unrelated to the factor structure. You avoid this by specifying the mean structure.
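
As a sketch of what that mean structure could look like in your current parameterization (the intercept vector alpha is a hypothetical name, and its prior is only illustrative):

parameters {
    vector[x_n] alpha;               // per-indicator intercepts (means)
    // ... other parameters as before ...
}

model {
    alpha ~ normal(0, 5);            // weakly informative prior on the indicator means
    // ... other priors as before ...

    // each sampling statement then gains the intercept term, e.g.
    // X[n, i] ~ normal(alpha[i] + A_constrained[i] * z[n], sigma_eps[i]);
}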

Per point 2: this is a very common (maybe the most common) method of constraining the loadings. You’ll see it discussed in many introductions to factor analysis (e.g. here). As it relates to your current problem, it locks the loadings into place relative to the first variable. Your current approach estimates all loadings, with a positivity constraint on a subset of them. This is in theory identifiable with the addition of your priors on the mean and standard deviation of the latent variable, but in practice it can be hard to identify. In my experience, fixing the first loading to 1 is the constraint most likely to work in most situations.
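
Roughly, the change could look like this (a minimal sketch; A_free and A are placeholder names, and the first indicator is used as the reference):

parameters {
    vector[x_n - 1] A_free;          // loadings for indicators 2..x_n, freely estimated
    // ... other parameters as in your model ...
}

transformed parameters {
    vector[x_n] A;                   // full loading vector used in the likelihood
    A[1] = 1;                        // first loading fixed to 1 for identification
    A[2:x_n] = A_free;
}

The scale (and sign) of z is then tied to the first indicator and picked up by sigma_z, so the lower bounds on the other loadings are no longer needed.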

Per point 3: there is a key difference between your likelihood and the models described in those links, which has implications both for fitting the model and for efficiency. Your model directly estimates the value of the latent variable (z) for each observation and conditions on those values, modeling the data as conditionally independent.

The likelihoods discussed in the links use the “marginalized” multivariate normal approach. Rather than estimating the latent variables directly, they integrate z out and model the data with the implied covariance matrix (A A' + D in your notation). This is more efficient and avoids the infinite-likelihood problem I mentioned.
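
Just as a rough sketch of what that marginalized version can look like (this fixes Var(z) = 1 for identification, drops the per-observation z entirely, and uses placeholder priors rather than recommendations; you would still need a sign or reference constraint on the loadings):

data {
    int<lower=1> N;                        // number of observations
    int<lower=1> x_n;                      // dimension of observations
    matrix[N, x_n] X;                      // observed features
}

parameters {
    vector[x_n] mu;                        // indicator means
    vector[x_n] A;                         // loadings
    vector<lower=0>[x_n] sigma_eps;        // residual standard deviations
}

model {
    // implied covariance of x once z is integrated out: Sigma = A * A' + diag(sigma_eps^2)
    matrix[x_n, x_n] Sigma = A * A' + diag_matrix(square(sigma_eps));

    mu ~ normal(0, 5);
    A ~ normal(0, 1);
    sigma_eps ~ normal(0, 1);

    for (n in 1:N)
        X[n]' ~ multi_normal(mu, Sigma);
}

For numerical stability you would typically pass cholesky_decompose(Sigma) to multi_normal_cholesky instead, but the structure is the same.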

Just to elaborate on the infinite-likelihood point: imagine the model converged to a point where z_i = X_{i,1} and \sigma_1 = 0 (the first element of sigma_eps), that is, each observation i's estimated latent variable equals the first variable’s value and that variable’s residual standard deviation is 0. This would produce an infinite likelihood, regardless of the values of the other parameters.
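
Spelling that out with the normal density itself: N(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), so when \mu = x the density reduces to \frac{1}{\sigma\sqrt{2\pi}}, which grows without bound as \sigma \to 0.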

Thank you very much for the help!