Mixture of incrementally nested factor models

I brought this model up in another post, but thought it might be better to discuss it on its own terms.
A standard probabilistic PCA model with no bells or whistles for observations \lbrace\mathbf{y}_i\rbrace_{i=1}^N, K-dimensional latent factor loadings \lbrace\mathbf{z}_i\rbrace_{i=1}^N, and factor matrix \mathbf{A}\in\mathbb{R}^{D\times K}, takes the form

\mathbf{z}_i\sim \mathcal{N}(0,\mathbf{I})\\ [\mathbf{A}]_{ij}\sim \mathcal{N}(0,1)\\ \mathbf{y}_i\sim\mathcal{N}(\mathbf{A}\mathbf{z}_i,\Sigma)

Here, K is an important hyperparameter which controls the complexity of the model. We could generalize the model to include K as a random variable directly:

w\sim \mathbf{Dir}(\mathbf{1})\\ k\sim \mathbf{Cat}(w)\\ \mathbf{z}_i\sim \mathcal{N}(0,\mathbf{I})\\ [\mathbf{A}]_{ij}\sim \mathcal{N}(0,1)\\ \mathbf{y}_i\sim\mathcal{N}(\mathbf{A}_{:,1:k}\mathbf{z}_{i,1:k},\Sigma)

Note that all \mathbf{y}_i are drawn with respect to the same k, so they are not i.i.d. from a mixture distribution. Rather the mixture is on the dataset level. A problem with this design and mixtures in general (as pointed out by Aki and others) is that the component parameters are usually only informed by the posterior fraction of the data assigned to the component. This design is a bit different in that parameters are shared between the components, which should hopefully alleviate that issue.

An intuitively appealing part of this construction is that it mimics the iterative construction of PCA where one first finds the direction along which the variance is maximized, and then adds in components that complement it. Since for k=1 the first vector has to stand on its own, this should also help with identification of the model (permutation should not be a problem if the likelihood is otherwise well-defined).

Would this design work? Or is there a construction that is more suitable for this sort of design than the mixture (geometric mixture perhaps)? Is the prior on w suitable, or should it penalize complexity? That extra hierarchical layer is mostly for the sake of stan, as it allows us to marginalize over k.

Actually, how would you go about implementing this in Stan? I have been taking the easy disregard-the-normalization-constant philosophy of Stan a bit too seriously, so I only just realized that you cannot apply that trick in mixtures — the component densities need to be properly normalized.

But maybe it is a non-issue here; conditioned on A and z, the conditional p(\lbrace\mathbf{y}_i\rbrace|\mathbf{A},\lbrace\mathbf{z}\rbrace,\Sigma) (which is normal) should be trivially normalized, meaning that the mixture is well-formed, right?

As an aside, would it make sense to use a fixed uniform categorical for k, marginalize it out, and then calculate the posterior draws from p(k|y) in the generated quantities block? I’m thinking this would be more stable.

Right—you can’t use it anywhere you need the normalizing constants. Truncation is another place this comes up.

Sharing parameters indeed mitigates the issue in question for the shared parameters.

I don’t see why you couldn’t implement a model like this in Stan.

Here’s my attempt:

data {
    int<lower=1> N;
    int<lower=1> S;
    int<lower=1> K;
    matrix[S,N] y;

parameters {
    matrix[S,K] A_tilde;
    matrix[K,N] Z_tilde;
    real<lower=0> tau;
    matrix<lower=0>[S,K] Psi;

transformed parameters {
    matrix[S,K] A = A_tilde;
    matrix[K,N] Z = Z_tilde;
    vector[K] compllk;
    for (k in 1:K) {
        matrix[S,N] F = tau * A[,1:k] * Z[1:k,];
        compllk[k] = 0;
        for (n in 1:N) {
            compllk[k] += normal_lpdf(y[,n] | F[,n], Psi[,k]);

model {
    tau ~ normal(0,1);
    to_vector(A_tilde) ~ normal(0,1);
    to_vector(Z_tilde) ~ normal(0,1);
    to_vector(Psi) ~ normal(0,1);
    target += log_sum_exp(compllk - log(K));

generated quantities {
    vector[K] responsibility;
    responsibility =  exp(compllk-log_sum_exp(compllk));

I discarded the Dirichlet prior to get a more balanced model, made the noise covariance depend on k, and added an unorthodox variance parameter.

I get very odd sampling behaviour. Looking at the responsibility, all mass has collapsed until a single complexity level (k), but the k seems to depend on initialization. Same behaviour is realized when running MAP. This seems like unrealistic uncertainty quantification/credit assignment.

I wonder if that has to do with the fact that if you collapse to one responsibility, the other ones have no data to train with, and you default to the prior.

You want to add a prior on the mixture, but as is, you’re just assuming it’s uniform. I don’t see immediately how to prevent the collapse.

Also, can’t you just use A_tilde and Z_tilde? I don’t see why they’re getting copied into A and Z.

But if you collapse, you still share information about the shared parameters, so each component should be informed up to a degree. But noise and higher-level factors will of course fall back to the prior.

It seems reasonable to me that there should be configurations where two adjacent models, let’s say k=1 and k=2, both share responsibility. But if we then perturb the individual noise parameter of k=1 to a setting that doesn’t fit the data, responsibility would collapse onto k=2. So I think you might be right that there are spurious modes where responsibility collapses. So I guess it’s just a multiple modes problem.

What would be the behaviour if we instead of a mixture did a product of the likelihoods?

yeah, the tildes are unnecessary - they are just leftovers from previous versions.

I don’t understand the model well enough to understand where a product and mixture would be exchangeable—I think of them as an “and” and an “or” operation.

I also meant to mention that the main problem with complete collapse is that it’ll cause problems computationally with divide-by-zero, infinite values, etc.