Gaussian LDA

Hi! I’m trying to use Stan to fit a Gaussian LDA model. The situation is that I have a list of genes, where each gene has a different number of regions. Each region is characterized by a feature vector assumed to follow a multivariate normal distribution. The model is described in http://www.aclweb.org/old_anthology/P/P15/P15-1077.pdf.

My Stan code is attached, but unfortunately I got the error below. Any suggestion is greatly appreciated!

Stan code:

LDA_GMM_code = """
data{
    
// observations

    int<lower=2> K;     // Number of modules
    int<lower=2> V;     // Number of features (dimensions)
    int<lower=2> M;     // Number of genes
    int<lower=2> N;     // Number of regions for all genes
    
    vector[V] w[N];     // Observations (region x features)
    int<lower=1,upper=M> gene[N];    // gene ID for region n
    

// priors

    real alpha;  
    
    real nu;
    cov_matrix[V] W;
    vector[V] m;
    real beta;
}
    
    
transformed data{
    vector<lower=0>[K] alpha_vec;
    for (k in 1:K){
        alpha_vec[k] = alpha; //symmetric dirichlet prior
    }
}


parameters{
    simplex[K] theta[M];
    vector[V] mu[K];
    cov_matrix[V] lambda[K];
}
    
model{

// priors
    for (d in 1:M)
        theta[d] ~ dirichlet(alpha_vec);
    for (k in 1:K){
        lambda[k] ~ wishart(nu, inv(W));
        mu[k] ~ multi_normal_prec(m, beta * lambda[k]);
    }

// log marginal (summing over all z) probability of the inputs
    for (n in 1:N){
        real gamma[K];
        for (k in 1:K)
            gamma[k] = log(theta[gene[n],k]) + multi_normal_lpdf(w[n] | mu[k], inv(lambda[k]));
        target += log_sum_exp(gamma);   // marginalizes over the module assignment z on the log scale
    }
}
"""

Error:

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: wishart_lpdf: LDLT_Factor of scale parameter is not positive definite. last conditional variance is nan.  (in 'unkown file name' at line 46)

... Repeated

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: wishart_lpdf: LDLT_Factor of scale parameter is not positive definite. last conditional variance is nan.  (in 'unkown file name' at line 46)


Initialization between (-2, 2) failed after 100 attempts. 
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/api.py", line 395, in stan
    n_jobs=n_jobs, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/model.py", line 725, in sampling
    ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/model.py", line 86, in _map_parallel
    map_result = list(map(function, args))
  File "stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398.pyx", line 368, in stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398._call_sampler_star
  File "stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398.pyx", line 401, in stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398._call_sampler
RuntimeError: Initialization failed.

The data are simulated as follows:

import numpy as np
from numpy.random import dirichlet
from numpy.random import multinomial
from numpy.random import multivariate_normal
from scipy.stats import wishart

## topics for each document
alphai = 0.25  ## concentration of the symmetric Dirichlet prior

M = 20  ## number of genes
K = 4   ## number of clusters
Nd = np.random.choice(range(50,150),M)   ## number of regions in each gene
N = np.sum(Nd)   ## number of total regions
V = 20   ## dimension of features

theta = dirichlet(np.ones(K)*alphai, M)

## cluster assignment for each region
word_assginment_z = []
for d in xrange(M):
    word_assginment_z.append([multinomial(1, theta[d]) for t in xrange(Nd[d])])
    word_assginment_z[d] = np.array(word_assginment_z[d])
    
    
## distribution of mu and precision matrix for each module
nu0 = V + 5.0
W0 = np.identity(V) * 0.018 ## Tuned to avoid too large/small det(W0)
m0 = np.zeros(V)
beta0 = 1.0

Lambbda = wishart.rvs(df=nu0, scale=W0, size=K, random_state=0)
mu = [multivariate_normal(m0, np.linalg.inv(beta0 * Lambbda[k]), size=1) for k in xrange(K)]

    
    
## generate word values
words_values = []
for d in xrange(M):
    words_values.append([])
    for n in xrange(Nd[d]):
        zzz = np.where(word_assginment_z[d][n,:])[0][0]
        words_values[d].append(multivariate_normal(mu[zzz][0], np.linalg.inv(Lambbda[zzz]), size = 1)[0])
    words_values[d] = np.array(words_values[d])



gene_dat = {'alpha': alphai,
               'nu': nu0,
               'W': W0,
               'm': m0,
               'beta': beta0,
    
               'K': K,
               'V': V,
               'M': M,
               'N': np.sum(Nd),
               
               'w': [a for b in words_values for a in b],
               'gene': [t for i, j in zip(range(1,M+1), Nd) for t in np.repeat(i,j)]}
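
Assuming the standard PyStan 2 interface (which is what the traceback above uses), the fit call is roughly the following; the iteration and chain counts are arbitrary placeholders:

import pystan

# Compile and sample in one call; iter/chains are placeholders, not tuned values.
fit = pystan.stan(model_code=LDA_GMM_code, data=gene_dat, iter=1000, chains=2)
print(fit)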

The error is saying that your scale matrix isn’t positive definite. Am I correct in saying that this:

[quoted code block]

produces your scale matrix? If so, I don’t see why it should be positive definite. Perhaps try np.linalg.cholesky(words_values) after that block: it will return the Cholesky factor if words_values is positive definite and raise an error otherwise.
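
In general, a quick positive-definiteness check for any candidate matrix is a small helper like this (a minimal NumPy sketch):

import numpy as np

def is_pos_def(A):
    # A symmetric matrix is positive definite iff its Cholesky factorization succeeds.
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False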

Thanks for your help! The word-values matrix holds the observations, and the scale matrix is called inv(W).
The initial scale matrix is generated to be p.s.d. by:

cov_matrix[V] W;  // does defining it to be cov_matrix constrain it to be p.s.d.?
W = np.identity(V) * 0.018
lambda[k] ~ wishart(nu, W);

The precision matrix is generated by:

Lambbda = wishart.rvs(df=nu0, scale=W, size=K, random_state=0)

Is it possible that it becomes non-psd during the MCMC process?

Also, I’ll use the np.linalg.cholesky function to test. Thanks for the suggestion!

No need to test. I confused ‘W’ and ‘w’. It speaks to how little caffeine I’ve had today.

You say that you chose the factor 0.018 to avoid a problem with the size of the determinant of W0, but the determinant here is 0.018**20, which is extremely small and could be causing a precision error. Maybe make that factor closer to 1?
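
For a sense of scale (using V = 20 from the simulated data; 0.5 is just an arbitrary alternative factor):

import numpy as np

print(np.linalg.det(np.identity(20) * 0.018))  # 0.018**20, roughly 1.3e-35
print(np.linalg.det(np.identity(20) * 0.5))    # 0.5**20, roughly 9.5e-7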

cov_matrix should constrain your covariance matrices to always be covariance matrices. It’s also complaining about a NaN. I can’t see where it’d be running into a NaN unless there’s a precision problem.

Thanks for your help! After tuning the initialization of the covariance matrix, it starts working! However, I still get errors like:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: wishart_lpdf: LDLT_Factor of random variable is not positive definite. last conditional variance is nan.  (in 'unkown file name' at line 46)

If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

This message appears tens of times, and I’m a little confused. Does it mean that if the warning is caused by a constrained covariance matrix (which is the case here), I don’t need to worry much about it? But since it appears so many times, does it mean my model is misspecified?

Thanks!

It seems that your model is somewhat ill-conditioned. It looks like you’re trying to give the inverse of lambda[k] an inverse Wishart distribution in your prior. Is this correct? It might be better to change your model block to:

model{

// priors
    for (d in 1:M)
        theta[d] ~ dirichlet(alpha_vec);
    for (k in 1:K){
        inv_lambda[k] ~ inv_wishart(nu, W);
        mu[k] ~ multi_normal_prec(m, beta * inv(inv_lambda[k]));
    }

// log marginal (summing over all z) probability of the inputs
    for (n in 1:N){
        real gamma[K];
        for (k in 1:K)
            gamma[k] = log(theta[gene[n],k]) + multi_normal_lpdf(w[n] | mu[k], inv_lambda[k]);
        target += log_sum_exp(gamma);   // marginalizes over the module assignment z on the log scale
    }
}
"""

Oh, that’s a good catch! It seems to me that the Wishart prior needs fine tuning, and the variance of the data needs to be in a reasonable range for this model to run…

Better yet, don’t use a Wishart. I’d recommend following the advice in the regression chapter of the manual and using a scaled correlation matrix with a Cholesky-factor parameterization. It cites the Goodrich et al. paper comparing Wishart, inverse Wishart, and scaled LKJ priors.
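
A rough sketch of what that looks like for this model (not drop-in code; the priors on sigma and mu below are just illustrative placeholders):

parameters{
    simplex[K] theta[M];
    vector[V] mu[K];
    cholesky_factor_corr[V] L_Omega[K];   // Cholesky factor of each module's correlation matrix
    vector<lower=0>[V] sigma[K];          // per-dimension scales for each module
}
transformed parameters{
    matrix[V, V] L_Sigma[K];              // Cholesky factor of each module's covariance matrix
    for (k in 1:K)
        L_Sigma[k] = diag_pre_multiply(sigma[k], L_Omega[k]);
}
model{
    for (d in 1:M)
        theta[d] ~ dirichlet(alpha_vec);
    for (k in 1:K){
        L_Omega[k] ~ lkj_corr_cholesky(2);
        sigma[k] ~ cauchy(0, 2.5);                      // placeholder weakly informative scale prior
        mu[k] ~ multi_normal_cholesky(m, L_Sigma[k]);   // placeholder prior on the module means
    }
    for (n in 1:N){
        real gamma[K];
        for (k in 1:K)
            gamma[k] = log(theta[gene[n],k])
                       + multi_normal_cholesky_lpdf(w[n] | mu[k], L_Sigma[k]);
        target += log_sum_exp(gamma);   // marginalizes over the module assignment
    }
}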


Thanks a lot for the advice! Very helpful! I find the Wishart easily runs into numerical instability with variational inference because of the unavoidable matrix inversion step when updating the posterior.

Exactly! We could probably solve this problem by defining a Wishart on the Cholesky factors directly, but we haven’t done that because we never use Wishart priors. If someone else wants to add it, it’d be a great feature to have. Just not on our immediate to-do list.

I see. I was wondering why this model appears in various textbooks and review papers but is not widely used in applications. It may become more popular if more feasible approaches come out :)

Just replace the Wishart with a scaled LKJ prior.

The real problem with LDA models is that they’re intractably multimodal, so there’s no way to do full Bayesian inference on them.


Sorry, by “full Bayesian”, do you mean putting priors on the clusters’ parameters mu and Sigma, instead of assuming a constant Sigma for all clusters based on the empirical covariance matrix?

Full Bayes propagates estimation uncertainty through inference. We use posterior predictive inference for new observations y' given “training” data y:

p(y' | y ) = INTEGRAL_theta p(y' | theta) * p(theta | y) d.theta

where p(theta | y) is the usual Bayesian posterior.

What people in ML tend to do is find a maximum posterior mode,

theta* = ARGMAX_theta p(theta | y)

and then plug that in for approximate prediction as

p(y' | y) =approx= p(y' | theta*)

That’s only approximate Bayes, not full Bayes. It seriously underestimates uncertainty and leads to uncalibrated predictions.
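
Here is a toy numerical version of that difference, with a single normal mean standing in for theta (all numbers below are made up for illustration):

import numpy as np
from scipy.stats import norm

np.random.seed(0)
theta_draws = np.random.normal(1.0, 0.5, size=4000)   # stand-in for draws from p(theta | y)
theta_star = theta_draws.mean()                        # stand-in for a point estimate theta*

y_new = 2.5
p_full = np.mean(norm.pdf(y_new, loc=theta_draws, scale=1.0))  # E_theta[ p(y' | theta) ]
p_plugin = norm.pdf(y_new, loc=theta_star, scale=1.0)          # p(y' | theta*)
print(p_full)     # averages over posterior uncertainty in theta
print(p_plugin)   # ignores that uncertainty, so it tends to be overconfident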

In other words, full Bayes concerns itself only with posterior expectations and accurate estimation of those expectations.

If you build a model but use an approximate expectation algorithm with arbitrary error (like the modal methods discussed by @Bob_Carpenter), then most of us would say that is not full Bayes. This would also apply to algorithms that might have reasonable error but haven’t been validated empirically or theoretically, such as general variational Bayes algorithms.

In particular, there are many in machine learning who advocate an approach where you build an ill-posed model and then let your computational algorithm “filter” it into something that works well in practice. Think early stopping, or selecting a single mode from a multimodal target distribution. This approach convolves the model with the algorithm, making it very hard to understand either.


Thanks for your explanation! This is very helpful. Because of the multimodality of the LDA model, it is impossible to integrate over all combinations of the latent variables for the documents in the whole corpus.

Thanks a lot! This extends my understanding of full Bayes. So approximation methods (Gibbs samplers and general variational Bayes) are all not full Bayes?

People have developed collapsed Gibbs samplers and collapsed VI, where the latent variables are integrated out when updating the parameters. Do these count as full Bayes?

I don’t quite understand the third paragraph about “filtering”; could you send me some relevant papers? Thanks!!!

It’s a bit more subtle than that.

If you have a well-behaved posterior (roughly speaking, unimodal, without too much posterior correlation, and with tails that aren’t too thick or too thin), then Gibbs will converge to the right answer given enough compute time. In any finite amount of run time, there will be some stochastic error. The technical details are in the MCMC central limit theorem and in geometric ergodicity.

Contrast this with basic floating point arithmetic. That is also approximate in that it only gives you fixed precision and there are gaps between the numbers. It’s not really continuous, just an approximation. You can get funny results for a, b > 0 like a + b = a.
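
For example, in double precision:

a = 1.0e16
b = 1.0
print(a + b == a)   # True: b falls below the spacing between representable numbers near a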

Variational inference is quite different from either of these in that you can give it as much floating-point precision and as much compute time as you want, and it won’t converge to the true posterior (except in limited cases where the posterior is multivariate normal). How good the approximation is will depend on how normal the posterior is and, in the mean-field case, how little correlation there is.

The problem with LDA is that it’s not well behaved. The number of modes explodes combinatorially, which is easy to see in any real example: each place you choose to initialize will lead to different estimates. It doesn’t matter whether you try to fit it with collapsed Gibbs, variational inference, Hamiltonian Monte Carlo, Metropolis-Hastings, or simple maximum likelihood.

That’s interesting! So besides the requirement that the inference approach the true posterior, the posterior needs to be computationally feasible. An LKJ prior on the correlation matrix is a way to deal with the numerical instability of the Wishart update, but it can’t solve the problem of not being fully Bayesian.

I’m trying to derive the VI updates with an LKJ prior on the correlation matrix and a gamma prior on the variance components, as in Giordano, Ryan, Tamara Broderick, and Michael Jordan. “Robust inference with variational Bayes.” arXiv preprint arXiv:1512.02578 (2015). But it is taking a long time, and I’m not sure I’ll get there in the end…

Another way to circumvent the problem is to put constraints on the precision matrix, as in Won, Joong Ho, and Seung-Jean Kim. “Maximum likelihood covariance estimation with a condition number constraint.” Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC ’06), IEEE, 2006. But I don’t actually know how to insert such constraints into the variational inference process… Do you have any suggestions? Thanks!!!