Hi! I’m trying to use Stan to fit a Gaussian LDA model. The situation is that I have a list of genes, where each gene has different number of regions. Each region is characterized by a vector assumed to be in multivariate normal distribution. The model is described in http://www.aclweb.org/old_anthology/P/P15/P15-1077.pdf.
My stan code is attached. But unfortunately I got an Error as below Any suggestion is greatly appreciated!
Stan code:
LDA_GMM_code = """
data{
// observations
int<lower=2> K; // Number of modules
int<lower=2> V; // Number of features (dimensions)
int<lower=2> M; // Number of genes
int<lower=2> N; // Number of regions for all genes
vector[V] w[N]; // Observations (region x features)
int<lower=1,upper=M> gene[N]; // gene ID for region n
// priors
real alpha;
real nu;
cov_matrix[V] W;
vector[V] m;
real beta;
}
transformed data{
vector<lower=0>[K] alpha_vec;
for (k in 1:K){
alpha_vec[k] = alpha; //symmetric dirichlet prior
}
}
parameters{
simplex[K] theta[M];
vector[V] mu[K];
cov_matrix[V] lambda[K];
}
model{
// priors
for (d in 1:M)
theta[d] ~ dirichlet(alpha_vec);
for (k in 1:K){
lambda[k] ~ wishart(nu, inv(W));
mu[k] ~ multi_normal_prec(m, beta * lambda[k]);
}
// log marginal (summing over all z) probability of the inputs
for (n in 1:N){
real gamma[K];
for (k in 1:K)
gamma[k] <- log(theta[gene[n],k]) + multi_normal_lpdf(w[n]|mu[k], inv(lambda[k]));
increment_log_prob(log_sum_exp(gamma)); // multiplies the probabilities for each state k on the log scale
}
}
"""
Error:
Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: wishart_lpdf: LDLT_Factor of scale parameter is not positive definite. last conditional variance is nan. (in 'unkown file name' at line 46)
... Repeated
Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: wishart_lpdf: LDLT_Factor of scale parameter is not positive definite. last conditional variance is nan. (in 'unkown file name' at line 46)
Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/api.py", line 395, in stan
n_jobs=n_jobs, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/model.py", line 725, in sampling
ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pystan/model.py", line 86, in _map_parallel
map_result = list(map(function, args))
File "stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398.pyx", line 368, in stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398._call_sampler_star
File "stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398.pyx", line 401, in stanfit4anon_model_2bf781a5074c0e9a94b7ea6251164595_3606004117030050398._call_sampler
RuntimeError: Initialization failed.
The data is simulated as below:
import numpy as np
from numpy.random import dirichlet
from numpy.random import multinomial
from numpy.random import multivariate_normal
from scipy.stats import wishart
## topics for each document
alphai = 0.25 ## prior of Dirichlet process
M = 20 ## number of genes
K = 4 ## number of clusters
Nd = np.random.choice(range(50,150),M) ## number of regions in each gene
N = np.sum(Nd) ## number of total regions
V = 20 ## dimension of features
theta = dirichlet(np.ones(K)*alphai, M)
## cluster assignment for each region
word_assginment_z = []
for d in xrange(M):
word_assginment_z.append([multinomial(1, theta[d]) for t in xrange(Nd[d])])
word_assginment_z[d] = np.array(word_assginment_z[d])
## distribution of mu and precision matrix for each module
nu0 = V + 5.0
W0 = np.identity(V) * 0.018 ## Tuned to avoid too large/small det(W0)
m0 = np.zeros(V)
beta0 = 1.0
Lambbda = wishart.rvs(df=nu0, scale=W0, size=K, random_state=0)
mu = [multivariate_normal(m0, np.linalg.inv(beta0 * Lambbda[k]), size=1) for k in xrange(K)]
## generate word values
words_values = []
for d in xrange(M):
words_values.append([])
for n in xrange(Nd[d]):
zzz = np.where(word_assginment_z[d][n,:])[0][0]
words_values[d].append(multivariate_normal(mu[zzz][0], np.linalg.inv(Lambbda[zzz]), size = 1)[0])
words_values[d] = np.array(words_values[d])
gene_dat = {'alpha': alphai,
'nu': nu0,
'W': W0,
'm': m0,
'beta': beta0,
'K': K,
'V': V,
'M': M,
'N': np.sum(Nd),
'w': [a for b in words_values for a in b],
'gene': [t for i, j in zip(range(1,M+1), Nd) for t in np.repeat(i,j)]}