LDA for word proportions fit issue

I’m trying to fit a variant of vanilla LDA that takes the word distribution of each document as data. So I take
\beta_k = topic-word distributions ~ Dir(\gamma)
\theta_d = per-document topic proportions ~ Dir(\alpha)
\xi_d = observed per-document word proportions ~ Dir(B^T \theta_d)
where B is the K x V matrix with the \beta_k as row vectors.
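
In other words, the Dirichlet parameter for document d is the topic-weighted mixture of the topic-word distributions (writing \eta_d for what the code below calls eta):

\eta_d = B^T \theta_d = \sum_{k=1}^K \theta_{d,k} \beta_k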

Here’s my stan code

data {
  int<lower=1> K; // num topics
  int<lower=1> V; // num words
  int<lower=0> D; // num docs
  simplex[V] xi[D]; // word proportions for each doc

  // hyperparameters
  vector<lower=0>[K] alpha;
  vector<lower=0>[V] gamma;
}

parameters {
  simplex[K] theta[D]; // topic mixtures
  simplex[V] beta[K]; // word dist for k^th topic
}

model {
  for (d in 1:D) {
    theta[d] ~ dirichlet(alpha);
  }

  for (k in 1:K) {
    beta[k] ~ dirichlet(gamma);
  }

  for (d in 1:D) {
    vector[V] eta;
    eta = beta[1] * theta[d, 1];

    for (k in 2:K) {
      eta = eta + beta[k] * theta[d, k];
    }
    xi[d] ~ dirichlet(eta);
  }

}

When I try to fit this model, I get the usual Stan error

Rejecting initial value:
Log probability evaluates to log(0), i.e. negative infinity.
Stan can’t start sampling from this initial value.

Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Error in sampler$call_sampler(c(args, dotlist)) : Initialization failed.

I must have missed a constraint, or maybe the model is flawed. Any help is greatly appreciated!

Try setting init_r to some value closer to 0 than the default of 2.

I tried 1, 0.1, and 0.01 for init_r and unfortunately get the same error.

I don’t think it should matter, but my observed proportion vectors \xi_d are pretty sparse – they have length 2582 with roughly 80 percent zeros.

Yes, that could matter. What are you using for alpha and gamma?

So far I’ve been fitting the model with

  • \alpha and \gamma uniform
  • K = 4
  • V = 2582
  • D = 54

EDIT: Also, I forgot a crucial piece of information. I’ve been fitting this with variational Bayes so far.

All bets are off then

I switched to standard MCMC sampling in Stan and printed the log posterior before and after each sampling statement. It seems the \xi_d are the issue. When I modify the last loop to be

for (d in 1:D) {
    vector[V] eta;
    eta = beta[1] * theta[d, 1];

    for (k in 2:K) {
      eta = eta + beta[k] * theta[d, k];
    }

    print("log-posterior = ", get_lp());
    xi[d] ~ dirichlet(eta);
    print("log-posterior = ", get_lp());
  }

I find the log posterior is finite before the first \xi_d sampling statement and infinite after. It’s unclear to me whether this is because of poor initialization of the parameters or because the \xi data are highly sparse.
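
For reference, if I have the density right, the Dirichlet log likelihood for each document includes the term

\sum_{v=1}^V (\eta_{d,v} - 1) \log(\xi_{d,v})

so any exact zero in \xi_d sends the log posterior to plus or minus infinity (zeros lie outside the support of the Dirichlet), which would fit with roughly 80 percent of my \xi_d entries being zero.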

Try printing eta and xi[d] as well.
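
For example, something like this just before the sampling statement (a sketch printing the minima, since the full length-2582 vectors would be unwieldy):

print("min(eta) = ", min(eta));
print("min(xi[d]) = ", min(xi[d]));
xi[d] ~ dirichlet(eta);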

The eta computation can also be simplified. beta is an array of vectors, so it can’t be multiplied elementwise by theta[d], but the loop collapses to:

vector[V] eta = rep_vector(0, V);
for (k in 1:K)
  eta += theta[d, k] * beta[k];

(Or pack the beta[k] into a V x K matrix B in transformed parameters and write eta = B * theta[d].)

What’s making sure that eta > 0?
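
And if the zeros in \xi_d turn out to be the problem, one workaround (a sketch; the name xi_smooth and the 1e-6 constant are just placeholders) is to nudge the observed proportions off the simplex boundary in transformed data and model those instead:

transformed data {
  simplex[V] xi_smooth[D]; // xi pushed away from the simplex boundary
  for (d in 1:D)
    xi_smooth[d] = (xi[d] + rep_vector(1e-6, V)) / (1 + V * 1e-6);
}

Then use xi_smooth[d] ~ dirichlet(eta); in the model block.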