Assigning to subsets of a matrix

Hi all,

I am trying to implement a model for detecting “influence” in word usage in House of Commons debates. The idea is to model the words of a given speech as a mixture of (a) that speech’s own word distribution and (b) the word distributions of all previous speeches in the debate.

The outcome W[v,d] is the count of times that word v occurs in document d, which is Poisson-distributed. On the right-hand side we have Beta, a V-by-D matrix of word parameters for each document, and Theta, a D-by-D parameter matrix that aims to capture how much each document “influences” each other document.

Currently, each row of the influence matrix is drawn from a Dirichlet distribution (Theta[d] ~ dirichlet(delta)), where delta is a hyperparameter vector of length D. But this set-up implies that documents that occur later can influence those that occur earlier. I would like to constrain some elements of Theta to be zero, so that I have something like the following (for D = 3):

           Document_1 Document_2 Document_3
Document_1      1.000      0.000      0.000
Document_2      0.367      0.633      0.000
Document_3      0.215      0.196      0.589

That is, for document 1 I would like to draw from a Dirichlet of length 1, for document 2 a Dirichlet of length 2, and so on.

I have tried using sub_row, block, and segment to draw from subsets of my delta Dirichlet prior and assign to subsets of Theta, but to no avail. I’ve also tried zeroing out the unneeded cells of Theta in the transformed parameters block, but then of course the rows no longer sum to 1, so the Dirichlet distributions no longer apply.

Any pointers on a way forward would be great. I’ve included the full model code below. Apologies if there is not enough detail here.

Thanks,

Jack

data {
  int<lower=2> V;                // num words
  int<lower=1> D;                // num docs
  int<lower=0> W[V, D];          // document-feature matrix as integer array
}

parameters {
  real phi[D];                   // document length parameter
  matrix[V, D] Beta;             // standardized word parameter for document d
  simplex[D] Theta[D];           // influence of each document on each other document
  vector<lower=0>[D] delta;      // Dirichlet concentration; no prior below, so implicitly uniform on (0, inf)
}

transformed parameters {
  matrix[V, D] lp;
  matrix<lower=0>[V, D] mu;
  for (d in 1:D) {
    for (v in 1:V) {
      lp[v, d] = phi[d] + dot_product(Beta[v], Theta[d]);   // linear predictor
      mu[v, d] = exp(lp[v, d]);                             // mean
    }
  }
}

model {
  // priors
  to_vector(Beta) ~ normal(0, 2);
  phi ~ normal(0, 2);
  for (d in 1:D)
    Theta[d] ~ dirichlet(delta);

  // likelihood: mu is already exp(lp), so no second exp here
  for (d in 1:D)
    for (v in 1:V)
      W[v, d] ~ poisson(mu[v, d]);
}

Section 25.5 of the manual, Matrices with Parameters and Constants, explains how to do what you’re asking: declare parameters only for the free cells and fill in the structural zeros in the transformed parameters block.
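Here’s a minimal sketch of that approach for your ragged-simplex case (untested, and the names are illustrative). It uses the standard gamma-normalization construction of the Dirichlet: if g_i ~ gamma(delta_i, 1) independently, then g / sum(g) ~ dirichlet(delta). So the free parameters are one positive variate per lower-triangular cell, and each row of Theta comes out a simplex of the right length. For simplicity this treats delta as fixed data rather than a parameter:

data {
  int<lower=1> D;
  vector<lower=0>[D] delta;                    // treated as data here for simplicity
}

parameters {
  vector<lower=0>[D * (D + 1) / 2] g;          // one positive variate per free cell
}

transformed parameters {
  matrix[D, D] Theta = rep_matrix(0, D, D);    // structural zeros above the diagonal
  {
    int pos = 1;
    for (d in 1:D) {
      real row_total = sum(segment(g, pos, d));
      for (j in 1:d)
        Theta[d, j] = g[pos + j - 1] / row_total;   // row d is a length-d simplex
      pos = pos + d;
    }
  }
}

model {
  int pos = 1;
  for (d in 1:D) {
    // normalized gamma(delta, 1) variates are dirichlet(delta) on the first d cells
    segment(g, pos, d) ~ gamma(head(delta, d), 1);
    pos = pos + d;
  }
}

The zeros above the diagonal are constants, so they cost nothing, and the rows of Theta can go into your linear predictor exactly as before. If you want to keep delta as a parameter, it will need a proper prior.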

Words in documents tend to be much more overdispersed than Poisson, so you might want to consider at least a negative binomial à la Mosteller and Wallace (1964).
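For example, the swap in your model would look something like this (untested; phi_nb is just an illustrative name and the exponential prior a placeholder), reusing lp from your transformed parameters:

parameters {
  real<lower=0> phi_nb;              // overdispersion: Var[W] = mu + mu^2 / phi_nb
}

model {
  phi_nb ~ exponential(1);           // placeholder prior; larger phi_nb is closer to Poisson
  for (d in 1:D)
    for (v in 1:V)
      W[v, d] ~ neg_binomial_2_log(lp[v, d], phi_nb);   // log-scale parameterization, no exp needed
}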

I’d suggest trying to think more generatively about this problem, along the lines of the time-series LDA (dynamic topic) model of Blei and Lafferty.

But be forewarned, you can’t do Bayesian inference over latent mixture models, even simple ones like K-means, much less complex ones like LDA—the posterior multimodality is combinatorially intractable.

Propagating effects back in time or around in space is typical of Bayesian (and other) time series or spatial models, because information flows both ways through a model.

Many thanks, Bob, that is very helpful.

Yes, indeed, this project takes inspiration from those types of models, and the model I pasted above is a toy example and therefore far from complete!

I will try the approach in 25.5 and see how I get on.

Thanks again.

Hi Bob, what do you mean that Bayesian inference won’t work on LDA? There are a number of LDA examples in the documentation, so I’m a bit confused…

I’m not Bob, but I’ll respond to this one.

First, you’ll want to read Bob’s post, “Bayesian inference on LDA is intractable.” That explains why Bayesian inference doesn’t work on LDA.

LDA, like any other statistical model, can be written in Stan. But when the ML community talks about different variants of LDA, they’re really talking about a statistical model plus an inference algorithm: when most people say LDA, they mean LDA solved with some specific variational inference scheme, or even LDA solved with online variational inference. The many variants of LDA usually differ in one of two ways: a change to the statistical model or a change to the inference method (typically a different variational approximation). In short, the statistical model can be expressed fine in the Stan language; it’s the inference part that will trip you up.

If you wanted to do something close, you could just run optimization. Or even ADVI. But just know that you’ll end up in a mode (hard to tell if it’s a global mode) and the variance estimates won’t actually line up with the true variance of the posterior distribution (which is intractable).

Daniel, thanks so much for the explanation and the resources! It finally makes sense why everybody seems to have a different opinion on the topic. BTW, it was great meeting you at StanCon. Please keep up the great work!

I need to port that post to Andrew’s blog and also clarify all this more in the manual.

Lots of the literature is confused in that it purports to be doing Bayesian inference but really isn’t. For instance, lots of people build collapsed Gibbs samplers for LDA (you can find the math for that on my old blog), but those samplers won’t explore the full posterior.

And the variational approximation to an LDA solution doesn’t magically solve the multimodality problem—it’ll get stuck in a single mode just like sampling will. And you can’t solve this by tempering as the number of meaningful modes is combinatorially intractable—flattening won’t help you explore a super-exponential number of possibilities.
