Wordfish model (aka poisson ideal point model, poisson or multinomial IRT model): Fixing reflection invariance

Marko · July 26, 2018, 8:55am

Dear all,

I am trying to implement the so-called “wordfish model” in Stan. The model is a variant of a poisson ideal point model or poisson or multinomial IRT model, which is primarily applied in political science. The goal is to estimate the political position of parties or politicians based on the words, which they use in political texts (e.g., party manifestos or speeches). The data is typically organized in a document-feature-matrix, where rows are political texts, columns are features (e.g., words, word stems, n-grams), and cells contain counts (how often has party i used feature j?).

The counts are modelled as

counts_i,j ~ poisson(exp(alpha_i + psi_j + beta_j * theta_i),
where alpha is a vector of document fixed effects, psi is a vector of feature fixed effects, beta contains the feature positions and theta contains the document positions (following the notation of quanteda::textmodel_wordfish()).

The first implementation (which I am aware of) is Slapin & Proksch (2008), implementation with EM in R here. A more sophisticated implementation of the EM solution in C++ (via RCPP) is the textmodel_wordfish() function in the Quanteda package.

I want to implement a Stan version because (a) the existing versions are primarily interested in theta and therefore do not provide standard errors for the other parameters, (b) the standard errors seem to be rather small in many applications, which let’s me believe that the uncertainties are not completely propagated throughout the whole model, and © I want to be able expand the model, e.g. a hierarchical model with politicians in parties.

The good news is that the basic model is straightforward to implement in Stan (wordfish_discourse.stan for replication below):

data {
  int<lower=1> I; // Number of documents
  int<lower=1> J; // Number of features
  int<lower=1> N; // Number of counts (I * J in most cases)
  int<lower=0> counts[N]; // long vector of counts
  int<lower=1,upper=I> docs[N]; // index document to count
  int<lower=1,upper=J> features[N]; // index feature to count
}
parameters {
  vector[I] theta; // Document position
  vector[I] alpha; // Document fixed effects
  vector[J] beta; // Feature position
  vector[J] psi; // Feature fixed effects
}
model {
  vector[N] lambda;
  for (n in 1:N)
    lambda[n] = alpha[docs[n]] + psi[features[n]] + beta[features[n]] * theta[docs[n]]; // actually, log_lambda
  
  alpha ~ normal(0, 10); // non-informative
  psi ~ normal(0, 10); // non-informative
  beta ~ normal(0, 3); // some regularization like quanteda model 
  theta ~ normal(0, 1); // identify model with unit normal prior

  counts ~ poisson_log(lambda); // poisson ideal point model
}
generated quantities {
  vector[I] theta_std = (theta - mean(theta)) / sd(theta); // standardize theta for comparison with quanteda::wordfish()
}

The model is reasonably fast and recovers very similar parameter values as quanteda::textmodel_wordfish() in quite a few test cases (if there is no reflection invariance). The uncertainties of all parameters are also sensible. A simple example is included in the attachment.

The bad news is that I did not yet succeed in fixing the reflection invariance, that is, the chains converge to symmetric solutions where the documents are in exactly the opposite direction, but with about equal distances in theta, the same for the features in beta.

What I have tried so far (wordfish2.stan contains the commented out variations):

Splitting theta into ordered[2] and vector[I-2] and ordering the input data such that document 1 is plausibly to the left of document 2. Code:

parameters {
  ordered[2] theta_dir;
  vector[I-2] theta_rest;
  ...
transformed parameters {
  vector[I] theta = append_row(theta_dir, theta_rest);
  ...
model {
  ...
  theta_dir ~ normal(0, 1);
  theta_rest ~ normal(0, 1);

This prevents reflection invariance, but there are frequently solutions where theta_1 and theta_2 are estimated to be basically identical, with only a marginal distance to satisfy the order constraint. The same thing happens with an order constraint on beta or on both theta and beta.

Using an indicator for the most left and right documents and constraining its relationship with theta to be positive (as recommended in Gelman/Hill, p. 318). Code:

data {
  ...
  vector<lower=-1,upper=1>[I] left_right; // c(-1, 1, rep(0, (I-2)))
parameters {
  ...
  real<lower=0> b1;
model {
  ...
  theta ~ normal(b1 * left_right, 1);
  b1 ~ normal(0, 10);
  ...

This does not prevent reflection invariance at all.

Priors with means -1 and 1 for the first two values of theta. Code:

model {
  ...
  theta[1] ~ normal(-1, 1);
  theta[2] ~ normal(1, 1);
  for (i in 3:I) theta[i] ~ normal(0, 1);

This does not prevent reflection invariance at all.

In addition, I have set sum-to-zero constraints or have set one value to zero in psi and/or alpha. These constraints are not to fix the reflection invariance but are used in other implementations for identification.

Any ideas how to fix the reflection invariance are highly appreciated!

wordfish2.stan (1.5 KB)
wordfish_discourse.R (1.9 KB)
wordfish_discourse.stan (1.1 KB)

Marko · July 26, 2018, 2:20pm

P.S.: I found a rather simple solution: Simply switching the direction of theta and beta in the generated quantities block:

generated quantities {
  vector[I] theta_fix = theta[1] < theta[2] ? theta : theta * -1;
  vector[J] beta_fix = theta[1] < theta[2] ? beta : beta * -1;
}

However, I am unsure whether this is a good idea. More generally, is it ok to “fix” a problem only in the generated quantities block?

bbbales2 · July 26, 2018, 3:16pm

I don’t know for sure, but it sure sounds reasonable to me.

I’m guessing you might not see this sorta thing a lot because in general you won’t know where the symmetries are in a multimodal posterior unless there is something special going on in your problem that you understand.

Sounds like you have something special going on that you understand? So cool beans! We take those

Gregory · July 27, 2018, 2:58pm

I wonder if anyone else can chime in on this problem. I’ve often faced the same situation dealing with reflection invariance in analogous models. I can usually get around this by over-identifying the model, setting more constraints than would be theoretically necessary to identify the model. e.g. constraining theta_j < theta_j’ for multiple pairs of documents/users/etc. Maybe there’s not a general strategy to deal with this, but the problem is a perennial one, so any guidance on this would be useful. Thanks!

saudiwin · July 27, 2018, 7:34pm

Hi Marko -

As you’ve already found, there a bunch of different things you can do to fix reflection invariance. The issue with simply flipping the signs post-hoc is that you can get a different rotation every time you run the model, which can be frustrating when you are doing model comparison, etc.

My current methodology that I’m implementing in my R package idealstan is to do the following:

Choose a document/feature (probably a document in your case) to constrain to be positive. Generally speaking you want a document that is far out on one end of the spectrum or the other. If you don’t know, you can do a quick run with variational inference (vb function) to figure out a good candidate.
Then constrain another document/feature to be a fixed constant minus the constrained parameter. I’ve found that has the advantage of helping to identify the scale and also makes the positivity constraint more binding. You then rebuild a vector with those two constrained parameters.

So it would look something like the following:

parameters {
vector[I-2] theta_free;
vector<lower=0>[1] restrict_high;
}
transformed parameters {
vector[I] theta_full;
vector[1] restrict_low;
restrict_low = restrict_high - 4;
theta_full = append_row(theta_free,append_row(restrict_high,restrict_low)); 
// use theta_full in place of theta
// requires you to have ordered documents so that constrained parameters are last
}

I’m not saying that this ID method will necessarily work better than those you have tried, just that this method has worked pretty well for me and it’s efficient in terms of stan code.

Bob_Carpenter · July 28, 2018, 2:46pm

That may also be affected by false independence assumptions. For example, naive Bayes is notoriously uncalibrated for this reason.

You can enforce this by ordering one of your parameters. Michael Betancourt wrote a super helpful case study on mixture models explaining why this works, including why you wind up collapsing in some cases.

Rather than setting a variable to literally sum to zero, we’ve found it more useful to use soft constraints, e.g.,

alpha ~ ... its regular prior...
mean(alpha) ~ normal(0, 0.001);

You may have to play with the scale there, but it generally gives you the same answer to within multiple decimals places with much more efficiency (ignore the Jacobian warning from the compiler if it’s not shut off for mean).

I’d rewrite the local variable as this:

vector[N] log_lambda = alpha[docs] + psi[features] + beta[features] .* theta[docs];

Or I’d just drop that vecotrized expression directly into possion_log.

Marko · August 2, 2018, 2:17pm

Dear all, thank you so much for your replies!

Bob_Carpenter:

Rather than setting a variable to literally sum to zero, we’ve found it more useful to use soft constraints, e.g.,

[…]

I’d rewrite the local variable as this:
vector[N] log_lambda = alpha[docs] + psi[features] + beta[features] .* theta[docs];
Or I’d just drop that vecotrized expression directly into possion_log .

The suggestions on the local variable and the soft constraint work nicely and make the model more efficient, thanks! (Although I ended up with not constraining alpha; the sum-to-zero constraint was only to imitate the quanteda implementation, but did not improve the Stan model).

I read the mixture model case study, which is really interesting (and goes well beyond my current understanding of sampling geometries, tbh). My takeaway for my model was that my ordered[2] constraint on two documents (see approach 1 in the original post) tends to work poorly because the order of the two selected documents on theta is not sufficiently strong in the data - similar to Betancourt’s example with less than 1 SD between the two mixture components. Did I understand this correctly?

This suggestion immediately worked for both the toy example and my actual data set, thank you! However, the specification restrict_low = restrict_high - 4 and in particular the 4 seem like black box voodoo (the good kind, no offense!). Can you say anything about how you came up with the constant and the importance of choosing the “right” constant?

I also have another follow up: If I specify the model according to your method and then write

theta_full ~ normal(0, 1);

, I get the “non-linear transformation” warning:

Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
[...]

Is this a problem in this case?

Finally, I am still curious about opinions on my first fix, i.e., flipping theta and beta in the generated quantities block.

Yes, I know were the first two documents are relativ to each other on theta. This is the same information which I use when I set an order constraint or use @saudiwin’s method.

This is also the reason why I did not understand this concern. The information used for switching the scales is the same as for a order constraint, so I will get the same rotation. Or am I missing something here?

Overall, I would prefer this solution, because it is the most simple one and it seems to be somewhat more efficient. Are there any principled reasons against the ex post fix?

Thanks again!

saudiwin · August 2, 2018, 5:38pm

Hi Markus -

I think you have a fairly clever way of reversing the signs. There are some issues, though, including that multimodality can mean that you have parameters that reflect around more than one axis. So it could be the case that multiplying beta * -1 will yield corrected estimates (i.e. beta flips with theta), but not always. So of course if that method works for you, go for it, it may be a unique solution to this particular model.

You don’t need to put a prior on theta, just theta_free and restrict_high.

I chose that constant because if the two parameters are at either end of the polarity (which of course they needn’t be), then 4 equals 2 standard deviations either of side of zero given a standard normal prior. But really, you can use any number you want.

I believe this method works because it combines 1) Stan’s ability to positive constrained parameters with 2) an additional fixed piece of information added to the model. It’s known that the identification problem can be resolved by essentially dropping one of the parameters to be estimated, which is what that line of code does (assigns one of the parameters to 4 minus another one). So I think it’s the use of both #1 and #2 that makes it work well in these latent variable models–then again, maybe next week someone will discover something better!

Bob

Marko · August 2, 2018, 7:26pm

Hi Bob,
Thank you for your fast reply.

This is indeed an important caveat. It has not happened yet with a few example data sets and my actual data, but I do not know whether it cannot happen by design.

Yes, this makes the warning disappear and the model works perfectly fine. Just as a technicality: Strictly speaking, I can no longer describe the model as having a standard normal prior on theta. What would I write instead? Or would a standard normal on theta still be a sufficiently accurate description within a text, if the model is documented for all readers who are interested in the details?

This sound super reasonable, thank you for the explanation.

Btw, I checked your idealstan package - it looks super interesting! Too bad that the German parliamentary system is not well suited to roll call models ;)

Bob_Carpenter · August 2, 2018, 11:01pm

Yes, that sounds right.

Only if theta_full is a non-linear function of parameters. If it’s just seelcting some or rescaling by a constant, then you’ll be OK.

You’d have to work out the implied prior on the variable of interest by looking at the prior on the raw value and the transform. For instance, if alpha ~ normal(0, 1) then exp(alpha) is disitrubted lognormal(0, 1).

Marko · August 3, 2018, 7:50am

Dear all, thank all of you again for your input. I attach the two versions of the model, which are working best for me, in case anybody ever stumbles across this thread in the future.

The version with the ex post fix of the multimodality in theta and beta. For the fix to work, the first document must be to the left (i.e., lower) of the second document in theta (similar to quanteda::textmodel_wordfish()). This version worked best for me, may however fail if the simple fix does not generalize to all cases:
wordfish_expost.stan (1.0 KB)

The version with @saudiwin’s method of a positive constrained document and one document which is constrained to be below the first one. The implementation treats the first document as the negative one and the second as the positive one, similar to quanteda::textmodel_wordfish(). This version also worked well, but was somewhat less efficient (somewhat lower n_eff, higher Rhat).
wordfish_ident.stan (1.4 KB)

Best,
Marko

Bob_Carpenter · August 9, 2018, 1:40pm

Thanks so much for following up. It’s really helpful for the community.

Topic		Replies	Views
Identification problems: Unable to recover parameters in Multinomial IRT Modeling fitting-issues , python	4	1024	November 29, 2021
Trouble converting JAGS IRT model to Stan Modeling rstan , irt , hierarchical-model	3	92	July 28, 2025
Non convergence issue on polytomous IRT model Modeling	15	1961	June 21, 2023
Thurstonian model Modeling	14	3423	August 27, 2019
Translation of an SEM with Ordered Categorical Variables from WinBUGS Modeling fitting-issues , specification , performance , gpu , cmdstanr	21	1062	October 16, 2023

Wordfish model (aka poisson ideal point model, poisson or multinomial IRT model): Fixing reflection invariance

Related topics