Dear all,

I am trying to implement the so-called “wordfish” model in Stan. The model is a variant of a Poisson ideal point model, or a Poisson/multinomial IRT model, and is primarily applied in political science. The goal is to estimate the political positions of parties or politicians from the words they use in political texts (e.g., party manifestos or speeches). The data are typically organized in a document-feature matrix, where rows are political texts, columns are features (e.g., words, word stems, n-grams), and cells contain counts (how often did party i use feature j?).

The counts are modelled as

`counts_i,j ~ poisson(exp(alpha_i + psi_j + beta_j * theta_i))`,

where `alpha` is a vector of document fixed effects, `psi` is a vector of feature fixed effects, `beta` contains the feature positions, and `theta` contains the document positions (following the notation of `quanteda::textmodel_wordfish()`).
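To make the data-generating process concrete, here is a small simulation of the model in NumPy (all sizes and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

I, J = 5, 20                   # documents, features (illustrative sizes)
alpha = rng.normal(0, 0.5, I)  # document fixed effects
psi = rng.normal(0, 0.5, J)    # feature fixed effects
beta = rng.normal(0, 1.0, J)   # feature positions
theta = rng.normal(0, 1.0, I)  # document positions

# log-rate for every (document, feature) cell
log_lambda = alpha[:, None] + psi[None, :] + np.outer(theta, beta)

# counts ~ Poisson(exp(log_lambda)): an I x J document-feature matrix
counts = rng.poisson(np.exp(log_lambda))
print(counts.shape)  # (5, 20)
```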

The first implementation that I am aware of is Slapin & Proksch (2008); their EM implementation in R is available here. A more sophisticated implementation of the EM solution in C++ (via Rcpp) is the `textmodel_wordfish()` function in the quanteda package.

I want to implement a Stan version because (a) the existing versions are primarily interested in `theta` and therefore do not provide standard errors for the other parameters, (b) the standard errors seem to be rather small in many applications, which leads me to believe that the uncertainties are not completely propagated through the whole model, and (c) I want to be able to expand the model, e.g. into a hierarchical model with politicians nested in parties.

The good news is that the basic model is straightforward to implement in Stan (wordfish_discourse.stan for replication below):

```
data {
  int<lower=1> I;                    // number of documents
  int<lower=1> J;                    // number of features
  int<lower=1> N;                    // number of counts (I * J in most cases)
  int<lower=0> counts[N];            // long vector of counts
  int<lower=1,upper=I> docs[N];      // index of the document for each count
  int<lower=1,upper=J> features[N];  // index of the feature for each count
}
parameters {
  vector[I] theta;  // document positions
  vector[I] alpha;  // document fixed effects
  vector[J] beta;   // feature positions
  vector[J] psi;    // feature fixed effects
}
model {
  vector[N] lambda;  // actually log_lambda
  for (n in 1:N)
    lambda[n] = alpha[docs[n]] + psi[features[n]] + beta[features[n]] * theta[docs[n]];
  alpha ~ normal(0, 10);        // non-informative
  psi ~ normal(0, 10);          // non-informative
  beta ~ normal(0, 3);          // some regularization, like the quanteda model
  theta ~ normal(0, 1);         // identify the scale with a unit normal prior
  counts ~ poisson_log(lambda); // Poisson ideal point model
}
generated quantities {
  // standardize theta for comparison with quanteda::textmodel_wordfish()
  vector[I] theta_std = (theta - mean(theta)) / sd(theta);
}
```
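For completeness, the long-format data that this Stan program expects can be built from a document-feature matrix roughly like this (a sketch in NumPy; the tiny `dfm` matrix is made up, and in R one would do the equivalent with `as.vector()` and `rep()`):

```python
import numpy as np

# hypothetical I x J document-feature matrix of counts
dfm = np.array([[3, 0, 1],
                [0, 2, 5]])
I, J = dfm.shape

# 1-based document and feature index for every cell, in row-major order
docs, features = np.meshgrid(np.arange(1, I + 1), np.arange(1, J + 1),
                             indexing="ij")

stan_data = {
    "I": I,
    "J": J,
    "N": I * J,
    "counts": dfm.ravel().tolist(),         # long vector of counts
    "docs": docs.ravel().tolist(),          # document index per count
    "features": features.ravel().tolist(),  # feature index per count
}
print(stan_data["counts"])  # [3, 0, 1, 0, 2, 5]
```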

The model is reasonably fast and recovers parameter values very similar to those of `quanteda::textmodel_wordfish()` in quite a few test cases (as long as the reflection invariance does not kick in). The uncertainties of all parameters are also sensible. A simple example is included in the attachment.

The bad news is that I have not yet succeeded in fixing the reflection invariance: the chains converge to mirror-image solutions where the documents point in exactly the opposite direction, but at roughly equal distances, in `theta`, and likewise for the features in `beta`.
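The symmetry is easy to verify numerically: flipping the signs of both `theta` and `beta` leaves every `beta_j * theta_i` term, and hence the whole Poisson log-likelihood, unchanged (illustrated here with arbitrary simulated values):

```python
import numpy as np

def poisson_loglik(counts, alpha, psi, beta, theta):
    """Poisson log-likelihood of the wordfish model (up to the log y! constant)."""
    log_lam = alpha[:, None] + psi[None, :] + np.outer(theta, beta)
    return np.sum(counts * log_lam - np.exp(log_lam))

rng = np.random.default_rng(1)
I, J = 4, 10
alpha, psi = rng.normal(size=I), rng.normal(size=J)
beta, theta = rng.normal(size=J), rng.normal(size=I)
counts = rng.poisson(np.exp(alpha[:, None] + psi[None, :] + np.outer(theta, beta)))

ll = poisson_loglik(counts, alpha, psi, beta, theta)
ll_flipped = poisson_loglik(counts, alpha, psi, -beta, -theta)
print(np.isclose(ll, ll_flipped))  # True
```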

What I have tried so far (wordfish2.stan contains the commented out variations):

- Splitting `theta` into an `ordered[2]` and a `vector[I-2]` part and ordering the input data such that document 1 is plausibly to the left of document 2. Code:

```
parameters {
  ordered[2] theta_dir;
  vector[I-2] theta_rest;
  ...
transformed parameters {
  vector[I] theta = append_row(theta_dir, theta_rest);
  ...
model {
  ...
  theta_dir ~ normal(0, 1);
  theta_rest ~ normal(0, 1);
```

This prevents reflection invariance, but there are frequently solutions where theta_1 and theta_2 are estimated to be basically identical, with only a marginal distance to satisfy the order constraint. The same thing happens with an order constraint on `beta`, or on both `theta` and `beta`.

- Using an indicator for the leftmost and rightmost documents and constraining its relationship with `theta` to be positive (as recommended in Gelman & Hill, p. 318). Code:

```
data {
  ...
  vector<lower=-1,upper=1>[I] left_right; // c(-1, 1, rep(0, I - 2))
parameters {
  ...
  real<lower=0> b1;
model {
  ...
  theta ~ normal(b1 * left_right, 1);
  b1 ~ normal(0, 10);
  ...
```

This does not prevent reflection invariance at all.

- Priors with means -1 and 1 for the first two values of `theta`. Code:

```
model {
  ...
  theta[1] ~ normal(-1, 1);
  theta[2] ~ normal(1, 1);
  for (i in 3:I) theta[i] ~ normal(0, 1);
```

This does not prevent reflection invariance at all.
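For comparison purposes, I can at least align the mirror-image chains after sampling. Here is a post-processing sketch (not an identification fix within the model; `align_chains` is a made-up helper that flips any chain whose anchor document comes out on the "wrong" side):

```python
import numpy as np

def align_chains(theta_draws, beta_draws, anchor=0):
    """Flip the sign of (theta, beta) draws per chain so that the
    posterior mean of theta[anchor] is negative in every chain.
    Shapes: (chains, draws, I) and (chains, draws, J)."""
    theta_draws = theta_draws.copy()
    beta_draws = beta_draws.copy()
    for c in range(theta_draws.shape[0]):
        if theta_draws[c, :, anchor].mean() > 0:
            theta_draws[c] *= -1.0
            beta_draws[c] *= -1.0
    return theta_draws, beta_draws

# toy draws: chain 0 has theta[0] around -1, chain 1 is the mirror image
rng = np.random.default_rng(7)
base = rng.normal([-1.0, 1.0, 0.0], 0.1, size=(100, 3))
theta = np.stack([base, -base])               # (2 chains, 100 draws, I=3)
beta = np.stack([base[:, :2], -base[:, :2]])  # (2 chains, 100 draws, J=2)

theta_a, beta_a = align_chains(theta, beta)
print(theta_a[:, :, 0].mean(axis=1))  # both chains now near -1
```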

In addition, I have tried sum-to-zero constraints, or fixing one value to zero, in `psi` and/or `alpha`. These constraints are not meant to fix the reflection invariance; they are used in other implementations to identify the location of those parameters.

Any ideas on how to fix the reflection invariance are highly appreciated!

wordfish2.stan (1.5 KB)

wordfish_discourse.R (1.9 KB)

wordfish_discourse.stan (1.1 KB)