Identification problems: Unable to recover parameters in Multinomial IRT

I am trying to estimate the following multinomial IRT model:

\left[ y_{i1} \dots y_{iJ} \right] \sim Multinomial(\pi_{i1} \dots \pi_{iJ}, y_{i+})

where y_{ij} are the counts of word j in document i and y_{i+} is the total length of document i. The event probabilities are given by

\pi_{ij} = \frac{exp(\psi_{j} + \theta_i \beta_{j})}{\sum_1^J exp(\psi_{j} + \theta_i \beta_{j}) }

where \psi_j are document fixed effects, \beta_j are the ideological positions of words and \theta_i the ideological direction of documents.

These type of models, used especially in political science to recover the ideological position of documents, are known to have identification problems. First, I solve the problem of multiplicative aliasing by setting a unit prior on my latent parameter of interest \theta. Second, I solve the problem of reflectoin invariance as suggested in this post. As I understand, these two conditions should be sufficient to solve the identification problem. However, I have tested my model with simulated data and I am getting very strange results.

This is the model I am using:

data {
  int<lower=1> I; // Number of documents
  int<lower=1> J; // Number of features
  int<lower=1> N; // Number of counts (I * J in most cases)
  int<lower=0> counts[N];       // Long vector of counts
  int<lower=1,upper=I> docs[N]; // Long document index [1 1 1 ... I I I]
  int<lower=1,upper=J> word[N]; // Long feature  index [1 ... J 1 ... J] 
  int<lower=1,upper=I> neg_doc; 
  int<lower=1,upper=I> pos_doc; // Index for reference documents such that theta[neg_doc] < theta[pos_doc]
}
parameters {
  vector<lower=0>[I] theta; // Ideological direction of document
  vector<lower=0>[J] beta;  // Ideological direction of word
  vector<lower=0>[J] psi;   // Feature fixed effects
}
transformed parameters{ // Establish first word as a baseline (Lowe, 2016)
  vector<lower=0,upper=1>[N] Pi;      // Long vector of J-simplexes
  for (i in 1:I){
	Pi[J*(i-1)+1:J*i] = softmax(psi + theta[i] * beta); // https://mc-stan.org/docs/2_18/functions-reference/softmax.html
  }
}
model {
  psi ~ normal(rep_vector(0,J),rep_vector(10,J));  // Non-informative prior
  beta ~ normal(rep_vector(0,J),rep_vector(3,J));  // Regularization as in quanteda 
  theta ~ normal(rep_vector(0,I),rep_vector(1,I)); // Scale identification with unit normal prior

  for (i in 1:I){
    counts[J*(i-1)+1:J*i] ~ multinomial(Pi[J*(i-1)+1:J*i]); 
  }
}
generated quantities {
  // Scale identification (reflection invariance)
  vector[I] theta_fix = theta[pos_doc] < theta[neg_doc] ? theta : theta * -1;
  vector[J] beta_fix  = theta[pos_doc] < theta[neg_doc] ? beta  : beta  * -1;
}

This is the code I am using to generate my simulated data

import numpy as np
import random
from scipy.special import softmax

np.random.seed(1)

I = 5
J = 10
N = J * I
C = np.random.poisson(lam = 100, size = I)

theta = np.random.normal(size = I) 
theta = np.expand_dims(theta,axis = -1)
beta  = np.random.normal(scale = 3, size = J)
beta  = np.expand_dims(beta,axis = -1)
psi   = np.random.normal(scale = 3, size = J)
psi   = np.expand_dims(psi,axis = -1)

pi  = softmax(np.transpose(psi) + np.multiply(theta,np.transpose(beta)))

DTM = np.empty((I,J))
for i in range(0,I):
    DTM[i,:] = np.random.multinomial(C[i],pi[i,:])    

Finally, a comparison between the real values of \theta:

0.3190391,-0.24937038, 1.46210794, -2.06014071, -0.3224172

and the model’s fit


               mean se_mean     sd    2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta[1]       1.79  3.2e-3   0.39    1.16   1.52   1.75   2.01   2.67  14692    1.0
theta[2]       1.78  2.8e-3   0.39    1.15    1.5   1.73    2.0   2.65  18698    1.0
theta[3]        0.1  6.8e-4    0.1  2.7e-3   0.03   0.07   0.15   0.38  23224    1.0
theta[4]       2.65    0.03   0.89    0.79   2.33   2.84   3.25   4.03   1193    1.0
theta[5]       1.38  2.4e-3   0.28    0.89   1.19   1.36   1.55   1.98  12823    1.0

Any help or comment are appreciated, as I am not very familiar with Bayesian inference.

I noticed a mistake in my code (constraining my parameters to have a lower bound) and a couple of redundancies that I had to change (see code below). However, my problem still persists, ie. there is a huge difference between the actual values of the simulated parameter \theta and the estimates found by Stan.

I have some candidates to explain the bad performance of the model and the attempts at solving them:

  1. Identification. So far, I deal with identification by setting a very tight prior for \theta (my parameter of interest) and pinning down the first value of \psi to zero (see this post). I also deal with reflection invariance as mentioned in this post.

  2. Priors of the other two parameters are too vague. I don’t think this is the case, as I have played around with them and results do not seem to change.

  3. NUTS sampler and multimodality. This is where my lack of understanding of the topic will manifest, but I hope I can make it clear. My observation comes from the fact that when I check the fit of the model, the R_hat for my unconstrained parameters is way above zero, while this is not the case for constrained and generated quantities.

I provide also a plot of the trace, it surprises me that the distributions of some of the \psi is so wide:

This is the model:

data {
  int<lower=1> I; // Number of documents
  int<lower=1> J; // Number of features
  int<lower=1> N; // Number of counts (I * J in most cases)
  int<lower=0> counts[N];       // Long vector of counts
  int<lower=1,upper=I> docs[N]; // Long document index [1 1 1 ... I I I]
  int<lower=1,upper=J> word[N]; // Long feature  index [1 ... J 1 ... J] 
  int<lower=1,upper=I> neg_doc; 
  int<lower=1,upper=I> pos_doc; // Index for reference documents such that theta[neg_doc] < theta[pos_doc]
}
parameters {
  vector[I] theta; // Ideological direction of document
  vector[J] beta;  // Ideological direction of word
  vector[J-1] psi;   // Word fixed effects
}
transformed parameters{ // Establish first word as a baseline (Lowe, 2016)
  vector[J] psi_constrained;
  psi_constrained[1]   = 0;
  psi_constrained[2:J] = psi;
}
model {
  psi ~ normal(0,5);  // Non-informative prior
  beta ~ normal(0,3);  // Regularization as in quanteda 
  theta ~ normal(0,1); // Scale identification with unit normal prior

  for (i in 1:I){
    counts[J*(i-1)+1:J*i] ~ multinomial(softmax(psi_constrained + theta[i] * beta)); // https://mc-stan.org/docs/2_18/functions-reference/softmax.html); 
  }
}
generated quantities {
  // Scale identification (reflection invariance)
  vector[I] theta_fix = theta[pos_doc] < theta[neg_doc] ? theta : theta * -1;
  vector[J] beta_fix  = theta[pos_doc] < theta[neg_doc] ? beta  : beta  * -1;
}

I was curious and had a look – thank you for sharing a nice self-contained example.

There is a problem with the python code to simulate the data! The data is accidentally being generated by a somewhat different process than the intended model, which may account for some of the identifiability problems.

If we display the resulting shape (I, J) matrix DTM of word counts, we see:

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 98.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 88.],
       [ 0., 70.,  0.,  0.,  0.,  0.,  3.,  0.,  3., 14.],
       [ 2.,  0.,  3.,  0.,  1.,  0.,  0.,  5.,  0., 86.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0., 96.]])

the counts for the last word j=9 are much larger than the other words, which is strange. The counts for words j=3 and j=5 are uniformly zero for all documents, so that may be a cause of some of of the non identifiability.

The very high counts for j=9 is strange as it doesn’t seem like your script is trying to generate such assymetric synthetic data – it turns out there’s an error with this line:

pi  = softmax(np.transpose(psi) + np.multiply(theta,np.transpose(beta)))

the input argument to softmax is a shape (I, J) array, but by default scipy.special.softmax performs the operation over the entire flattened array of size I*J:

axis int or tuple of ints, optional
Axis to compute values along. Default is None and softmax will be computed over the entire array x .
scipy.special.softmax — SciPy v1.7.1 Manual

This produces an output pi where the sum over both dimensions is 1, so then each time the script samples from the multinomial distribution with np.random.multinomial(C[i],pi[i,:]) , the pi[i,:] array has total probability much less than 1. The documentation of np.random.multinomial says this about the second argument pvals=pi[i,:] :

Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1) .
numpy.random.multinomial — NumPy v1.21 Manual

So I think this explains how the problem with the softmax accidentally being over both dimensions I and J causes the symptom of very high counts when sampling the J-1th word.

When we fix the softmax to operate over the word dimension J by explicitly adding the axis=1 argument, then the script produces a new matrix DTM of word counts like so:

array([[ 5., 73.,  0.,  0.,  0.,  1., 17.,  1.,  1.,  0.],
       [33., 26.,  2.,  5.,  0.,  1., 16.,  3.,  0.,  2.],
       [ 0., 90.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [13.,  0., 48.,  0.,  5.,  0.,  0., 31.,  0.,  0.],
       [40., 36.,  2.,  0.,  1.,  1., 14.,  3.,  0.,  1.]])

That looks better! Here is the fixed version of the data generating script, with some sanity checks added:

import numpy as np
import random
from scipy.special import softmax

np.random.seed(1)

I = 5
J = 10
N = J * I
C = np.random.poisson(lam = 100, size = I)

theta = np.random.normal(size = I) 
theta = np.expand_dims(theta,axis = -1)
beta  = np.random.normal(scale = 3, size = J)
beta  = np.expand_dims(beta,axis = -1)
psi   = np.random.normal(scale = 3, size = J)
psi   = np.expand_dims(psi,axis = -1)

arg = np.transpose(psi) + np.multiply(theta,np.transpose(beta))

# for each document i in I, softmax over the word axis J
pi = softmax(np.transpose(psi) + np.multiply(theta,np.transpose(beta)), axis=1)

# sanity check: for each document i in I, we expect sum_{j in J} pi[i, j] = 1
assert pi.shape == (I, J)
assert np.all(np.isclose(np.sum(pi, axis=1), np.ones(shape=(I, ))))
assert np.all(pi >= 0.0)

DTM = np.empty((I,J))
for i in range(0,I):
    rhs = np.random.multinomial(C[i],pi[i,:])
    DTM[i,:] = np.random.multinomial(C[i],pi[i,:])

Re: the Stan model, I haven’t tried running it, but a couple of quick suggestions:

  1. the docs[N] and word[N] data inputs appear to be unused. If you don’t need them, it’d make the model easier to understand if the unused bits were completely removed.
  2. re: the earlier thread on reflection invariance, there’s a post from @Marko at the end attaching two different solutions. It looks like your model is based on the first solution. Have you tried out the second solution wordfish_ident.stan to see if that perhaps works better? Wordfish model (aka poisson ideal point model, poisson or multinomial IRT model): Fixing reflection invariance - #11 by Marko
1 Like