I was curious and had a look – thank you for sharing a nice self-contained example.
There is a problem with the Python code used to simulate the data! The data is accidentally being generated by a somewhat different process than the intended model, which may account for some of the identifiability problems.
If we display the resulting shape `(I, J)` matrix `DTM` of word counts, we see:
```
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 98.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., 88.],
       [ 0., 70.,  0.,  0.,  0.,  0.,  3.,  0.,  3., 14.],
       [ 2.,  0.,  3.,  0.,  1.,  0.,  0.,  5.,  0., 86.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0., 96.]])
```
The counts for the last word `j=9` are much larger than for the other words, which is strange. The counts for words `j=3` and `j=5` are uniformly zero across all documents, so that may be a cause of some of the non-identifiability. The very high counts for `j=9` are strange, as it doesn't seem like your script is trying to generate such asymmetric synthetic data. It turns out there's an error in this line:
```python
pi = softmax(np.transpose(psi) + np.multiply(theta,np.transpose(beta)))
```
The input argument to `softmax` is a shape `(I, J)` array, but by default `scipy.special.softmax` performs the operation over the entire flattened array of size `I*J`:
> axis : int or tuple of ints, optional
> Axis to compute values along. Default is None and softmax will be computed over the entire array `x`.

– scipy.special.softmax — SciPy v1.11.4 Manual
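To see the difference in isolation, here's a tiny standalone snippet (not from the original script; the array values are arbitrary, just for illustration):

```python
import numpy as np
from scipy.special import softmax

x = np.arange(6, dtype=float).reshape(2, 3)  # any (2, 3) array will do

flat = softmax(x)             # default axis=None: normalised over all 6 entries
rowwise = softmax(x, axis=1)  # normalised separately within each row

print(flat.sum())             # ~1.0: the whole flattened matrix sums to 1
print(flat.sum(axis=1))       # the individual rows do NOT each sum to 1
print(rowwise.sum(axis=1))    # [1. 1.]: each row sums to 1
```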
This produces an output `pi` where the sum over both dimensions is 1, so each time the script samples from the multinomial distribution with `np.random.multinomial(C[i], pi[i,:])`, the `pi[i,:]` array has total probability much less than 1. The documentation of `np.random.multinomial` says this about the second argument `pvals=pi[i,:]`:
> Probabilities of each of the p different outcomes. These must sum to 1 (however, the last element is always assumed to account for the remaining probability, as long as `sum(pvals[:-1]) <= 1`).

– numpy.random.multinomial — NumPy v1.26 Manual
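Here's a small standalone demonstration of that behaviour, using made-up probabilities that sum to well under 1 (analogous to the mis-normalised `pi[i,:]` rows):

```python
import numpy as np

np.random.seed(0)

# these "probabilities" only sum to 0.10, so numpy behaves as if the last
# entry carried the remaining mass: 1 - sum(pvals[:-1]) = 0.94
pvals = np.array([0.01, 0.02, 0.03, 0.04])

counts = np.random.multinomial(100, pvals)
print(counts)  # nearly all of the 100 draws land in the last cell
```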
So I think this explains the symptom: because the softmax is accidentally taken over both dimensions `I` and `J`, each `pi[i,:]` sums to much less than 1, the leftover probability mass gets assigned to the last element when sampling, and the last word `j = J-1` ends up with very high counts.
When we fix the softmax to operate over the word dimension `J` by explicitly adding the `axis=1` argument, the script produces a new matrix `DTM` of word counts like so:
```
array([[ 5., 73.,  0.,  0.,  0.,  1., 17.,  1.,  1.,  0.],
       [33., 26.,  2.,  5.,  0.,  1., 16.,  3.,  0.,  2.],
       [ 0., 90.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [13.,  0., 48.,  0.,  5.,  0.,  0., 31.,  0.,  0.],
       [40., 36.,  2.,  0.,  1.,  1., 14.,  3.,  0.,  1.]])
```
That looks better! Here is the fixed version of the data generating script, with some sanity checks added:
```python
import numpy as np
from scipy.special import softmax

np.random.seed(1)

I = 5    # number of documents
J = 10   # number of words (vocabulary size)
N = J * I

# document lengths (total word count per document)
C = np.random.poisson(lam=100, size=I)

# document-level parameters (one per document)
theta = np.random.normal(size=I)
theta = np.expand_dims(theta, axis=-1)

# word-level slopes (one per word)
beta = np.random.normal(scale=3, size=J)
beta = np.expand_dims(beta, axis=-1)

# word-level intercepts (one per word)
psi = np.random.normal(scale=3, size=J)
psi = np.expand_dims(psi, axis=-1)

arg = np.transpose(psi) + np.multiply(theta, np.transpose(beta))
# for each document i in I, softmax over the word axis J
pi = softmax(arg, axis=1)

# sanity check: for each document i in I, we expect sum_{j in J} pi[i, j] = 1
assert pi.shape == (I, J)
assert np.all(np.isclose(np.sum(pi, axis=1), np.ones(shape=(I, ))))
assert np.all(pi >= 0.0)

DTM = np.empty((I, J))
for i in range(0, I):
    rhs = np.random.multinomial(C[i], pi[i, :])  # (this draw is unused below)
    DTM[i, :] = np.random.multinomial(C[i], pi[i, :])
```
Re: the Stan model, I haven't tried running it, but a couple of quick suggestions:

- the `docs[N]` and `word[N]` data inputs appear to be unused. If you don't need them, it would make the model easier to understand if the unused bits were removed completely.
- re: the earlier thread on reflection invariance, there's a post from @Marko at the end attaching two different solutions. It looks like your model is based on the first solution. Have you tried the second solution, `wordfish_ident.stan`, to see if that perhaps works better? Wordfish model (aka poisson ideal point model, poisson or multinomial IRT model): Fixing reflection invariance - #11 by Marko