Low number of effective samples with 2PL latent space model / IRT

I’ve got the following 2PL IRT / latent space model which is taken from Bafumi, Joseph, Andrew Gelman, David K. Park, and Noah Kaplan. “Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation.” Political Analysis 13, no. 2 (April 1, 2005): 171–87.

The model is giving reasonable results but the number of effective samples is really low even with very high numbers of iterations. Is this normal? I was expecting a few more than 3 n_eff with 100,000 samples. Is there anything that can be done?

There are J = 354 respondents, K = 6 questions, with respondents only answering 1.7 questions on average.

Are there better priors? Would putting Gamma(2, 0.1) priors on the scales help?

Here’s the model:

data {
  int<lower=1> K;                  // # questions
  int<lower=1> J;                  // # respondents
  int<lower=1> N;                  // data items
  int<lower=1, upper=K> kk[N];     // question for n
  int<lower=1, upper=J> jj[N];     // respondent for n
  int<lower=0, upper=1> y[N];      // agree/disagree n
  vector[J] books_per_year_log10;  // log10 of number of books read per year
}
parameters {
  vector[J] alpha;             // ability for respondent j
  vector[K] beta;              // difficulty for question k
  vector[K] gamma;             // discrimination for question k
  real delta_0;                // prior intercept
  real<lower=0> delta_1;       // coef for books_per_year_log10, constrained to be positive to resolve reflection invariance
  real<lower=0, upper=20> sigma_alpha;
  real<lower=0, upper=20> sigma_beta;
  real<lower=0, upper=20> sigma_gamma;
}
model {
  vector[N] log_odds;
  delta_0 ~ normal(0, 1);
  delta_1 ~ normal(0, 1);
  alpha ~ normal(delta_0 + delta_1 * books_per_year_log10, sigma_alpha);
  beta ~ normal(0, sigma_beta);
  gamma ~ normal(0, sigma_gamma);
                                                                                                                             
  for (n in 1:N)               // likelihood
    log_odds[n] = gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]);
  y ~ bernoulli_logit(log_odds);
}
generated quantities {
  // adjust parameters to common scale as recommended in Bafumi, Gelman, Park, and Kaplan
  vector[J] alpha_adj;
  vector[K] beta_adj;
  vector[K] gamma_adj;
  alpha_adj = (alpha - mean(alpha)) / sd(alpha);
  beta_adj = (beta - mean(alpha)) / sd(alpha);
  gamma_adj = gamma * sd(alpha);
}
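The adjustment leaves the linear predictor unchanged, since with $m = \operatorname{mean}(\alpha)$ and $s = \operatorname{sd}(\alpha)$,

$$
\gamma_k (\alpha_j - \beta_k) = (s \, \gamma_k) \left( \frac{\alpha_j - m}{s} - \frac{\beta_k - m}{s} \right).
$$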

And here’s a bit of the summary output:

Inference for Stan model: anon_model_43a5c2b517d321af97808f8cab116c06.
4 chains, each with iter=100000; warmup=50000; thin=1; 
post-warmup draws per chain=50000, total post-warmup draws=200000.

                 mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha[0]         0.31    1.11   2.93  -7.55  -0.51   0.65   1.85   4.75      7   1.17
alpha[1]        -0.09    2.79   4.83 -13.87  -0.67   1.01   2.48   5.99      3   1.51
alpha[2]         0.11    4.27    7.4 -20.91  -0.47   1.89   4.01   9.53      3    1.6
alpha[3]         0.18    0.73   2.64  -6.64  -0.67   0.38   1.49   4.46     13   1.08
alpha[4]         1.02    2.26   5.06   -5.3  -1.35   -0.1   1.63  16.02      5   1.23
alpha[5]        -0.49    4.29   7.43 -22.05  -0.69   1.47   3.33   8.29      3   1.58
alpha[6]          1.1    2.97   5.14  -4.57   -1.5  -0.35   1.41  16.25      3   1.49
alpha[7]          0.7    1.32   3.23  -3.45  -0.93   0.06   1.34   10.0      6    1.2
alpha[8]         0.46    1.25   3.55  -9.11  -0.43   0.82   2.16   5.98      8   1.14
alpha[9]         1.15    4.04    7.0  -6.76  -2.29  -0.82   1.26  21.84      3   1.53
alpha[10]        0.79    1.44   6.26 -15.83  -0.63   1.16   3.31  11.76     19   1.05
alpha[11]       -0.65    4.39    7.6 -22.62  -0.81   1.46   3.31   8.24      3   1.66
alpha[12]        3.08    5.44   9.42  -5.27  -1.79  -0.53   2.14   30.8      3   1.82
alpha[13]        1.24    4.04    7.0  -6.59  -2.33   -0.9   1.37  21.56      3   1.68
alpha[14]       -0.58    4.36   7.55 -22.31  -0.74   1.52   3.37   8.24      3   1.65
alpha[15]        0.43    1.28   6.15 -15.85  -0.95    0.8   2.82  11.39     23   1.04
alpha[16]       -0.45    4.59   7.96 -23.25  -0.68   1.74   3.75   8.97      3   1.67
alpha[17]       -0.45    4.27    7.4 -21.74  -0.66    1.5   3.38   8.28      3   1.59
alpha[18]        1.55    3.38   6.76  -5.56   -1.7  -0.41   1.54  21.85      4   1.45
alpha[19]        0.75    0.04   2.46  -3.87   -0.4   0.58   1.75   6.24   3096    1.0

Those parameters have very very wide posteriors—we’re on a logit scale, so values like -20 are pretty unreasonable.

I’m pretty sure the problem you’re having is one of identification. You have both additive (alpha/beta) and multiplicative (gamma/{alpha, beta}) non-identifiability in these models. You should try pinning the scale of beta to normal(0, 1).
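For example, one way to pin that scale is to drop sigma_beta altogether and write (just a sketch of the change, not necessarily the only way to do it):

  beta ~ normal(0, 1);          // fixed unit scale, replacing beta ~ normal(0, sigma_beta)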

We have a ton of tutorials and case studies on IRT models that go over the issues and Gelman and Hill cover it extensively in their book.

We don’t recommend those constrained ranges for scale parameters (here [0, 20]). Just put a weakly informative prior on them instead. Both the stats and the computation go much better if the unconstrained posterior would otherwise put any mass near the boundary. Also, when you use a range like this, initialization is by default in the middle of that range.
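For instance, a minimal sketch of that change (the half-normal scale of 5 is only an illustrative choice, not a recommendation from this thread):

  real<lower=0> sigma_alpha;    // parameters block: lower bound only, no upper=20
  sigma_alpha ~ normal(0, 5);   // model block: weakly informative half-normal prior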

You can vectorize that likelihood as

y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - beta[kk]));

Thanks Bob. The vectorization is great.

I took out the constraint (which I think I borrowed from https://github.com/stan-dev/example-models/blob/master/knitr/irt/irt-multilevel.stan). I’ll try using normal(0, 1) on the beta scale. I had gamma(2, 0.1) (in a subsequent model), which is probably too wide judging from the results.

FYI here’s the new model which is fitting much better.

data {
  int<lower=1> K;                  // # questions
  int<lower=1> J;                  // # respondents
  int<lower=1> N;                  // data items
  int<lower=1, upper=K> kk[N];     // question for n
  int<lower=1, upper=J> jj[N];     // respondent for n
  int<lower=0, upper=1> y[N];      // agree/disagree n
  vector[J] books_per_year_log10;  // log10 of number of books read per year
}
parameters {
  vector[J] alpha;             // ability for respondent j
  vector[K] beta;              // difficulty for question k
  vector[K] gamma;             // discrimination for question k
  real delta_0;                // prior intercept
  real<lower=0> delta_1;       // coef for books_per_year_log10, constrained to be positive to resolve reflection invariance
  real<lower=0> sigma_alpha;
  real<lower=0> sigma_beta;
  real<lower=0> sigma_gamma;
}
model {
  sigma_alpha ~ gamma(2, 0.1);
  sigma_beta ~ normal(0, 1);
  sigma_gamma ~ gamma(2, 0.1);
  delta_0 ~ normal(0, 1);
  delta_1 ~ normal(0, 1);
  alpha ~ normal(delta_0 + delta_1 * books_per_year_log10, sigma_alpha);
  beta ~ normal(0, sigma_beta);
  gamma ~ normal(0, sigma_gamma);

  y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - beta[kk]));
}
generated quantities {
  // adjust parameters to common scale as recommended in Bafumi, Gelman, Park, and Kaplan
  vector[J] alpha_adj = (alpha - mean(alpha)) / sd(alpha);
  vector[K] beta_adj = (beta - mean(alpha)) / sd(alpha);
  vector[K] gamma_adj = gamma * sd(alpha);
}

It was the normal(0, 1) that did it, I think. The only thing I changed (vs. a subsequent version of the first model) was to switch sigma_beta ~ gamma(2, 0.1) to sigma_beta ~ normal(0, 1).

That identifies the scale. When you have gamma[kk] * (alpha[jj] - beta[kk]), you can multiply gamma by a constant and divide alpha and beta by the same constant and get the same result, or you can add the same constant to both alpha and beta. Pinning beta to have location zero deals with the additive non-identifiability, and pinning its scale to one deals with the multiplicative non-identifiability.
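Concretely, for any constant $c > 0$ and any shift $d$, the likelihood term is identical:

$$
\gamma_k (\alpha_j - \beta_k)
  = (c \, \gamma_k) \left( \frac{\alpha_j}{c} - \frac{\beta_k}{c} \right)
  = \gamma_k \bigl( (\alpha_j + d) - (\beta_k + d) \bigr).
$$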

This makes sense. So when I had Gamma(2, 0.1) on sigma_alpha, sigma_beta, and sigma_gamma, that just wasn’t informative enough to solve the additive aliasing problem, right? Clearly, given those priors, adding a constant to alpha and beta doesn’t yield exactly the same posterior density, because in theory the gamma(2, 0.1) should discourage higher values. I can see that it wouldn’t meaningfully discourage higher values, though.

I take it the problem is solved here, right? This is fully identified, and I only need to set an informative prior on one of the scales.

Yes, solved.

There’s a difference between solving the identification problem in principle and solving it so that you actually get reasonable posteriors. With soft identification through priors (like you did here), you need a reasonably tight prior or the values will drift all over the place and you’ll get a strong funnel-like structure.

The “hard” alternative is to use a sum-to-zero constraint on the actual values. Andrew and Jennifer’s book presents some nice alternatives. What you can’t do is what they recommend, which is to post-process the values; that won’t work in Stan because HMC is much better at racing off into the tails than Gibbs.

Identification in the ideal point model (where gamma can be negative or positive) with a small or modest amount of data is hard. An elaborate regression on the prior mean for alpha ended up working, but it took a lot of work.

I wonder if solving the additive aliasing would be easier if there were a sum-to-zero data type, alongside simplex and unit_vector.

I thought about that. There’s a description of how to do it in the manual:

parameters {
  vector[N-1] alpha_raw;
}
transformed parameters {
  vector[N] alpha;  // sum-to-zero constraint with N-1 degrees of freedom in alpha_raw
  alpha[1:(N-1)] = alpha_raw;
  alpha[N] = -sum(alpha_raw);
}

Thanks. This is a simple change.

I gave it a shot and it didn’t help the model. I think the fragility must be with the reflection invariance.

I realize there is a bit of Bafumi et al. that I haven’t been paying attention to but probably should before concluding that JAGS/ARMS might do something better than Stan: “In practice, it is also important to pick initial values for the parameters to respect these constraints, to avoid wasting computation time while the iterative algorithm ‘discovers’ the appropriate mode.”

It would be great if there were a foolproof way to fit the 2PL ideal point model, even in the case when the 2PL is “inappropriate”, i.e., when the 1PL is good enough. The diagnostics that Bafumi et al. propose don’t really help if you can’t get some samples from the posterior in the first place.

Gelman and Hill discuss the specific case of identifying ideal-point models in their regression book. The problem is specific to the ideal-point case: IRT models fit just fine for IRT data where the discrimination parameter doesn’t flip sign.

JAGS/BUGS won’t solve this problem—they’ll just mask it.