Initialization differences between normal and logistic IRT

Hello all,

As part of my general efforts to understand how initialization (among many other things) really works under the hood in Stan, I wrote the following toy application (partially derived from the vignette of the edstan package):

library(rstan)
library(edstan)

set.seed(1)

## Begin code from edstan vignette
# Set parameters for the simulated data
I <- 20
J <- 500
sigma <- .8
lambda <- c(-10*.05, .05, .5, -.025)
w_2 <- rnorm(J, 10, 5)
w_3 <- rbinom(J, 1, .5)
W <- cbind(1, w_2, w_3, w_2*w_3)
beta_free <- seq(from = -1, to = 1, length.out = I-1)

# Calculate or sample remaining variables and parameters
N <- I*J
ii <- rep(1:I, times = J)
jj <- rep(1:J, each = I)
beta <- c(beta_free, -1 * sum(beta_free))
rasch_theta <-  rnorm(J, W %*% matrix(lambda), sigma)
rasch_eta <- (rasch_theta[jj] - beta[ii])
rasch_y <- rbinom(N, size = 1, prob = plogis(rasch_eta))

# Assemble the data list using an edstan function
sim_rasch_list <- irt_data(y = rasch_y, ii = ii, jj = jj, 
                           covariates = as.data.frame(W), 
                           formula = NULL)
## End vignette code 

twopl_code <- "
data {
  int<lower=1> I;               // # questions
  int<lower=1> J;               // # persons
  int<lower=1> N;               // # observations
  int<lower=1, upper=I> ii[N];  // question for n
  int<lower=1, upper=J> jj[N];  // person for n
  int<lower=0, upper=1> y[N];   // correctness for n
  int<lower=-1, upper=1> polarity[I];
}
parameters {
  vector[I] diff;
  vector[I] disc_free;
  vector<upper=0>[I] disc_neg;
  vector<lower=0>[I] disc_pos;
  vector[J] theta_raw;
}
transformed parameters {
  vector[J] theta;
  real mean_theta_raw;
  real<lower=0> sd_theta_raw;
  vector[I] disc;
  // Identify location and scale
  mean_theta_raw = mean(theta_raw);
  sd_theta_raw = sd(theta_raw);
  theta = (theta_raw - mean_theta_raw) ./ sd_theta_raw;
  for (i in 1:I) {
    if (polarity[i] < 0) {
      disc[i] = disc_neg[i];
    }
    if (polarity[i] == 0) {
      disc[i] = disc_free[i];
    }
    if (polarity[i] > 0) {
      disc[i] = disc_pos[i];
    }
  }
}
model {
  diff ~ normal(0, 1);
  disc_free ~ normal(0, 1);
  disc_neg ~ normal(0, 1);
  disc_pos ~ normal(0, 1);
  theta_raw ~ normal(0, 1);
  //y ~ bernoulli_logit((disc[ii] .* theta[jj] - diff[ii]) * 1.7);
  y ~ bernoulli(Phi_approx(disc[ii] .* theta[jj] - diff[ii]));
}
"

sim_twopl_list <- within(sim_rasch_list, {
    polarity = rep(c(1, -1, 0, 0), length.out = I)
    y = ifelse(polarity[ii] < 0, 1 - y, y)
})

sim_twopl_fit <- stan(model_code = twopl_code,
                      data = sim_twopl_list,
                      chains = 2,
                      iter = 1000,
                      seed = 3)


print(sim_twopl_fit, c("lp__", "disc"))

If I run the code as written above, it prints a number of "Rejecting initial value" warnings before it finally initializes properly. If, however, I substitute the commented-out y ~ bernoulli_logit((disc[ii] .* theta[jj] - diff[ii]) * 1.7) in the Stan code for the line that follows it, initialization proceeds without any problems. (This also happens with different random seeds.) Since I have encountered initialization problems with similar models that also employ a normal (instead of logistic) link function, I am wondering whether this is a general pattern and, if so, why the logistic link is superior. Again, I am posing this question mainly to enhance my general understanding of initialization in Stan (which has been a continuing source of frustration for me), not because I care about this particular model. Any insight into these matters would be much appreciated. Thanks!
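
(For what it's worth, the factor 1.7 in the commented-out line is the usual scaling that makes the logistic CDF approximate the normal CDF, so I expect the two sampling statements to be statistically near-equivalent. A quick R check of the maximum discrepancy, on an arbitrary grid:)

x <- seq(-6, 6, by = 0.001)
max(abs(plogis(1.7 * x) - pnorm(x)))  # about 0.01, so the two likelihoods nearly coincide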

Devin

Computing the log density of a Bernoulli is easier numerically if you keep everything on the log scale. If you transform back and forth between probabilities and log probabilities, you can run into loss-of-precision issues.
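
A minimal R illustration of that loss of precision (plogis plays the role of Stan's inv_logit here, and 40 is just an arbitrarily large value of the linear predictor):

eta <- 40
log(1 - plogis(eta))                           # -Inf: 1 - plogis(40) underflows to exactly 0
plogis(eta, lower.tail = FALSE, log.p = TRUE)  # -40: computed entirely on the log scale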

The bernoulli function doesn’t like the parameter value 1 (which will happen here once the argument to Phi_approx is extreme enough). This has come up before: Initialization between (-2, 2) failed after 100 attempts. I think this is a bug. I made an issue (https://github.com/stan-dev/math/issues/879). Thanks for being patient and posting this.
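
Whatever the status of the bug, it is easy to see in R how a random initial value can push Phi_approx to exactly 1 in double precision and hence give a -Inf log density. Here phi_approx is just an R translation of the polynomial-logistic approximation Phi_approx is based on, and 10 is an illustrative value of the index disc .* theta - diff:

phi_approx <- function(x) plogis(0.07056 * x^3 + 1.5976 * x)  # R version of the approximation
phi_approx(10) == 1                                      # TRUE: the probability rounds to exactly 1
dbinom(0, size = 1, prob = phi_approx(10), log = TRUE)   # -Inf: a y = 0 observation hits log(0)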

Ah, I see. It’s not about the difference between logit and probit, but rather about how bernoulli_logit handles the scale transformations relative to doing them manually. Presumably it does so in a way that avoids the numeric overflow problems that occur when the transformations are done by hand. Thanks very much for making it an issue. For my purposes, a good solution would be to create a bernoulli_probit function that behaves analogously to bernoulli_logit, but perhaps there are reasons why that is not possible. Thanks again for the help!
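
(For what it's worth, the log-scale computation such a bernoulli_probit would need looks analogous to what R's pnorm exposes via log.p = TRUE; a quick illustration of the difference it makes in the tail, with -40 again just an illustrative value:)

eta <- -40
log(pnorm(eta))            # -Inf: pnorm(-40) underflows to 0 before the log is taken
pnorm(eta, log.p = TRUE)   # roughly -805: finite, computed on the log scale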

Devin

To follow up: it occurred to me that one way to get bernoulli_logit to behave almost exactly like the imaginary bernoulli_probit function I mentioned is to transform its argument in a way that adjusts for the difference between the logistic and normal CDFs. Bowling et al. (link), for example, suggest the function f(x) = 0.07056 \times x^3 + 1.5976 \times x, which all but eliminates the discrepancy between the normal CDF of x and the logistic CDF of f(x). (This is, in fact, the same polynomial that Phi_approx applies before inv_logit, which is why the two versions below agree so closely.) So, if we define the function

vector approx_norm(vector x) {
    vector[num_elements(x)] y;
    for (i in 1:num_elements(x)) {
        y[i] = 0.07056 * pow(x[i], 3) + 1.5976 * x[i];
    }
    return y;
}

in the functions block and change the sampling statement to

y ~ bernoulli_logit(approx_norm(disc[ii] .* theta[jj] - diff[ii]));

the resulting estimates are almost exactly the same as when

y ~ bernoulli(Phi_approx(disc[ii] .* theta[jj] - diff[ii]));

is used.