Why does VB struggle with bernoulli(inv_logit(...)) vs. bernoulli_logit(...)?

I’m implementing an IRT model and have followed the relevant chapter in the user guide to the letter (except for changing the kk index to ii, to match edstan’s format). Hence the model block is simply

model {
  alpha ~ std_normal();
  beta ~ normal(0, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  y ~ bernoulli_logit(gamma[ii] .* (alpha[jj] - (beta[ii] + mu_beta)));
}

I want to use vb() in R because my data is rather large, and the model above works fine. But when I rewrite it as

model {
  // ...
  y ~ bernoulli(inv_logit(gamma[ii] .* (alpha[jj] - (beta[ii] + mu_beta))));
}

Stan balks and outputs Error: stan::variational::advi::adapt_eta: Cannot compute ELBO using the initial variational distribution. Your model may be either severely ill-conditioned or misspecified.

Now, before you say “why not just stick to bernoulli_logit?”: it’s because eventually I want to generalize the two-parameter logistic to a three-parameter logistic, i.e.

\operatorname{Pr}[y_n = 1] = \delta_{ii} + (1 - \delta_{ii})\operatorname{logit}^{-1}(\gamma_{ii} [ \alpha_{jj} - (\beta_{ii} + \mu_{\beta}) ] )

which means inevitably having to split up bernoulli_logit into something like

model {
  // ...
  y ~ bernoulli(delta[ii] + (1 - delta[ii]) .* inv_logit(gamma[ii] .* (alpha[jj] - (beta[ii] + mu_beta))));
}

So I’m curious why bernoulli(inv_logit(...)) doesn’t work with vb(). Surprisingly, sampling() (with a small random subset of the data) works with either model (although the latter runs slower), but vb() does not. Any idea why? Is this a known issue?


P.S.: I suggest using edstan’s “aggression” data if you’re trying to replicate the error.

library(edstan)

data("aggression")
agg <- irt_data(y = aggression$dich, ii = aggression$item, jj = aggression$person)
vbsample <- vb(irt_model, data=agg)

The compound distributions (e.g., bernoulli_logit() vs. bernoulli(inv_logit())) apply the link function (in this case the inverse logit) in a more numerically stable way, avoiding under- and overflow when exponentiating.

The compound distributions also have analytic gradients, which makes the estimation more efficient (and more numerically stable) than autodiffing through the composed version.
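To make the overflow point concrete, here is a minimal sketch (the values are chosen arbitrarily, and this illustrates the idea rather than Stan’s actual internals). ADVI’s initial variational distribution can produce extreme draws of the linear predictor, and a single -inf log density is enough to make the ELBO incomputable, which is consistent with the adapt_eta error above:

// Sketch of the stability difference; illustrative values only.
real eta = -750;                                 // extreme draw of the linear predictor
real p = inv_logit(eta);                         // underflows to exactly 0 in double precision
real lp_naive = bernoulli_lpmf(1 | p);           // log(0) = -inf: the ELBO cannot be computed
real lp_stable = bernoulli_logit_lpmf(1 | eta);  // stays on the log scale, approx. -750 and finite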


Thank you for the explanation @andrjohns. Is there a workaround for this issue, or is the three-parameter logistic IRT generally not estimable with variational Bayes in Stan?

As mentioned before, sampling() works for the model. But with about 10,000 examinees and an 80-item test, MCMC sampling takes almost an entire day to finish, whereas VB takes only a few minutes.

You could maybe make progress by moving to the target += syntax along with bernoulli_logit_lpmf(), i.e.,

target += log(delta[ii] + (1 - delta[ii]) .* exp(bernoulli_logit_lpmf(y | gamma[ii] .* (alpha[jj] - (beta[ii] + mu_beta)))));

The workaround is to manually write the calculation in a more numerically stable form. I believe this should work for the three-parameter Bernoulli. Note the use of composed functions like log1m and log_inv_logit; all intermediate values stay on either the logit or the log scale.

vector[N] mu = gamma[ii] .* (alpha[jj] - (beta[ii] + mu_beta));
y ~ bernoulli_logit(
      log_sum_exp(log(delta[ii]), log1m(delta[ii]) + log_inv_logit(mu))
      - log1m(delta[ii]) - log1m_inv_logit(mu)
      );
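To see why that expression is the right argument for bernoulli_logit, write p_n = Pr[y_n = 1] = delta + (1 - delta) * inv_logit(mu), so that 1 - p_n = (1 - delta) * (1 - inv_logit(mu)). Then

\operatorname{logit}(p_n) = \log p_n - \log(1 - p_n) = \log\big(\delta + (1 - \delta)\operatorname{logit}^{-1}(\mu)\big) - \log(1 - \delta) - \log\big(1 - \operatorname{logit}^{-1}(\mu)\big)

whose three terms are exactly log_sum_exp(log(delta), log1m(delta) + log_inv_logit(mu)), log1m(delta), and log1m_inv_logit(mu). (If you expand and cancel, the whole expression also simplifies to log_sum_exp(log(delta), mu) - log1m(delta).)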

Thanks to both of you for the suggestions. I tried the code by @nhuurre, but there seem to be some type mismatches, because during compilation Stan complained

Ill-typed arguments supplied to function 'log_sum_exp'. Available signatures: 
(real, real) => real
(vector) => real
(row_vector) => real
(matrix) => real
(real[]) => real
Instead supplied arguments of incompatible type: vector, vector.

so I decided to just loop over the elements of the y vector instead.

model {
  alpha ~ std_normal();
  beta ~ normal(0, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  delta ~ beta(2,1);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);

  for (n in 1:N) {
    real mu_n = gamma[ii[n]] * (alpha[jj[n]] - (beta[ii[n]] + mu_beta));
    real lse = log_sum_exp(log(delta[ii[n]]), log1m(delta[ii[n]]) + log_inv_logit(mu_n));
    y[n] ~ bernoulli_logit(lse - log1m(delta[ii[n]]) - log1m_inv_logit(mu_n));
  }
}

I ran the code with the aggression data mentioned above, and it works. Maybe it’s just me, but the expression for the likelihood looks very convoluted now; it’s difficult to keep track of whether all the different log components actually produce the 3PL likelihood they’re supposed to.
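One way to reassure yourself is a spot check comparing the logit-scale expression against the directly computed 3PL probability at moderate values, where the naive form is safe. A hypothetical sketch (the check_* names and test values are mine, not part of the model above), e.g. in generated quantities:

generated quantities {
  // Hypothetical spot check with arbitrary test values: for moderate mu the
  // naive 3PL probability is safe to compute directly, so both forms should
  // agree to machine precision.
  real check_delta = 0.2;
  real check_mu = 2.5;
  real p_direct = check_delta + (1 - check_delta) * inv_logit(check_mu);
  real p_stable = inv_logit(
      log_sum_exp(log(check_delta), log1m(check_delta) + log_inv_logit(check_mu))
      - log1m(check_delta) - log1m_inv_logit(check_mu));
  // p_direct and p_stable should both be approx. 0.9393
}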

I’m also wondering whether new issues are to be expected if this model were further generalized into a multidimensional IRT model. I had originally started with this excellent blog post by @rfarouni but then ran into the bernoulli_logit() problem in the OP when extending it from 2PL to 3PL.