'Gradient evaluated at the initial value is not finite' with *very* simple multinomial model!

Hi everyone,

I’m running into issues with the error ‘Gradient evaluated at the initial value is not finite’ whilst trying to fit a very simple model (using cmdstanr). The data is a single observation from a 3-category multinomial distribution, and the simplex of multinomial probabilities, \theta, is parametrised by a single parameter, p, as \theta = (p, 0, 1 - p).

The issue originally arose in trying to specify multi-state capture-mark-recapture models, where the likelihood is a product of multinomials and the easiest way to define the multinomial probabilities is via matrix products. To understand what was going wrong, I abstracted away from the mark-recapture setting and simplified. Below is a pair of models that define exactly the same distribution, but the first works and the second doesn’t!

The data is simply

y <- c(30, 0, 70)

Here’s Model 1, which works perfectly:

data {
  array[3] int<lower=0> y;
}

parameters {
  real<lower=0,upper=1> p;
}

transformed parameters {
  simplex[3] theta;
  theta = [p,0,1-p]';
}

model {
  y ~ multinomial(theta);
}

Model 2 is very artificial, but is motivated by the multi-state models where the multinomial probabilities are defined using matrix products.

data {
  array[3] int<lower=0> y;
}

transformed data {
  matrix[3,2] A = [ [1, 1], [0, 1], [-1, 1] ];
  vector[3] b = [0,0,1]';
}

parameters {
  real<lower=0,upper=1> p;
}

transformed parameters {
  simplex[3] theta;
  vector[2] x = [p,0]';
  theta = A*x + b;
}

model {
  y ~ multinomial(theta);
}
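
For reference, multiplying the product out componentwise gives exactly the probabilities I intended (the middle entry comes out as 0 \cdot p + 0):

\theta = Ax + b =
\begin{pmatrix} 1 & 1 \\ 0 & 1 \\ -1 & 1 \end{pmatrix}
\begin{pmatrix} p \\ 0 \end{pmatrix} +
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} =
\begin{pmatrix} 1 \cdot p + 1 \cdot 0 \\ 0 \cdot p + 1 \cdot 0 \\ -1 \cdot p + 1 \cdot 0 \end{pmatrix} +
\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} =
\begin{pmatrix} p \\ 0 \\ 1 - p \end{pmatrix}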

As I understand it, then, Model 1 and Model 2 define exactly the same posterior: \theta = Ax + b equals (p, 0, 1 - p), just as defined directly in Model 1. So why does Model 2 throw the infinite gradient error?

I believe the error is related to the presence of the zero in the multinomial probabilities \theta = (p, 0, 1-p), but that cannot be the whole story, because it’s not an issue for Model 1. So I assume it is something to do with this zero and the way the autodiff / chain rule handles the matrix products that define \theta in Model 2?

I’d be very grateful for any explanations and solutions!


You put your finger on the source of the problem: the chain rule going bad. Let’s see where.

You can use slightly more compact notation here, e.g. declaring and defining theta in a single statement:
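
transformed parameters {
  simplex[3] theta = [p, 0, 1 - p]';
}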

As an aside, putting an exact zero in theta like this is super dangerous, as the likelihood is going to fail (evaluate to -infinity) whenever y[2] != 0.

Your second example is equivalent to the simpler example:

simplex[3] theta = [p, 0 * p, 1 - p]';

What’s going on is that we are going to pass 0 * p as the second entry of theta to the multinomial. Evaluating the multinomial log density involves the term log(theta[2]^y[2]) with y[2] = 0, i.e., multiply_log(0, 0 * p). The value of that term is fine (it’s 0), but its derivative with respect to theta[2] is y[2] / theta[2] = 0 / 0, which is not finite, and autodiff is going to propagate that back to p.
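
Spelled out with the chain rule, writing \theta_2 = 0 \cdot p for the second entry:

\frac{\partial}{\partial p} \, y_2 \log \theta_2 = \frac{y_2}{\theta_2} \cdot \frac{\partial \theta_2}{\partial p} = \frac{0}{0} \cdot 0,

which does not come out finite in floating point.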

Thus 0 * p will indeed behave differently than a constant 0 in this context, because with a constant there’s nothing to pass the infinite derivative on to.
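
Here’s about the smallest program I’d expect to trip the same error (an untested sketch; it just isolates the multiply_log term the multinomial evaluates internally):

parameters {
  real<lower=0, upper=1> p;
}
model {
  // value is multiply_log(0, 0) = 0, but the gradient with respect to
  // the second argument is 0 / 0, which chains back to p and produces
  // the non-finite gradient error at initialization
  target += multiply_log(0, 0 * p);
}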

I’m afraid there’s not much we can do to fix this on the autodiff side, as it would require symbolic math to reduce 0 * p to 0 without propagating derivatives down to p.
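
On the modeling side, one workaround (a sketch, assuming the zero entry really is structural) is to overwrite that entry with a literal 0 after the matrix product, so the autodiff graph has no edge from theta[2] back to p, just as in your Model 1:

transformed parameters {
  simplex[3] theta = A * [p, 0]' + b;
  // replace the parameter-dependent 0 * p entry with a constant zero,
  // so no derivative is propagated to p through this entry
  theta[2] = 0;
}

Alternatively, since the likelihood is only finite when y[2] == 0 anyway, you could drop the second category entirely and model the remaining counts with a binomial.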


Hey @Bob_Carpenter, just want to double-check that it’s totally clear to you that this couldn’t be fixed with a patch for stan-dev/math issue #2494 (multiply_log(0, 0) has infinite gradient). Not saying that it can, but it seems so closely related that I just wonder if it might.