Likelihood finite but gradient NaN, problem with parameter transformations?

Hi everyone

I’ve got yet another example of the gradients not being finite, but having a finite likelihood for the initial parameters. I’m reaching out for help because I’ve been trying to debug for days now! I have a very complex model, which I can share, but would like to check if the answer is simple first before asking for more involved help.

If I had to take a guess at my problem, it would be an underflow or overflow issue. I have about a dozen model parameters on very different scales: some best-fit values are around 1e-4 and others around 10,000, and all are positive. My thinking to get around this was to transform the parameters by their prior means, i.e.

parameters {
  real param1;
  real param2;
}
transformed parameters {
  real<lower=0> param1_tr = exp(param1 - 9);
  real<lower=0> param2_tr = exp(param2 + 9);
}
model {
  param1 ~ normal(0, 1);
  param2 ~ normal(0, 1);
  target += normal_lpdf(Y | mu, sigma);
}

such that param1_tr is around 1e-4 and param2_tr is around 1e4. These should be very close to the true values.
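
As a plain-Python sanity check (not part of the Stan model) that those offsets hit the intended scales:

```python
import math

# Sanity check of the centering in the transformed parameters above:
# with the unconstrained parameters near 0 (their prior mean), the
# -9 and +9 offsets inside exp() put the constrained values near the
# intended scales of 1e-4 and 1e4.
param1, param2 = 0.0, 0.0
param1_tr = math.exp(param1 - 9)  # ~1.2e-4
param2_tr = math.exp(param2 + 9)  # ~8.1e3

print(param1_tr, param2_tr)
```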

When I take a look at the gradients using grad_log_prob(fit, upars), where upars is a vector of zeros with a 1 for sigma, I get

 [1] NaN NaN NaN NaN NaN NaN NaN NaN
 [9] NaN NaN NaN 31295823
attr(,"log_prob")
[1] -15653355

There is a magic threshold for the initialisation values: if they are less than around -0.035, the gradients are finite. I find this quite puzzling.

My question is, would those transformations be the likely culprit? If so, could anyone suggest a better way to rescale the parameters?

Alternatively, would you expect this to be fine? If so, I’ll prepare a simulated dataset and upload the full model.

Many thanks for your help, I love this forum

Operating System: Ubuntu
Interface Version: rstan 2.21
Compiler/Toolkit: gcc

You didn’t say what the rest of your model is or what’s overflowing. Often you have to do everything on the log scale (using addition instead of multiplication and log-sum-exp instead of addition) to keep the gradients stable. In double-precision arithmetic, exp() overflows for arguments above roughly 710 and underflows to zero for arguments somewhat below -745.
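
To make that concrete, here is a quick plain-Python sketch (not Stan) of the double-precision limits of exp() and of the log-sum-exp trick for doing “addition” on the log scale:

```python
import math

# Rough double-precision limits for exp(): it overflows for arguments
# above ~709.78 and underflows to exactly 0.0 somewhat below -745.
print(math.exp(709.0))   # still finite, ~8.2e307
print(math.exp(-750.0))  # underflows to 0.0
try:
    math.exp(710.0)
except OverflowError:
    print("exp(710) overflows")

# log-sum-exp: log(exp(a) + exp(b)) computed without overflowing,
# by factoring out the larger of the two exponents first.
def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

print(log_sum_exp(1000.0, 1000.0))  # ~1000.69; a naive exp(1000) would overflow
```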

To debug, start with just the prior, then add the data and likelihood components a bit at a time to see where things go wrong.

P.S. You can build the offset into the parameter declaration if you convert to an array, as the following model shows.

parameters {
  array[2] real<offset={9, -9}> param_unc;
}
transformed parameters {
  array[2] real<lower=0> param = exp(param_unc);
}
model {
  param_unc ~ std_normal();
}

Initialization also makes sense this way: Stan draws unconstrained initial values from uniform(-2, 2), so with these offsets the initial values will be uniform(7, 11) and uniform(-11, -7).
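
A quick plain-Python simulation of that initialisation behaviour (the offsets mirror the snippet above; Stan itself draws each unconstrained initial from uniform(-2, 2) before the offset is applied):

```python
import random

random.seed(1)

# Stan draws unconstrained initials from uniform(-2, 2); an offset o
# shifts the effective initial range to (o - 2, o + 2).
offsets = [9.0, -9.0]
inits = [o + random.uniform(-2.0, 2.0) for o in offsets]

print(inits)  # first value in (7, 11), second in (-11, -7)
```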

Thanks very much for the response @Bob_Carpenter. I’ll spend a bit of time preparing a stripped-back, digestible example of the model (the calculation of the mean is 330 lines of Stan functions-block code) and post again soon.

The width of that safe domain for exp() makes me think it might not be related to the transformations after all. I’ve spent the day evaluating gradients numerically outside of Stan for single-parameter examples, and everything seems reasonable to my eye, e.g. the graph of absolute gradients with respect to the square root of the sum of squares (the red line is the initialisation value):


optim() is able to navigate this fine, although the model I have been given (to refit to a new dataset) is quite far from the data, so the likelihoods at the prior means are very low. I’d still expect a finite gradient when initialising at those prior means, however.

Will post again soon, thanks again

It was exactly as you said @Bob_Carpenter, thank you so much for the suggestion. Stripping the model and data back to just the priors and incrementing through wasn’t as intimidating as I had anticipated either.

It turned out to be a division operation rather than a multiplication, but was solved in the same way.
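
For anyone who finds this thread later, a minimal plain-Python illustration (hypothetical numbers, not the actual model) of how a division can underflow to zero while the same quantity stays finite on the log scale:

```python
import math

# A ratio of a very small numerator to a very large denominator
# underflows to exactly 0.0 in double precision...
numerator = 1e-200
denominator = 1e200
ratio = numerator / denominator
print(ratio)  # 0.0 -- taking log(ratio), or autodiffing through it, gives -inf/NaN

# ...but subtracting logs keeps everything finite.
log_ratio = math.log(numerator) - math.log(denominator)
print(log_ratio)  # ~-921.03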

Once again, highly appreciated.