I am trying to use the integrate_1d function to integrate parameters out of my likelihood, but I get the error Exception: gradient_of_f: The gradient of f is nan for parameter 0.
I vastly simplified the model to only one parameter, but the problem persists. The code is as follows:
functions {
real pcm_likelihood(
real x, // Function argument
real xc, // Complement of function argument
real[] pars, // parameters
data real[] data_real, // data (real)
data int[] data_int // data (integer)
) {
int J = data_int[1];
int K = data_int[2];
int mu_index = 1;
vector[J + 1] item_prob;
vector[K] log_lik;
matrix[J, K] xmbetas = x - to_matrix(data_real, J, K);
for (k in 1:K) {
item_prob = log_softmax(
append_row(0, cumulative_sum(xmbetas[, k]))
);
log_lik[k] = item_prob[data_int[2 + k]];
}
print("pcm_likelihood x = ", x);
return exp(sum(log_lik) + normal_lpdf(x | pars[mu_index], 1));
}
}
data {
// Counts
int<lower=1> I; // No. persons
int<lower=1> K; // No. items
int<lower=1> J; // No. parameters per item
// Data
int<lower=1, upper=J+1> y[I, K]; // Data as Persons by Items Matrix
// Integration parameters
real tol;
// Item Parameters
real beta[K, J];
}
transformed data {
int<lower=1> integration_data[I, K + 2];
for (i in 1:I) {
integration_data[i] = append_array({ J, K }, to_array_1d(y[i,]));
}
}
parameters {
// Person Parameters
real mu_theta;
}
model {
// Priors
target += std_normal_lpdf(mu_theta);
print("mu_theta = ", mu_theta, " ----------------------------------");
// Likelihood
for (i in 1:I) {
real marg_lik = integrate_1d(
pcm_likelihood,
negative_infinity(),
positive_infinity(),
{ mu_theta },
to_array_1d(beta),
integration_data[i],
tol
);
print("marg_lik = ", marg_lik, " ----------------------------------");
target += log(marg_lik);
}
}
When I run this model with I=1, I get the following (truncated output):
Chain 1: mu_theta = -0.534418 ----------------------------------
pcm_likelihood x = 1.79769e+308
pcm_likelihood x = -1.79769e+308
pcm_likelihood x = 0
pcm_likelihood x = 3.08829
pcm_likelihood x = -3.08829
[...]
pcm_likelihood x = 87.715
pcm_likelihood x = -87.715
pcm_likelihood x = 124.21
pcm_likelihood x = -124.21
marg_lik = 0.0333858 ----------------------------------
Chain 1: mu_theta = -0.534418 ----------------------------------
pcm_likelihood x = 1.79769e+308
pcm_likelihood x = -1.79769e+308
pcm_likelihood x = 0
pcm_likelihood x = 3.08829
pcm_likelihood x = -3.08829
[...]
pcm_likelihood x = 87.715
pcm_likelihood x = -87.715
pcm_likelihood x = 124.21
pcm_likelihood x = -124.21
pcm_likelihood x = 1.79769e+308
Chain 1: Exception: gradient_of_f: The gradient of f is nan for parameter 0 (in 'modele66b9d35cc3_pcm_bs_marg_nobeta_discourse' at line 69)
So what happens, essentially, is that the integrator runs through cleanly (the computed integral is correct, by the way, so the integrator works and the function is defined correctly), then starts again with exactly the same parameter and the same nodes, and then tries to start once more (probably for autodiff?) at \approx +\infty and fails. I would be happy if someone could explain what is going on here.
From @nhuurre's suggestion in this thread I figured it could be log_softmax and friends. So I would like to use
for (k in 1:K) {
log_lik[k] = categorical_logit_lpmf(
data_int[2 + k] |
append_row(0, cumulative_sum(xmbetas[, k]))
);
}
But since I want to integrate over the whole real number line, I get an error saying that in categorical_logit_lpmf, probabilities can't be infinite (I don't recall the exact wording).
Following @bbbales2's suggestion in the aforementioned thread, I don't see what could go wrong with \displaystyle\frac{\partial}{\partial \mu_{\theta}}\int f(x, \mu_{\theta})\, dx,
since in my case, f(x, \mu_{\theta})=g(x)N(x;\mu_{\theta},1)
with g a fraction of sums of exponentials, so my gradient is \displaystyle\frac{\partial}{\partial \mu_{\theta}}\int f(x, \mu_{\theta}) dx=\int (x-\mu_{\theta}) f(x, \mu_{\theta}) dx,
which should be well behaved (?).
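The step is just differentiating under the integral sign and using the score of the normal, \frac{\partial}{\partial \mu_{\theta}} N(x; \mu_{\theta}, 1) = (x - \mu_{\theta})\,N(x; \mu_{\theta}, 1):
\displaystyle\frac{\partial}{\partial \mu_{\theta}}\int g(x)\,N(x; \mu_{\theta}, 1)\, dx = \int g(x)\,\frac{\partial}{\partial \mu_{\theta}} N(x; \mu_{\theta}, 1)\, dx = \int (x - \mu_{\theta})\, f(x, \mu_{\theta})\, dx.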
For now, I am using adaptive Gauss-Hermite quadrature, which seems to work but is incredibly slow, plus I think the Boost integrator has been shown to be optimal, so I'm hoping there is a way to make this work.
Other questions:
Is there another way to program my integrand pcm_likelihood without the use of categorical_logit_lpmf, or, alternatively, can I make categorical_logit_lpmf work with \infty?
How can I "debug" this better? How can I see what happens with the autodiff to find out where exactly it goes wrong?
Is there a way to provide an analytical derivative? Can I program my own integrated_categorical_lpmf and add the proper derivatives to avoid numerical differentiation?
Hmm this is probably some at infinity sorta problem yeah.
It happens that out at the limits of double precision you'll sometimes get NaNs in the gradients, but at the infinities, if the integral is going to exist, it had better go to zero.
We have a little logic to allow for this:
if (is_nan(gradient)) {
if (fx.val() == 0) {
gradient = 0;
} else {
throw_domain_error("gradient_of_f", "The gradient of f", n,
"is nan for parameter ", "");
}
}
I'm guessing what is happening is that the function is evaluating somewhere way out and is returning something very near zero but the gradient is going to NaN.
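Can you print the intermediate values inside pcm_likelihood? Something along these lines (just a sketch, adapt to your variable names; the braces are only there so the declarations parse in rstan 2.21):
{
  real normal_lpdf_val = normal_lpdf(x | pars[mu_index], 1);
  real retval = exp(sum(log_lik) + normal_lpdf_val);
  // print the abscissa, the per-item log-likelihoods, the normal term,
  // and the value actually returned to the integrator
  print("pcm_likelihood x = ", x,
        ", log_lik = ", log_lik,
        ", normal_lpdf_val = ", normal_lpdf_val,
        ", retval = ", retval);
  return retval;
}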
This way we know what the input and output are as everything explodes. Hopefully we get all the prints and Stan isn't eating them somewhere internally.
My guess is that at extremely large values normal_lpdf(x | ...) becomes negative infinity. exp(...) takes that to zero but cannot propagate the gradient correctly. You could try to cut off the bad derivative like this:
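Roughly (a sketch only, reusing the names from your integrand; the point is to return a plain constant so that no NaN gradient gets propagated):
{
  real retval = exp(sum(log_lik) + normal_lpdf(x | pars[mu_index], 1));
  if (retval == 0.0) {
    // the value is exactly zero, so return a literal constant instead of
    // the autodiff variable whose gradient may already be NaN
    return 0.0;
  }
  return retval;
}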
Oh wow, so it's evaluating something for the gradient it isn't evaluating for the main function. Can you turn your tolerances down and kick it off again to see if the behavior changes?
Thank you! I just tried your suggestion, and while it didn't work with if (retval == 0.0) {, I changed it to if (is_nan(retval)) { (which is what you meant, I guess?) and it finishes cleanly! That's great! I'm just wondering, how can we distinguish between -nan and +nan, i.e. how can we make sure there aren't fringe cases where this gives us bad values?
The first check is ignoring 0.0 values that lead to NaN gradients, which is safe enough that we do it in the code.
This one is ignoring NaNs in the values, which, under the assumption that this only happens out at crazy large values of the lpdf, seems fair.
You could make sure you aren't dropping NaNs you care about (ones that actually indicate problems) with an extra condition like if(is_nan(retval) && abs(pars[mu_index]) > 1e20)
Or maybe it makes more sense to just bound these integrals to some large range like -1e4, 1e4. I'm not sure if that's a bad idea with these quadratures or not. You could compute the full integral too. (Edit: Not sure that last comment makes sense; you could, like, integrate over [-1e5, 1e5] in generated quantities and compare against the [-1e4, 1e4] integral to make sure the cutoff is behaving fine.)
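Roughly like this, reusing the names from the simplified model above and checking just person 1 (untested sketch):
generated quantities {
  // compare a truncated integral against a wider one; if the cutoff at
  // +/- 1e4 is harmless, the two should agree to within the tolerance
  real marg_lik_narrow = integrate_1d(pcm_likelihood, -1e4, 1e4,
                                      { mu_theta }, to_array_1d(beta),
                                      integration_data[1], tol);
  real marg_lik_wide = integrate_1d(pcm_likelihood, -1e5, 1e5,
                                    { mu_theta }, to_array_1d(beta),
                                    integration_data[1], tol);
  real cutoff_check = fabs(marg_lik_wide - marg_lik_narrow);
}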
So I guess the nan comes from the log_softmax etc. rather than the "correct" -inf of the normal_lpdf, right?
I am using rstan_2.21.2 by the way. I know you guys have updated the integrator in the latest version of CmdStan, but I haven't tried it yet (since on the remote machine I am dependent on my sysadmin for updates, and locally I am on Big Sur, i.e. big trouble with everything regarding compilation).
Oh! 1.79769e+308+1.79769e+308 overflows so cumulative_sum() produces infinities and log_softmax() is all nan.
The exp(sum(log_lik)) part that comes from the softmax is always \leq 1, so x sufficiently far in the tails of the normal_lpdf will always yield a total of zero. You can put this at the start of the function:
if (abs(x - pars[mu_index]) > 1000) {
return 0.0;
}
Similar idea should work with categorical_logit: check at the start if the value of x is so far from reasonable that no calculation is needed to see that the result is zero and there's no need to worry about infinities.
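For example (a sketch only; names reused from the model at the top of the thread, and the cutoff of 1000 is arbitrary):
real pcm_likelihood(
  real x, real xc, real[] pars,
  data real[] data_real, data int[] data_int
) {
  int J = data_int[1];
  int K = data_int[2];
  vector[K] log_lik;
  matrix[J, K] xmbetas;
  // this far out in the tails the normal density forces the integrand to
  // zero, so bail out before categorical_logit_lpmf sees infinite logits
  if (fabs(x - pars[1]) > 1000) {
    return 0.0;
  }
  xmbetas = x - to_matrix(data_real, J, K);
  for (k in 1:K) {
    log_lik[k] = categorical_logit_lpmf(
      data_int[2 + k] | append_row(0, cumulative_sum(xmbetas[, k]))
    );
  }
  return exp(sum(log_lik) + normal_lpdf(x | pars[1], 1));
}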
Chain 1: mu_theta = 1.90732 ----------------------------------
pcm_likelihood x = 1.79769e+308, log_lik = [-inf,-nan], normal_lpdf_val = -inf, retval = -nan
pcm_likelihood x = -1.79769e+308, log_lik = [0,-inf], normal_lpdf_val = -inf, retval = 0
[...]
pcm_likelihood x = 104.108, log_lik = [-416.331,-206.865], normal_lpdf_val = -5223.38, retval = 0
pcm_likelihood x = -104.108, log_lik = [0,-207.715], normal_lpdf_val = -5620.51, retval = 0
marg_lik = 0.0029028 ----------------------------------
Chain 1: mu_theta = 1.90732 ----------------------------------
pcm_likelihood x = 1.79769e+308, log_lik = [-inf,-nan], normal_lpdf_val = -inf, retval = -nan
pcm_likelihood x = -1.79769e+308, log_lik = [0,-inf], normal_lpdf_val = -inf, retval = 0
[...]
pcm_likelihood x = 1.79769e+308, log_lik = [-nan,-nan], normal_lpdf_val = -inf, retval = -nan
Chain 1: Exception: gradient_of_f: The gradient of f is nan for parameter 0 (in 'modele66b52768eeb_pcm_bs_marg_nobeta_discourse' at line 81)
It looks like in the first 1.79769e+308 evaluation we get [-inf,-nan] and in the second we get [-nan,-nan]. For x=\infty, the likelihood contribution of the first k should be 0 and the second 1, so I understand where the -inf in [-inf, -nan] comes from, but not the -nan (it should be \log(1)=0).
Of course, this is a great suggestion! This works, even in the big model (I=1000, K=10, J=4 + mixture). However, it is insanely slow, even slower than the adaptive Gauss Hermite quadrature:
AGH:
Chain 1: Gradient evaluation took 1.20227 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 12022.7 seconds.
Chain 1: Adjust your expectations accordingly!
integrate_1d:
Chain 1: Gradient evaluation took 86.1009 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 861009 seconds.
Chain 1: Adjust your expectations accordingly!
Do you have any suggestion how this could be sped up?
Can you write out the Math for the integral you're trying to evaluate? Not clear to me looking at the code but maybe @nhuurre can see something in the Eqs.
A large part of the integrand (everything that does not depend on the response pattern y[i,]) is needlessly computed for every i \in 1:I. So I basically copy-pasted the whole sinh-sinh integrator from the Boost GitHub, dropping everything unnecessary for my specific function and further employing a cutoff criterion following @nhuurre's suggestion.
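For reference, as I understand the Boost implementation, the sinh-sinh rule is the double-exponential substitution
\int_{-\infty}^{\infty} f(x)\,dx \approx h \sum_k w_k\, f(x_k), \qquad x_k = \sinh\!\big(\tfrac{\pi}{2}\sinh(kh)\big), \quad w_k = \tfrac{\pi}{2}\cosh(kh)\,\cosh\!\big(\tfrac{\pi}{2}\sinh(kh)\big),
with the step size h halved at each refinement level; the abscissas and weights passed in as data in the model below play the role of the x_k and w_k.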
I managed to bring it down from
1000 transitions using 10 leapfrog steps per transition would take 924607 seconds.
to
1000 transitions using 10 leapfrog steps per transition would take 1982 seconds.
so a couple hundred times faster, even though still very slow (without integrating out the person parameters this model takes about 2 hours for 2000 iter/chain, compared to a couple of days now).
Since I have a lot of cores at my disposal, I want to try to implement reduce_sum now, maybe that brings along another big acceleration.
Thanks again to both of you for your help. I have one more question that came up, which I will investigate empirically, but I would be curious about your opinion too:
When testing my accelerated quadrature, I picked out some specific log-lik terms, and every time the difference from the integrate_1d result turned out to be zero. However, there must be some slight differences, since the root mean square error of the whole 1000-vector of differences between the likelihood terms of my integrator and integrate_1d was always in the 5e-14 range. On the other hand, benchmarks say I can double the speed again with only 5 levels of refinement, but then the RMSE of the differences between my integrator and integrate_1d grows to about 1.5e-6.
Since the first is slightly above machine epsilon, I was wondering: what level of precision still matters for Stan? It probably depends heavily on the model and the data, but do you have an intuition where you say "okay, after x significant digits it doesn't really matter anymore"?
It is good that this had a high payoff, but can you talk a bit about how it was difficult/not possible to do it with integrate_1d? Ideally integrate_1d would just work and there would be no need to send people off to implement their own sinh-sinh quadratures :P.
If you have y \sim N(f, \sigma) and f is the thing you're computing, my intuition is you're probably okay as long as the scale of the error in f is much less than either the difference between f and y or \sigma.
The other thing you can do is compute the log density with the high-precision calculation, compute it with the low-precision approximation, and use the PSIS stuff in the loo package to check whether an importance sampling correction is possible. The idea is that if p_{high}/p_{low} is close to 1.0 then the importance sampling will work, so you can just sample using the low-precision model, compute the log density of the high-precision model in generated quantities, check the ratios, and then resample according to the importance sampling weights. Programmatically it gets a little awkward, but it's a way to check if you're off, and to do a correction if possible, that should be pretty hands-off.
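In symbols, for each posterior draw \theta_s from the low-precision model the importance weight is
w_s \;\propto\; \frac{p_{\text{high}}(\theta_s \mid y)}{p_{\text{low}}(\theta_s \mid y)} \;\propto\; \exp\!\big(\log p_{\text{high}}(\theta_s, y) - \log p_{\text{low}}(\theta_s, y)\big),
and the Pareto \hat{k} diagnostic from PSIS tells you whether these weights are stable enough for the resampling step to be trusted.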
Ask if you want more details on the second thing @jtimonen
After using the suggestions above, I managed to make it work and in the end it was just a matter of speed. With integrate_1d I broke off the estimation with 2 chains and 2000 iter/chain after a week or so, because it never reached 200/2000 iterations. The implementation I have now finishes in 1.5h. I would be curious how fast it will go when I reduce_sum the model with integrate_1d and what will happen in comparison to my implementation with reduce_sum.
Just some empirical results here: I estimated my model with my implementation with varying levels of refinement, and I noticed that as soon as I have >= 5 levels, my posterior means don't change any more. This is in contrast to the maximum of 7 levels that the Boost integrator uses.
Interestingly, using 5 levels gets me errors (i.e. |I_1 - I_0|/||f_0|| as implemented in the Boost function; I know we talked about this somewhere else, and I haven't managed to contact the Boost guys with my question yet) as big as \approx 0.006, which corresponds to setting the tolerance to 0.006 and using 7 levels in integrate_1d (edit: of course it does).
Lastly, thank you for the suggestion; this seems really interesting, and I might come back to it in the future when I have more time to get into the details of this approach.
Well, if you can get a suitable solution and it is this much faster than integrate_1d, it makes me wonder if we're doing something wrong in integrate_1d.
Was your solution to code up a custom sinh_sinh quadrature in Stan, using the Boost implementation as reference? Can you share this?
I don't think there is anything wrong with the implementation in Stan, it's just that I have a likelihood with 1000 factors each consisting of two integrals in a mixture, which means I have to compute 2000 integrals per iteration. At the core, the likelihood is a categorical_lpmf and what I did is just compute the probabilities at every abscissa value of the integrator beforehand and then use the loops to only pick the correct probability based on the data out of the pre-computed probabilities.
Also, from the printing I did above, I know that the integrator works as intended. The only thing I don't understand is the number of abscissas the Boost quadrature takes at every refinement level, because I don't understand the function get_abscissa_row() in the Boost code, but I don't speak enough C++ to figure this out.
Maybe there is a way to do this much faster with the on-board tools, but I didn't see it. The code is rather long, but go ahead if you want to take a look. This is an old version; I have improved it some more since then, but the effect is in the millisecond range now.
data {
// Counts
int<lower=1> I; // No. persons
int<lower=1> K; // No. items
int<lower=1> J; // No. parameters per item
int<lower=1> C; // No. Classes
// Data
int<lower=1, upper=J+1> y[I, K]; // Data as Persons by Items Matrix
// Priors
// Mixture Parameters
int<lower=1> lambda_prior;
// Scale Parameter
real scale_prior_mu;
real<lower=0> scale_prior_sigma;
// Integration parameters
int N_levels;
int max_N_nodes;
vector[max_N_nodes] abscissas[N_levels];
vector[max_N_nodes] weights[N_levels];
int node_indices[N_levels];
real tol;
}
transformed data {
int<lower=J> Jp1 = J + 1;
row_vector[K] zerovec_K = rep_row_vector(0, K);
row_vector[C] zerovec_C = rep_row_vector(0, C);
int max_node_indices = max(node_indices);
real log_half_pi = log(pi()) - log2();
}
parameters {
// Mixture Parameters
simplex[C] lambda;
// Item Parameters
matrix[J, K] beta[C];
// Person Parameters
ordered[C] mu_theta;
vector<lower=0>[C] sigma_theta;
}
transformed parameters {
vector[I] log_lik;
{
matrix[Jp1, K] item_probs_pos[C, N_levels, max_node_indices];
matrix[Jp1, K] item_probs_neg[C, N_levels, max_node_indices];
matrix[Jp1, K] item_probs_zero[C];
vector[max_node_indices] log_dnorm_vals_pos[C, N_levels];
vector[max_node_indices] log_dnorm_vals_neg[C, N_levels];
row_vector[C] log_dnorm_vals_zero;
int nonzero_node_indices_pos[C, N_levels, max_node_indices];
int nonzero_node_indices_neg[C, N_levels, max_node_indices];
for (c in 1:C) {
item_probs_zero[c] = append_row(zerovec_K, -beta[c]);
for (k in 1:K) {
item_probs_zero[c][, k] = log_softmax(cumulative_sum(item_probs_zero[c][, k]));
}
log_dnorm_vals_zero[c] = normal_lpdf(0 | mu_theta[c], sigma_theta[c]);
for (n in 1:N_levels) {
for (eval_index in 1:node_indices[n]) {
// print("eval_index = ", eval_index, ", dnormarg_pos = ", fabs((abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c]));
if (fabs((abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c]) < 100) {
item_probs_pos[c, n, eval_index] = append_row(zerovec_K, abscissas[n][eval_index] - beta[c]);
for (k in 1:K) {
item_probs_pos[c, n, eval_index][, k] = log_softmax(cumulative_sum(item_probs_pos[c, n, eval_index][, k]));
}
log_dnorm_vals_pos[c, n, eval_index] = normal_lpdf(abscissas[n][eval_index] | mu_theta[c], sigma_theta[c]);
nonzero_node_indices_pos[c, n, eval_index] = 1;
} else {
nonzero_node_indices_pos[c, n, eval_index] = 0;
}
}
for (eval_index in 1:node_indices[n]) {
// print("eval_index = ", eval_index, ", dnormarg_neg = ", fabs((abscissas[n][eval_index] + mu_theta[c]) / sigma_theta[c]));
if (fabs((abscissas[n][eval_index] + mu_theta[c]) / sigma_theta[c]) < 100) {
item_probs_neg[c, n, eval_index] = append_row(zerovec_K, -abscissas[n][eval_index] - beta[c]);
for (k in 1:K) {
item_probs_neg[c, n, eval_index][, k] = log_softmax(cumulative_sum(item_probs_neg[c, n, eval_index][, k]));
}
log_dnorm_vals_neg[c, n, eval_index] = normal_lpdf(-abscissas[n][eval_index] | mu_theta[c], sigma_theta[c]);
nonzero_node_indices_neg[c, n, eval_index] = 1;
} else {
nonzero_node_indices_neg[c, n, eval_index] = 0;
}
}
}
}
for (i in 1:I) {
row_vector[C] int_by_group = log_dnorm_vals_zero;
// real val;
for (c in 1:C) {
for (k in 1:K) {
int_by_group[c] += item_probs_zero[c][y[i, k], k];
}
// val = exp(int_by_group[c]);
// print(i, ",", c, ",", 0, ",", -mu_theta[c] / sigma_theta[c], ",", log_dnorm_vals_zero[c], ",", val);
int_by_group[c] = exp(int_by_group[c] + log_half_pi);
for (n in 1:2) {
row_vector[node_indices[n]] this_f_eval_pos;
row_vector[node_indices[n]] this_f_eval_neg;
for (eval_index in 1:node_indices[n]) {
if (nonzero_node_indices_pos[c, n, eval_index]) {
real this_log_f_evals_pos = log_dnorm_vals_pos[c, n, eval_index];
for (k in 1:K) {
this_log_f_evals_pos += item_probs_pos[c, n, eval_index][y[i, k], k];
}
// print(i, ",", c, ",", abscissas[n][eval_index], ",", ( abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c], ",", log_dnorm_vals_pos[c, n, eval_index], ",", exp(this_log_f_evals_pos));
this_f_eval_pos[eval_index] = exp(this_log_f_evals_pos);
} else {
this_f_eval_pos[eval_index] = 0;
}
}
for (eval_index in 1:node_indices[n]) {
if (nonzero_node_indices_neg[c, n, eval_index]) {
real this_log_f_evals_neg = log_dnorm_vals_neg[c, n, eval_index];
for (k in 1:K) {
this_log_f_evals_neg += item_probs_neg[c, n, eval_index][y[i, k], k];
}
// print(i, ",", c, ",", -abscissas[n][eval_index], ",", (-abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c], ",", log_dnorm_vals_neg[c, n, eval_index], ",", exp(this_log_f_evals_neg));
this_f_eval_neg[eval_index] = exp(this_log_f_evals_neg);
} else {
this_f_eval_neg[eval_index] = 0;
}
}
int_by_group[c] += (this_f_eval_pos + this_f_eval_neg) * weights[n][1:node_indices[n]];
}
int_by_group[c] /= 2;
for (n in 3:N_levels) {
row_vector[node_indices[n]] this_f_eval_pos;
row_vector[node_indices[n]] this_f_eval_neg;
real I0 = int_by_group[c];
int_by_group[c] /= 2;
for (eval_index in 1:node_indices[n]) {
if (nonzero_node_indices_pos[c, n, eval_index]) {
real this_log_f_evals_pos = log_dnorm_vals_pos[c, n, eval_index];
for (k in 1:K) {
this_log_f_evals_pos += item_probs_pos[c, n, eval_index][y[i, k], k];
}
// print(i, ",", c, ",", abscissas[n][eval_index], ",", ( abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c], ",", log_dnorm_vals_pos[c, n, eval_index], ",", exp(this_log_f_evals_pos));
this_f_eval_pos[eval_index] = exp(this_log_f_evals_pos);
} else {
this_f_eval_pos[eval_index] = 0;
}
}
for (eval_index in 1:node_indices[n]) {
if (nonzero_node_indices_neg[c, n, eval_index]) {
real this_log_f_evals_neg = log_dnorm_vals_neg[c, n, eval_index];
for (k in 1:K) {
this_log_f_evals_neg += item_probs_neg[c, n, eval_index][y[i, k], k];
}
// print(i, ",", c, ",", -abscissas[n][eval_index], ",", (-abscissas[n][eval_index] - mu_theta[c]) / sigma_theta[c], ",", log_dnorm_vals_neg[c, n, eval_index], ",", exp(this_log_f_evals_neg));
this_f_eval_neg[eval_index] = exp(this_log_f_evals_neg);
} else {
this_f_eval_neg[eval_index] = 0;
}
}
int_by_group[c] += (this_f_eval_pos + this_f_eval_neg) * weights[n][1:node_indices[n]] / 2^(n - 1);
if (fabs(I0 - int_by_group[c]) < int_by_group[c] * tol) {
break;
}
}
}
log_lik[i] = log(int_by_group * lambda); //sum(int_by_group); //
}
}
}
model {
// Priors
for (c in 1:C) {
real beta_sum = 0;
target += std_normal_lpdf(to_vector(beta[c]));
beta_sum += sum(beta[c]);
target += normal_lpdf(beta_sum | 0, 0.01 * J * K);
}
target += dirichlet_lpdf(lambda | rep_vector(lambda_prior, C));
target += std_normal_lpdf(mu_theta);
target += lognormal_lpdf(sigma_theta | scale_prior_mu, scale_prior_sigma);
target += sum(log_lik);
}