Wrong gradient in updated stan -- unsure if rstan or stan problem

This came up when testing pre-release rstan 2.26. I flagged it in this issue: https://github.com/stan-dev/rstan/issues/899#issuecomment-774483790 but it doesn't seem to have captured much interest. The gradient of the model is wrong, but the log prob is fine. Does cmdstanr have any way to check this? The model, data, and an rstan script to compare the gradients are here:

https://github.com/stan-dev/rstan/files/5936956/gradproblem.zip

edit: on my two Windows test systems the gradient is obviously wrong, not simply a little different.


Are you able to reduce the model at all? I’ve been able to produce the correct gradients from (simpler) models with the experimental rstan:

library(rstan)
library(numDeriv)

mod = "data{
  real y_mean;
} 
parameters{
  real y;
  matrix[2,2] test_mat;
}
transformed parameters{
  real trans_y = sum(matrix_exp(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}"

samp = stan(model_code=mod,data=list(y_mean=0))

g=grad_log_prob(samp,c(0,1,2,3,4))
gfunc = function(x) log_prob(samp,x)
gnum = grad(func = gfunc,x = c(0,1,2,3,4))

Which gives:

> g
[1]  0.0000000 -0.2811593 -0.5304334 -0.3845396 -0.7188407
attr(,"log_prob")
[1] -5.998649
> gnum
[1]  0.0000000 -0.2811593 -0.5304334 -0.3845396 -0.7188407

This makes me think it's likely a specific function that changed from 2.21 → 2.26, rather than Stan/RStan more generally. But since your model has so many moving parts, it's a bit hard to see where the problem is coming from. Is there a simpler version of the model that still gives wrong gradients?


Yes, I spent a few hours previously chasing this weirdness down, only to have it 'magically' disappear -- it 'came back' and appears to be staying, so I wanted to check whether others see the same thing. I assume you see the same problem when running my script? You're right that it's definitely not a problem in every, or even most, cases. I don't mind doing some more digging, but since it has this weird disappearing flavour to it, I was hoping someone might be able to check it with cmdstan too…

Yeah I’m seeing the same behaviour with your model.

Actually, looking at the differences between the grad_log_prob and numerical gradients from your model, it's only the last 10 parameters that really differ:

> round(gnum,5) - round(g,5)
 [1]  0.00000  0.00000  0.00000 -0.00040  0.00000  0.00000  0.00000  0.00001  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  4.64357  4.41527  0.17354 -0.32280  4.17967  0.90057
[21] -0.42172 -0.01298  0.75311  0.00881

Which parameter(s) in the model are those from?


edit: removed incorrect conclusions re continuoustime==1 section.

the first 14 parameters are the population means; the latter 10 relate to the covariance matrix of the subject-level parameters (which are themselves integrated out)

Huh, interesting. I won't be able to have a proper look until Wednesday, sorry, but knowing that the gradients are incorrect for rawpopsdbase and sqrtpcov only when continuoustime == 1 should help narrow things down a fair bit.

If you’re also able to do some testing and try simplifying the model, that would also be a great help

OK, I was wrong -- it somehow relates to filling the rawpopc array of matrices (in transformed parameters) and later getting a matrix out of that array. If I make rawpopc just a matrix (i.e. not an array), it is fine. Posting progress in case it gives someone an idea of what may have changed, but I guess I'll track it down soon enough…

Definitely something going wrong in these 2 lines:

if(si==0 || (sum(whenmat[8,]) + statedep[8]) > 0 ) {
  if(intoverpop && nindvarying > 0) T0VAR[intoverpopindvaryingindex, intoverpopindvaryingindex] = rawpopc[1];

If I remove the first condition check (it's only there for efficiency and doesn't change the log prob in this case), or just reference a plain matrix rawpopc instead of the first matrix in the rawpopc array (again, without changing the log prob), there are no problems. I can't reproduce it in a MWE so far, though…

Not sure I follow what works and what does not.

Can you give an example? I might have time tomorrow to take a look.

OK. The attached zip includes a gradient-checking script, data, and 3 models. The base model is the only one showing the gradient issue; the 'nocheck' model skips the first line I mentioned in the previous message, and the 'noarray' model changes rawpopc from an array of matrices to just a matrix. Hopefully it's clear enough from the script.


Commenting out

if(si==0 || (sum(whenmat[8,]) + statedep[8]) > 0 ) {

is the most reliable way to avoid the gradient issue. That line is only there for efficiency and is not that important. Reverting to multiple matrices instead of one array doesn't seem to solve it reliably; I don't really get it. I've spent too much time on this already, so I've given up for now -- just updating in case someone gets into it :)


If any of si, whenmat, or statedep depends on model parameters, then that line might introduce a discontinuity in the gradient?

they are all data or local integers…


Thanks for finding that, it massively narrowed things down. The issue is with the cholesky_decompose function. There are two different algorithms used, depending on whether the matrix is larger or smaller than 35x35, and both appear to be off.

You can verify with the following (and please do, to make sure that it’s not just something off with my install):

library(rstan)
library(numDeriv)

mod = "
data{
  int mat_size;
  real y_mean;
} 
parameters{
  real y;
}
transformed parameters{
  matrix[mat_size,mat_size] test_mat = diag_matrix(rep_vector(y,mat_size));
  real trans_y = sum(cholesky_decompose(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}"


samp_mat4 = stan(model_code=mod,data=list(y_mean=0,mat_size=4),chains=0)
samp_mat40 = stan(model_code=mod,data=list(y_mean=0,mat_size=40),chains=0)

g_mat4=grad_log_prob(samp_mat4,c(5))
g_mat40=grad_log_prob(samp_mat40,c(5))

gnum_mat4 = grad(func = function(x) log_prob(samp_mat4,x),x = c(5))
gnum_mat40 = grad(func = function(x) log_prob(samp_mat40,x),x = c(5))

g_mat4[1]
gnum_mat4
g_mat40[1]
gnum_mat40

Under RStan 2.21:

> g_mat4[1]
[1] -0.13125
> gnum_mat4
[1] -0.13125
> g_mat40[1]
[1] -0.1003125
> gnum_mat40
[1] -0.1003125

Under RStan 2.26:

> g_mat4[1]
[1] -0.0171875
> gnum_mat4
[1] -0.13125
> g_mat40[1]
[1] -0.002492188
> gnum_mat40
[1] -0.1003125

@rok_cesnovar or @stevebronder The cholesky_decompose rev C++ is well over my head; would one of you be able to chase this down further? It looks like the gradients changed from 2.21 → 2.26


Big oooof ooof thanks for tracking this down @Charles_Driver and @andrjohns. Will get on it


Nice… though I'm still confused, because (a) without the conditional check the cholesky function gets called nsubjects times instead of once (and things 'seemed' fine in that case), and (b) the cholesky function is only generating output parameters, not anything relevant for the log prob. Though I guess if there is wrong code in there it could be doing anything…

Can you check to see if this model exhibits the same problems:

data{
  int mat_size;
  real y_mean;
} 
parameters{
  real y;
}
transformed parameters{
  matrix[mat_size,mat_size] test_mat;
  for(i in 1:mat_size) {
    for(j in 1:mat_size) {
      test_mat[i, j] = 0.0;
    }
  }

  for(i in 1:mat_size) {
    test_mat[i, i] = 1.0 * y;
  }

  real trans_y = sum(cholesky_decompose(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}

I made a test case in C++ and I'm seeing a difference in behavior between:

Eigen::MatrixXd diag = Eigen::VectorXd::Ones(4).asDiagonal();
auto m = stan::math::multiply(y, diag);

and

auto m = stan::math::eval(stan::math::diag_matrix(stan::math::rep_vector(y, 4)));

so I’m wondering if the new cholesky_decompose is making a bad assumption about the input varis or something.

Actually, make building the diagonal look like:

for(i in 1:mat_size) {
  test_mat[i, i] = test_mat[i, i] + y;
}

(the point here is to make sure every input var to the cholesky_decompose has its own vari)

Thanks @Charles_Driver @andrjohns and @bbbales2 for tracking this one down!

Once we have a fix in for this, we will do a patch release (2.26.1). RStan hasn't been released yet, so it will include the fix as well.
