Wrong gradient in updated stan -- unsure if rstan or stan problem

This came up when testing pre-release rstan 2.26. I flagged it in this issue: https://github.com/stan-dev/rstan/issues/899#issuecomment-774483790 but it doesn't seem to have captured much interest. The gradient of the model is wrong, but the log prob is fine. Does cmdstanr have any way to check this? The model, data, and an rstan script to compare the gradients are here:

https://github.com/stan-dev/rstan/files/5936956/gradproblem.zip

edit: on my two Windows test systems the gradient is obviously wrong, not simply a little different.


Are you able to reduce the model at all? I’ve been able to produce the correct gradients from (simpler) models with the experimental rstan:

library(rstan)
library(numDeriv)

mod = "data{
  real y_mean;
} 
parameters{
  real y;
  matrix[2,2] test_mat;
}
transformed parameters{
  real trans_y = sum(matrix_exp(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}"

samp = stan(model_code=mod,data=list(y_mean=0))

g=grad_log_prob(samp,c(0,1,2,3,4))
gfunc = function(x) log_prob(samp,x)
gnum = grad(func = gfunc,x = c(0,1,2,3,4))

Which gives:

> g
[1]  0.0000000 -0.2811593 -0.5304334 -0.3845396 -0.7188407
attr(,"log_prob")
[1] -5.998649
> gnum
[1]  0.0000000 -0.2811593 -0.5304334 -0.3845396 -0.7188407

This makes me think it's likely a specific function that changed from 2.21 → 2.26, rather than Stan/RStan more generally. But since your model has so many moving parts, it's a bit hard to see where the problem is coming from. Is there a simpler version of the model that still gives wrong gradients?


Yes, I spent a few hours previously chasing this weirdness down, only to have it 'magically' disappear -- it 'came back' and appears to be staying, so I wanted to check whether others see the same thing. I assume you see the same problem when running my script? You're right that it's definitely not a problem in every, or even most, cases. I don't mind doing some more digging, but since it has this weird disappearing flavour to it, I was hoping someone might be able to check it with cmdstan too…

Yeah I’m seeing the same behaviour with your model.

Actually, looking at the differences between the grad_log_prob and numerical gradients from your model, it's only the last 10 parameters that really differ:

> round(gnum,5) - round(g,5)
 [1]  0.00000  0.00000  0.00000 -0.00040  0.00000  0.00000  0.00000  0.00001  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  4.64357  4.41527  0.17354 -0.32280  4.17967  0.90057
[21] -0.42172 -0.01298  0.75311  0.00881

Which parameter(s) in the model are those from?


edit: removed incorrect conclusions re continuoustime==1 section.

the first 14 parameters are the population means; the latter 10 relate to the covariance matrix of the subject-level parameters (which are themselves integrated out)

Huh, interesting. I won't be able to have a proper look until Wednesday, sorry, but knowing that the gradients are incorrect for rawpopsdbase and sqrtpcov only when continuoustime == 1 should help narrow things down a fair bit.

If you’re also able to do some testing and try simplifying the model, that would also be a great help

OK, I was wrong -- it somehow relates to filling the rawpopc array of matrices (in transformed parameters) and later getting a matrix out of that array. If I make rawpopc just a matrix (i.e. not an array), it is fine. Posting progress in case it gives someone an idea of what may have changed, but I guess I'll track it down soon enough…

Definitely something going wrong in these 2 lines:

if(si==0 || (sum(whenmat[8,]) + statedep[8]) > 0 ) {
  if(intoverpop && nindvarying > 0) T0VAR[intoverpopindvaryingindex, intoverpopindvaryingindex] = rawpopc[1];

If I remove the first condition check (it's only there for efficiency and doesn't change the log prob in this case), or just reference a plain matrix rawpopc instead of the first matrix in the rawpopc array (again, without changing the log prob), there are no problems. I can't reproduce it in a MWE so far, though…

Not sure I follow what works and what does not.

Can you give an example? I might have time tomorrow to take a look.

OK. The attached zip includes a gradient-checking script, data, and 3 models. The base model is the only one showing the gradient issue; the 'nocheck' model skips the first line I mentioned in the previous message, and the 'noarray' model changes rawpopc from an array of matrices to just a matrix. Hopefully it's clear enough from the script.


Commenting out

if(si==0 || (sum(whenmat[8,]) + statedep[8]) > 0 ) {

is the most reliable way to avoid the gradient issue. That line is only there for efficiency and is not that important. Reverting to multiple matrices instead of one array doesn't seem to solve it reliably; I don't really get it. I've spent too much time on this already, so I've given up for now -- just updating in case someone gets into it :)


If any of si, whenmat, or statedep depends on model parameters, then that line might introduce a discontinuity in the gradient?

they are all data or local integers…


Thanks for finding that, it massively narrowed things down. The issue is with the cholesky_decompose function. There are two different algorithms used, depending on whether the matrix is larger or smaller than 35x35, and both appear to be off.

You can verify with the following (and please do, to make sure that it’s not just something off with my install):

library(rstan)
library(numDeriv)

mod = "
data{
  int mat_size;
  real y_mean;
} 
parameters{
  real y;
}
transformed parameters{
  matrix[mat_size,mat_size] test_mat = diag_matrix(rep_vector(y,mat_size));
  real trans_y = sum(cholesky_decompose(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}"


samp_mat4 = stan(model_code=mod,data=list(y_mean=0,mat_size=4),chains=0)
samp_mat40 = stan(model_code=mod,data=list(y_mean=0,mat_size=40),chains=0)

g_mat4=grad_log_prob(samp_mat4,c(5))
g_mat40=grad_log_prob(samp_mat40,c(5))

gnum_mat4 = grad(func = function(x) log_prob(samp_mat4,x),x = c(5))
gnum_mat40 = grad(func = function(x) log_prob(samp_mat40,x),x = c(5))

g_mat4[1]
gnum_mat4
g_mat40[1]
gnum_mat40

Under RStan 2.21:

> g_mat4[1]
[1] -0.13125
> gnum_mat4
[1] -0.13125
> g_mat40[1]
[1] -0.1003125
> gnum_mat40
[1] -0.1003125

Under RStan 2.26:

> g_mat4[1]
[1] -0.0171875
> gnum_mat4
[1] -0.13125
> g_mat40[1]
[1] -0.002492188
> gnum_mat40
[1] -0.1003125

@rok_cesnovar or @stevebronder The cholesky_decompose rev C++ is well over my head; would one of you be able to chase this down further? It looks like the gradients changed from 2.21 → 2.26


Big oooof ooof thanks for tracking this down @Charles_Driver and @andrjohns. Will get on it


Nice… though I'm still confused, because (a) without the conditional check the cholesky function gets called nsubjects times instead of once (and things 'seemed' fine in that case), and (b) the cholesky function is only generating output parameters, not anything relevant for the log prob. Though I guess if there is wrong code in there it could be doing anything…

Can you check to see if this model exhibits the same problems:

data{
  int mat_size;
  real y_mean;
} 
parameters{
  real y;
}
transformed parameters{
  matrix[mat_size,mat_size] test_mat;
  for(i in 1:mat_size) {
    for(j in 1:mat_size) {
      test_mat[i, j] = 0.0;
    }
  }

  for(i in 1:mat_size) {
    test_mat[i, i] = 1.0 * y;
  }

  real trans_y = sum(cholesky_decompose(test_mat));
}
model{ 
  y ~ normal(y_mean,trans_y);
}

I made a test case in C++ and I'm seeing a difference in behavior between:

Eigen::MatrixXd diag = Eigen::VectorXd::Ones(4).asDiagonal();
auto m = stan::math::multiply(y, diag);

and

auto m = stan::math::eval(stan::math::diag_matrix(stan::math::rep_vector(y, 4)));

so I’m wondering if the new cholesky_decompose is making a bad assumption about the input varis or something.

Actually, make building the diagonal look like:

for(i in 1:mat_size) {
  test_mat[i, i] = test_mat[i, i] + y;
}

(the point here is to make sure every input var to the cholesky_decompose has its own vari)

Thanks @Charles_Driver @andrjohns and @bbbales2 for tracking this one down!

Once we have a fix in for this, we will do a patch release (2.26.1). RStan hasn't been released yet, so it will include the fix as well.
