This came up when testing the pre-release rstan 2.26. I flagged it in this issue: https://github.com/stan-dev/rstan/issues/899#issuecomment-774483790 but it doesn’t seem to have attracted interest. The gradient of the model is wrong, but the log prob is fine. Does cmdstanr have any way to check this? The model, data, and rstan script to compare the gradients are here:
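(For anyone wanting an independent check: CmdStan itself ships a `diagnose` method that compares the reverse-mode autodiff gradient against a finite-difference gradient at the initial parameter values. A sketch of the invocation, with placeholder paths, assuming the model has already been compiled with CmdStan:)

```
# Placeholder paths: compile the model with CmdStan first (e.g. `make path/to/model`).
# `diagnose` evaluates the autodiff gradient and a finite-difference gradient at the
# initial values and reports any components whose difference exceeds the error threshold.
./model diagnose data file=model_data.json
```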
This makes me think it’s likely a specific function that’s changed from 2.21 → 2.26, rather than Stan/RStan. But since your model has so many moving parts, it’s a bit hard to see where the problem is coming from. Is there a simpler version of the model that still gives wrong gradients?
Yes, I previously spent a few hours chasing this weirdness down, only to have it ‘magically’ disappear. It ‘came back’ and appears to be staying, so I wanted to check that others see the same thing. I assume you see the same problem when running my script? You’re right that it’s definitely not a problem in every, or even most, cases. I don’t mind doing some more digging, though since it has this weird disappearing flavour to it, I was hoping someone might be able to check it with cmdstan too…
Yeah I’m seeing the same behaviour with your model.
Actually, looking at the differences between grad_log_prob and the numerical gradients for your model, it’s only the last 10 parameters that really differ:
edit: removed incorrect conclusions re the continuoustime==1 section.
The first 14 parameters are the popmeans; the latter 10 relate to the covariance matrix of the subject-level parameters (which are themselves integrated out).
Huh, interesting. I won’t be able to have a proper look until Wednesday, sorry, but knowing that the gradients are incorrect for rawpopsdbase and sqrtpcov only when continuoustime == 1 should help narrow things down a fair bit.
If you’re also able to do some testing and try simplifying the model, that would be a great help.
OK, I was wrong; it somehow relates to filling the rawpopc array of matrices (in transformed parameters) and later getting a matrix out of that. If I make rawpopc just a matrix (i.e. not an array), it is fine. Posting progress in case it gives someone an idea of what may have changed, but I guess I’ll track it down soon enough…
If I remove the first condition check (which is more efficient but doesn’t change the log prob in this case), or just reference a matrix rawpopc instead of the first matrix in the rawpopc array (again, without changing the log prob), there are no problems. I can’t reproduce it in an MWE so far though…
OK. The attachment includes a gradient-checking script, data, and 3 models. The base model is the only one showing the gradient issue. The ‘nocheck’ model skips the first line I mentioned in the previous message, and the ‘noarray’ model changes rawpopc from an array of matrices to a single matrix. Hopefully it’s clear enough from the script.
is the most reliable way to avoid the gradient issue. That line is only there for efficiency and is not that important. Reverting to multiple matrices instead of one array doesn’t seem to solve it reliably; I don’t really get it. I’ve spent too much time on this already, so I’ve given up for now; just updating in case someone gets into it :)
Thanks for finding that, it massively narrowed things down. The issue is with the cholesky_decompose function. There are two different algorithms used, depending on whether the matrix is larger or smaller than 35x35, and both appear to give wrong gradients.
You can verify with the following (and please do, to make sure that it’s not just something off with my install):
library(rstan)
library(numDeriv)
mod = "
data{
int mat_size;
real y_mean;
}
parameters{
real y;
}
transformed parameters{
matrix[mat_size,mat_size] test_mat = diag_matrix(rep_vector(y,mat_size));
real trans_y = sum(cholesky_decompose(test_mat));
}
model{
y ~ normal(y_mean,trans_y);
}"
samp_mat4 = stan(model_code=mod,data=list(y_mean=0,mat_size=4),chains=0)
samp_mat40 = stan(model_code=mod,data=list(y_mean=0,mat_size=40),chains=0)
g_mat4=grad_log_prob(samp_mat4,c(5))
g_mat40=grad_log_prob(samp_mat40,c(5))
gnum_mat4 = grad(func = function(x) log_prob(samp_mat4,x),x = c(5))
gnum_mat40 = grad(func = function(x) log_prob(samp_mat40,x),x = c(5))
g_mat4[1]
gnum_mat4
g_mat40[1]
gnum_mat40
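For this toy model the correct gradient can also be worked out by hand, which gives a reference value independent of both autodiff and numDeriv (my own derivation, not from the thread; constant terms are dropped, which doesn’t affect the gradient, and y must be positive, as it is at y = 5). With mat_size = n, test_mat = diag(y, …, y), so cholesky_decompose(test_mat) = diag(√y, …, √y) and trans_y = n√y. The log density of y ~ normal(0, n√y) is then lp(y) = −y/(2n²) − ½·log y + const, so d lp/dy = −1/(2n²) − 1/(2y). A quick check of the algebra (plain Python rather than R, just to verify the closed form against a finite difference):

```python
import math

def analytic_grad(y, n):
    # d/dy [ -y/(2 n^2) - 0.5*log(y) ]   (y-independent constants dropped)
    return -1.0 / (2.0 * n**2) - 1.0 / (2.0 * y)

def numeric_grad(y, n, h=1e-6):
    # central finite difference of the same log density
    lp = lambda t: -t / (2.0 * n**2) - 0.5 * math.log(t)
    return (lp(y + h) - lp(y - h)) / (2.0 * h)

for n in (4, 40):
    print(f"n={n}: analytic={analytic_grad(5.0, n):.6f}, "
          f"numeric={numeric_grad(5.0, n):.6f}")
```

So at y = 5 the true gradient is −0.13125 for mat_size = 4 and −0.1003125 for mat_size = 40; gnum_mat4 and gnum_mat40 should match these, and any disagreement from g_mat4/g_mat40 points at the autodiff gradient.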
@rok_cesnovar or @stevebronder: the cholesky_decompose rev C++ is well over my head; would one of you be able to chase this down further? It looks like the gradients changed from 2.21 → 2.26.
Nice… though I’m still confused, because (a) without the conditional check the cholesky function gets called nsubjects times instead of once (and things ‘seemed’ fine in that case), and (b) the cholesky function is only generating output parameters, not anything relevant for the log prob. Though I guess if there is wrong code in there, it could be doing anything…