Gradient evaluated at the initial value is not finite although the lp is finite

Hi. I’m trying to fit a model implementing a cognitive Bayesian hypothesis testing model for a learning task. The cognitive learning model is written in a function (with a few additional auxilary functions), and after checking with (exporting the function) it works well. The main output of the model is called pRight and it is a vector specifying the probability that the participant will choose the ‘right’ option out of two (left/right).
However, when trying to fit the model in STAN to simulated data, it constantly results in the error “Gradient evaluated at the initial value is not finite”. What is even more strange about it, is that when I add the statement print(target()) to the model it gives a finite number, suggesting that the lp is ok. Moreover, when I try to run the same model but with a subset of the data, including only 2 trials (out of 88) - the model runs fine. No matter which 2 trials I randomlly sample. 3 trials already create an error! Does this implies some sort of an float-number overflow problem?

I attach the stan model, as well as an Rdata file with a datalist to be used in the model (with an .R extension, but it can be loaded with the load function in R. sorry for that)

any help would be much appreciated.

functions{
 row_vector na_to0r(row_vector X){
 row_vector [cols(X)] x1;
 for (i in 1:cols(X)) x1[i]=is_nan(X[i])?0:X[i];
 return(x1);
}

vector na_to0c(vector X){
  vector [rows(X)] x1;
  for (i in 1:rows(X)) x1[i]=is_nan(X[i])?0:X[i];
  return(x1);
}

vector col_sums(matrix X) {
  vector[cols(X)] s ;		
  for (j in 1:cols(X)) s[j] = sum(na_to0c(col(X, j))) ;
  return s ;
}

vector row_sums(matrix X) {
  vector[rows(X)] s ;	
  for (i in 1:rows(X)) s[i] = sum(na_to0r(row(X, i))) ;
  return s ;
}

row_vector normalize(row_vector X){
  row_vector[cols(X)] norm_x;
  real sm;
  sm=sum(X);
  if (sm>0){
    norm_x=X/sm;
  }else{
    norm_x=X;
  } 
  return(norm_x);
}


vector SAmodel(matrix cues, vector target, vector response, int nTrials, real gamma, real tau, real w, real lambda0){

  int nL=nTrials;
  matrix[nTrials,92] RES;
  matrix[nTrials,3] ll;
  matrix[nL,3] xCondL[nTrials];
  matrix[nTrials+1,3] x;
  matrix[nTrials,nL] lt;

  matrix[nTrials,nL]A;
  matrix[nL,nL]B;
  vector[nTrials] pRight;

  matrix[nTrials,3] pRightCond;
  matrix[nTrials,3] pRespCond;

  matrix[nL,3] pIdent[nTrials];
  matrix[nL,3] pDiff[nTrials];
  matrix[nL,3] lambda[nTrials];
  matrix[nL,3] LR[nTrials];
  matrix[nL,3] g[nTrials];

  vector[3] x0;
  vector[nL]lt0;
  vector[3] lambda0a;


  for (t in 1:nTrials){
    for (c in 1:3){
       ll[t,c]=(cues[t,c]==target[t])?gamma:(1-gamma);
       pRespCond[t,c]=(cues[t,c]==response[t])?(w+0.5*(1-w)):(0.5*(1-w));
       pRightCond[t,c]=(cues[t,c]==1)?(w+0.5*(1-w)):(0.5*(1-w));
     }
  }


  x0=rep_vector(1,3);
  x0=x0/3;
  lambda0a=rep_vector(lambda0,3);
  lt0=rep_vector(0,nL);
  lt0[1]=1;



///////for t=1//////

//response model
  pRight[1]=dot_product(pRightCond[1,1:3],x0[1:3]');

//Experimenter's model - after response and before feedback
  xCondL[1,1,]=pRespCond[1,1:3].*x0[1:3]'; 
  xCondL[1,2:nL,1:3]=rep_matrix(0,nL-1,3);
  A[1,]=row_sums(xCondL[1,,])'; 
  for (l in 1:nL) xCondL[1,l,]=normalize(xCondL[1,l,]);

  lt[1,]=A[1,].*lt0'; 
  lt[1,]=normalize(lt[1,]);
 
  x[2,1:3]=col_sums(xCondL[1,,].*rep_matrix(lt[1,]',3))';
  x[2,]=normalize(x[2,]);


//Agent's learning model after feedback at t=1
  pIdent[1,1,1:3]=ll[1,1:3].*lambda0a'; 
  pDiff[1,1,1:3]=0.5*(1-lambda0a)'; 
  lambda[1,1,1:3]=pIdent[1,1,1:3]./(pIdent[1,1,1:3]+pDiff[1,1,1:3]); 
  LR[1,1,1:3]=log(lambda[1,1,1:3]./(1-lambda[1,1,1:3]));
  g[1,1,1:3]=1.0./(1.0+exp(20*(LR[1,1,1:3]-0)));


  for (t in 2:nTrials){
//Response model
     pRight[t]=dot_product(pRightCond[t,1:3],x[t,]); 


//Experimenter's model (after response is made but before feedback is given)
     xCondL[t,1,]=pRespCond[t,1:3].*x0[1:3]'; 
     for (l in 2:nL)xCondL[t,l,1:3]=pRespCond[t,1:3].*xCondL[t-1,l-1,1:3]; 

     A[t,]=row_sums(xCondL[t,,])'; 
     for (l in 1:nL) xCondL[t,l,]=normalize(xCondL[t,l,]);



     B=rep_matrix(0,nL,nL);
     B[,1]=row_sums(xCondL[t-1,,].*(g[t-1,,]));
     for (i in 1:nL){
       for (j in 1:nL){
          if(j==(i+1)){
               B[i,j]=sum(xCondL[t-1,i,1:3].*(1-g[t-1,i,1:3]));
          }
       }
    }


   lt[t,]=A[t,].*col_sums(B .* rep_matrix(lt[t-1,]',nL))'; 
   lt[t,]=normalize(lt[t,]);

   x[t+1,1:3]=col_sums(xCondL[t,,].*rep_matrix(lt[t,]',3))';
   x[t+1,]=normalize(x[t+1,]);

//Agent's learning model (after feedback is given)
    pIdent[t,1,1:3]=ll[t,1:3].*lambda0a';
    pDiff[t,1,1:3]=0.5*(1-lambda0a)';

    for (l in 2:nL){
       pIdent[t,l,1:3]=ll[t,1:3].*(lambda[t-1,l-1,1:3]*tau+((1-lambda[t-1,l-1,1:3])*(1-tau)/2)); 
       pDiff[t,l,1:3]=0.5*(lambda[t-1,l-1,1:3]*(1-tau)+((1-lambda[t-1,l-1,1:3])*tau)); 
    }

    lambda[t,,1:3]=pIdent[t,,1:3]./(pIdent[t,,1:3]+pDiff[t,,1:3]);
    LR[t,,1:3]=log(lambda[t,,1:3]./(1-lambda[t,,1:3]));
    g[t,,1:3]=1.0./(1.0+exp(20*(LR[t,,1:3]-0)));
  }
return(pRight);
 }
}

data{
  int nTrials;
  matrix [nTrials,3] cues;
  vector [nTrials] targets;
  int response [nTrials];
  vector [nTrials] response1;
}

parameters{
  real<lower=0.5, upper=1> tau;
  real<lower=0.5, upper=1> gamma;
  real<lower=0, upper=1> w;

}

model{
  vector [nTrials] pRight;  
  pRight=SAmodel(cues,targets,response1,nTrials,gamma,tau,w,0.5);//[,1];
  response~bernoulli(pRight);
}

"


<a class="attachment" href="//discourse-cloud-file-uploads.s3.dualstack.us-west-2.amazonaws.com/standard14/uploads/mc_stan/original/2X/b/b657c4f63818a99b9e36ddc7740361a2252f8144.R">datalist.R</a> (367 Bytes)

Could be, hard to say in a model this involved. But! If you’re scared of something like this, probly the place to start is swapping in a bernoulli_logit in place of your bernoulli and make SAmodel work on the log scale. It sure doesn’t look easy to make SAmodel work on the log scale though.

Thanks for the reply. I could try but it will be quite hard, unless I just convert the very final part determining pRight according to other variables to a log scale.
I guess my first question is how to understand such an error where the derivative is inifnite when the lp itself looks finite? What are the most common causes for this error?
Another direction I was thinking in is that because of some features of the model, changing the parameters does not lead to marked enough changes in the liklelihood (e.g., changing one of the parameters by 0.01 units - where all parameters are bounded from 0 to 1 at-most changes most of the pRight values by no more than 0.0001 units) - could this be the reason for this issue? If this is the case - is there an option of increasing the step-size of the autodiff?
Also, is there any way of actually tracking the process STAN is doing when finding the derivative, to understand exactly what goes wrong?

Thanks,
Isaac

It’s nigh impossible to debug the reverse pass unfortunately.

But this just feels like it has to be an underflow issue somewhere. Working with probabilities on the 0-1 scale just blows up.

But looking at the code I agree it looks like a daunting task to convert stuff over, so it’d be nice to be more certain of a diagnosis before engaging in such an endeavor.

  1. Is there a simpler model that shows the same behavior?

  2. This function looks very complicated. How convinced are you that it is correct? Have you exposed it to R (check expose_stan_function https://cran.r-project.org/web/packages/rstan/rstan.pdf) and computed values to make sure things are sane?

  3. You can also get gradients at certain parameter points in R as well. I dunno if it’d be possible to figure out if the gradients work anywhere?

1 Like

Hi Ben.

I have done sanity checks for the function, and it seems to work. Moreover, it always produces finite log-likelihood and this ll does seem to change when I change one of the parameters even slightly.
I guess that what is hard for me to understand is how can the code produce a finite, normal lp (using print(target()) for example), where the gradient is infinite. How would you try to get gradients in R (with the output of the exposed function for example)?

Best,
I

Check grad_log_prob in the Rstan manual for getting gradients of your model in R. You can verify there that you’re getting finite targets and infinite gradients. Maybe only some of the gradients are infinite and that gives you a clue?

Yeah, I don’t understand the finite target() but grads blowing up either.

I’d say your best bet for figuring stuff out is just removing sections of the model bit by bit till the gradients start working. Or preferably build things up from smaller pieces. It probably won’t be a pleasant experience either way though.

1 Like

Thanks for the advice!

I actually managed to find the (very surprising) problem. In the custom col_sums and row_sums functions I’ve used, I’ve implemented something similar to na.rm=T in R by removing NaN items. It worked when exposing the function to R and using it but apparently it created the problem with the gradients.
So sampling works now, however I have a new problem which is more a question of optimization. When using parallel sampling (with 3/8 cores on my computer), I get the following error:

SAMPLING FOR MODEL ‘ee52d32c00d52af2033b4376d8f905c0’ NOW (CHAIN 3).
Exception: Exception: Exception: std::bad_alloc

From reading a bit I understand that this is related to RAM issues. So I’m now trying to optimize the model to make it use as little memory as possible yet run as fast as possible. I would be really grateful in case you have any advice here.

Thanks a lot!
Isaac.

1 Like

Good find!

How big is nTrials?

nTrials is only 88. I’m guessing that all of the matrices are taking a lot of memory (especially in a hierarchical model with 60 subjects). I’m was trying to convert all matrix arrays to simple matrices (without t) and just overwrite on each trial but that didn’t help so much…

Isaac.

bbbales2

    November 24

Isaac_Fradkin:
I actually managed to find the (very surprising) problem

Good find!

Isaac_Fradkin:
So I’m now trying to optimize the model to make it use as little memory as possible yet run as fast as possible.

How big is nTrials?

Hi, Isaac_Fradkin
With your inspiration,i also want to remove NAN items. But in functions block, for example
vector col_sums(matrix X) {
vector[cols(X)] s ;
for (j in 1:cols(X)) s[j] = sum(na_to0c(col(X, j))) ;
return s ;
}
add the statement na.rm=TRUE,like that
vector col_sums(matrix X) {
vector[cols(X)] s ;
for (j in 1:cols(X)) s[j] = sum(na_to0c(col(X, j)), na.rm=TRUE) ;
return s ;
}
It may result in the error “variable ‘na.rm’ does not exist”.Could you tell me how to “implemented something similar to na.rm=T in R by removing NaN items”.
Thanks a lot!

There is an is_nan function in the Stan language which you could probably use to do what you want (https://mc-stan.org/docs/2_22/functions-reference/logical-functions.html).

Go ahead and feel free to start a new topic if that doesn’t answer your question though – people are more likely to see the new stuff!

Thanks for the advice! is_nan means “Return 1 if x is NaN and 0 otherwise”, which can cause discontinuities in gradients when applied to parameters. I want to removie NaN items(something similar to na.rm=T in R), not to return int. I would be really grateful in case you have any advice here.
Thanks a lot!

With is.nan you can skip NaNs. Though it isn’t removing them, it can work out to be the same. Like this:

real total = 0.0;
for(i in 1:N) {
  if(!is_nan(vec[i])) {
    total += vec[i];
  }
}