Reinforcement Learning with Decay Rate

Hi all,

I am running a fairly standard reinforcement learning model with a decay rate on a data set. I am fitting the participant’s choice (0 or 1, corresponding to the stimulus) using the softmax function, which is then run as a Bernoulli trial (1 if they select the second stimulus, 0 if they select the first).
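That is, with expected values $Q_1$ and $Q_2$ for the two stimuli and temperature parameter $\beta$ (beta_1 in the code below), the fitted choice probability is

$$P(\text{choice}_t = 1) = \frac{\exp(\beta Q_2)}{\exp(\beta Q_1) + \exp(\beta Q_2)}$$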

When I run a simple model that is identical but does not include the following decay line, I don’t get any issues. But as soon as I introduce the decay of the expected value based on the time elapsed,

> ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

the softmax line

> current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])))

generates the error:

> Exception: bernoulli_lpmf: Probability parameter is nan, but must be finite!

I’m not able to figure out why a change in ev would generate this error. I am attaching the full code below.

Thank you!

data {
  int Trial;                                    // number of trials
  int outcome[Trial];                           // reward outcome on each trial
  int choice[Trial];                            // index of the chosen stimulus
  int stim1[Trial];                             // index of the first stimulus shown
  int stim2[Trial];                             // index of the second stimulus shown
  real times[Trial];                            // trial timestamps (ms)
  int<lower=0, upper=1> current_choice[Trial];  // 1 if the second stimulus was chosen
}

transformed data {
  vector<lower=0, upper=3>[99] initV;  // initial expected values
  vector[2] initVcur;
  initV = rep_vector(0.5, 99);
  initVcur = rep_vector(0.5, 2);
}

parameters {
  real<lower=0, upper=1> alpha_1;   // the learning rate
  real<lower=0, upper=10> beta_1;   // the temperature
  real lambda_1;                    // the decay rate
}

model {
  real PE;
  vector[99] ev;
  vector[2] current;
  current = initVcur;
  ev = initV;

  // the prior distributions
  alpha_1 ~ beta(1, 9);
  beta_1 ~ gamma(5, 1);
  lambda_1 ~ normal(0, 1);

  // first trial (no decay applied yet)
  for (t in 1:1) {
    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // fitting a softmax function to the actual choice
    current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])));

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
 
  // introduce a decay rate
  for (t in 2:Trial) {
    // decay with the elapsed time converted from ms to days
    ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])));

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
}

The error happens because exp() overflows and your softmax becomes infinity/infinity (in double precision, exp() overflows once its argument exceeds about 710). You could use bernoulli_logit, which tries to avoid numerical overflow. This should be equivalent to your code:

    current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);

Also I think lambda_1 needs a zero lower bound. If lambda_1 is negative it’s not a decay rate, it’s a growth rate.
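That is, in the parameters block:

    real<lower=0> lambda_1; // the decay rate, constrained to be non-negative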

Hi nhurre,

Thank you for the suggestion. If I implement it the way you suggest I get the following error:

Exception: bernoulli_logit_lpmf: Logit transformed probability parameter is nan, but must not be nan! (in 'model13b1660b69e39_StanR_1withdecay' at line 100)

Any idea what that means?

Thanks again.

It’s possible that you still have overflow problems.
You can try subtracting the minimum ev from both evs before calculating the choice probabilities. For example, if current[1]=100 and current[2]=101, make temporary variables tmpcurrent[1]=0 and tmpcurrent[2]=1.
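A minimal sketch of that in the model block, right after current is filled in (the braces let the temporary m, an illustrative name, be declared locally; fmin takes the smaller of two reals):

    {
      real m = fmin(current[1], current[2]);
      current[1] -= m; // shift both values so the smaller one becomes 0
      current[2] -= m;
    }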

Hi Guido,

Thanks for the suggestion. In implementing it there still seems to be an issue with the initial values:

Chain 4: Rejecting initial value:
Chain 4:   Error evaluating the log probability at the initial value.
Chain 4: Exception: bernoulli_logit_lpmf: Logit transformed probability parameter is nan, but must not be nan!  (in 'model13b16165fc9ee_StanR_1withdecay' at line 107)

I’ve tried setting multiple initial ev values, playing around with the distribution of the decay rate and constraining its range, but nothing has helped.

Any ideas?

Thank you again.

So you have the constraint real<lower=0> lambda_1; in the parameters block? I don’t see any obvious problems with your model. How many trials do you have? You could try fitting a subset of your data and see if that works.

The error message says that the overflow happens at line 100, but your original code has only about 80 lines. Is there more to the model?

I specifically wanted to allow for a negative decay rate (essentially consolidation), and that’s why I’ve allowed lambda to be negative as well. When I tried to constrain it to be positive I still received the same error.

Currently my full code is as follows:

data {
  int Trial;                                    // number of trials
  int outcome[Trial];                           // reward outcome on each trial
  int choice[Trial];                            // index of the chosen stimulus
  int nonc[Trial];                              // (not used in the model below)
  int stim1[Trial];                             // index of the first stimulus shown
  int stim2[Trial];                             // index of the second stimulus shown
  real times[Trial];                            // trial timestamps (ms)
  int<lower=0, upper=1> current_choice[Trial];  // 1 if the second stimulus was chosen
  vector[99] reward;                            // (not used in the model below)
}

transformed data {
  vector<lower=0, upper=2>[99] initV;  // initial expected values
  vector<lower=0, upper=2>[2] initVcur;
  initV = rep_vector(0.1, 99);
  initVcur = rep_vector(0.1, 2);
}

parameters {
  real<lower=0, upper=1> alpha_1;   // the learning rate
  real<lower=0, upper=10> beta_1;   // the temperature
  real lambda_1;                    // the decay rate
}

model {
  real PE;
  vector[99] ev;  // expected values
  vector[2] current;
  current = initVcur;
  ev = initV;

  // the priors
  alpha_1 ~ beta(1, 9);
  beta_1 ~ gamma(5, 1);
  lambda_1 ~ normal(0, 1);

  // first trial (no decay applied yet)
  for (t in 1:1) {
    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // subtract the smaller ev from both to reduce the risk of overflow
    if (current[1] > current[2]) {
      current[1] -= current[2];
      current[2] = 0;
    } else {
      current[2] -= current[1];
      current[1] = 0;
    }
    current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
 
  for (t in 2:Trial) {
    // decay with the elapsed time converted from ms to days
    ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // subtract the smaller ev from both to reduce the risk of overflow
    if (current[1] > current[2]) {
      current[1] -= current[2];
      current[2] = 0;
    } else {
      current[2] -= current[1];
      current[1] = 0;
    }
    current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
}

The reason I suggested constraining lambda_1 is that a negative value causes exponential growth in ev and I don’t see how else current could overflow. I assume times is strictly increasing and outcome is always either 0 or 1.

Looks like the initial value problem only affected one chain. You can set the init=0 argument when you call stan() from R to turn off random initialization.

Setting init=0 seems to help. Thank you!!

Setting init=0 solves the immediate problem, but it has the disadvantage of making the interpretation of the Rhat diagnostic difficult.
What is the range of values for your outcome and time variables? If those variables have large values, rescaling them could also help.
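If times is in milliseconds, for example, you could rescale it to days once in transformed data instead of dividing inside the loop. A sketch (times_days is just an illustrative name):

    transformed data {
      real times_days[Trial];
      for (t in 1:Trial)
        times_days[t] = times[t] / (1000.0 * 3600 * 24); // ms -> days
    }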

Of course, the best solution would be to do a prior predictive check, to see if the priors you specified generate plausible choice data given your other data (outcomes, times).

To do a prior predictive check you can simply run a modified model in which you comment out the line with the bernoulli_logit_lpmf and instead generate choices in the generated quantities block with bernoulli_logit_rng. If your priors are good, the sampled choices should be broadly similar to your observed choice data.
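A minimal sketch of such a generated quantities block, reusing the forward pass from the model above (sim_choice is just an illustrative name):

    generated quantities {
      int sim_choice[Trial];
      {
        vector[99] ev;
        vector[2] current;
        ev = initV;
        for (t in 1:Trial) {
          if (t > 1)
            ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));
          current[1] = ev[stim1[t]];
          current[2] = ev[stim2[t]];
          // simulate a choice given the current expected values
          sim_choice[t] = bernoulli_logit_rng(beta_1*(current[2] - current[1]));
          // the learning update still uses the observed choice and outcome
          ev[choice[t]] += alpha_1 * (outcome[t] - ev[choice[t]]);
        }
      }
    }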


Hi Guido, somehow I missed your message previously.

The times variable does have a larger range, so I’ve tried to address it by normalizing it.
I don’t get any divergent chains, but without init=0 I still manage to get the error:

Chain 2: Rejecting initial value:
Chain 2:   Log probability evaluates to log(0), i.e. negative infinity.

I still cannot understand why this would happen, though I saw that when I ran the model without any lambda priors, some chains generated lambdas in the 500 range, which would bring ev very close to 0.

In terms of testing the prior, I did have a generated quantities section I didn’t include above, where I had

sim_choice_bin[t]= bernoulli_logit_rng(beta_1*cur_expected_value[2] - beta_1*cur_expected_value[1]);

Is this what you had in mind?
The data generated when lambda was normal did seem to correspond fairly well with the observed data.

Thanks again for your help!

Hello,

Is there a way to track what the Stan model is doing while it’s running? My RL model (modified from the above version) has been running for >48 hours and is definitely just stuck somewhere, since one of its chains still hasn’t even started.

Now, I generally know that it’s getting stuck because I feed it very wide ranges for the decay rates. But I do want to be able to explore those values as well.

I am attaching my initial command:
f2X3 <- stan("2mem_2decay.stan", data = dataList1, warmup = 500, iter = 10000, chains = 4, cores = 2, thin = 1, control = list(adapt_delta = 0.9))

and for each chain the values look like this:

Chain 3:  Elapsed Time: 4355.4 seconds (Warm-up)
Chain 3:                25143.4 seconds (Sampling)
Chain 3:                29498.8 seconds (Total)
Chain 3: 

I am getting a “Rejecting initial value” error:

Chain 4: Rejecting initial value:
Chain 4:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 4:   Stan can't start sampling from this initial value.
Chain 4: 

But I thought Stan could still run even if the initial value is rejected.

Thanks!

Stan cannot start sampling if the initial value is rejected. What Stan can do – if initial value randomization is enabled – is generate another initial value and hope that it won’t be rejected. Stan gives up if it can’t find a good initial value after several tries.

So does the model work if you restrict it to a narrower range? What is the range where it starts failing?

During the run you should see messages like

Chain 1: Gradient evaluation took 6.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.63 seconds.
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)

You can change how often the progress is reported by setting the refresh=N argument in your stan(...) call. I don’t think there’s anything else.
If your model doesn’t print anything for 48 hours there might be some problem when moving data from R to Stan. Can you run it with algorithm='Fixed_param'? That won’t produce a useful fit, but it would confirm whether the problem is in Stan rather than in R.

You could add print("here log prob is ", target()); statements to your model to see when the log probability becomes infinite. It’s going to print a lot of values, so do an iter=1 run if it’s just the initialization that fails.
In any case ev=0 should not be a problem; it just means that current_choice will have a bernoulli(0.5) distribution. What I think could be a problem is the exponential growth implied by lambda < 0, as it could lead to infinite ev. I don’t think it makes much sense to have ev go above 10 (given your prior on beta_1), so right after the line

ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

you could regularize it like this

ev = -10*expm1(-0.1*ev);

which has little effect if ev is small but brings it below 10 when the value gets too large.