# Reinforcement Learning with Decay Rate

Hi all,

I am running a fairly standard reinforcement learning model with a decay rate on a data set. I am fitting the participant’s choice (0 or 1, corresponding to the stimulus) using the softmax function, which is then run as a Bernoulli trial (1 if they select the second stimulus, 0 if they select the first).

When I run a simple model that is identical but does not include the decay line below, I don’t get any issues. But as soon as I apply the decay rate to the expected value based on the elapsed time,

`ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));`

the line

`current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])));`

generates the error:

`Exception: bernoulli_lpmf: Probability parameter is nan, but must be finite!`

I’m not able to figure out why a change in `ev` would generate this error. I am attaching the full code below.

Thank you!

``````
data {
  int Trial;
  int outcome[Trial];
  int choice[Trial];
  int stim1[Trial];
  int stim2[Trial];
  real times[Trial];
  int<lower=0, upper=1> current_choice[Trial];
}

transformed data {
  vector<lower=0, upper=3>[99] initV;  // initial value for EV
  vector[2] initVcur;
  initV = rep_vector(0.5, 99);
  initVcur = rep_vector(0.5, 2);
}

parameters {
  real<lower=0, upper=1> alpha_1;   // the learning rate
  real<lower=0, upper=10> beta_1;   // the temperature
  real lambda_1;
}

model {
  real PE;
  vector[99] ev;
  vector[2] current;
  current = initVcur;
  ev = initV;

  // the prior distributions
  alpha_1 ~ beta(1, 9);
  beta_1 ~ gamma(5, 1);
  lambda_1 ~ normal(0, 1);

  for (t in 1:1) {
    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // fitting a softmax function to the actual choice
    current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])));

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }

  // introduce a decay rate
  for (t in 2:Trial) {
    // decay rate with t in days
    ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    current_choice[t] ~ bernoulli(exp(beta_1*current[2])/(exp(beta_1*current[1])+exp(beta_1*current[2])));

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
}
``````

The error happens because `exp()` overflows and your softmax becomes infinity/infinity. You could use `bernoulli_logit`, which avoids this kind of numerical overflow. This should be equivalent to your code:

``````
current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);
``````
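To see the arithmetic concretely, here is a quick sketch in Python rather than Stan (function names are mine): the naive softmax fails once `exp()` overflows, while the logit-of-the-difference form gives the same probability and stays finite.

```python
import math

def softmax_prob(a, b):
    # Naive softmax P(choose option b): exp() overflows for large utilities
    try:
        return math.exp(b) / (math.exp(a) + math.exp(b))
    except OverflowError:
        return float("nan")

def logit_prob(a, b):
    # Same probability written as the logistic of the utility difference
    return 1.0 / (1.0 + math.exp(-(b - a)))

print(softmax_prob(1.0, 2.0))      # ~0.731
print(logit_prob(1.0, 2.0))        # ~0.731, identical
print(softmax_prob(800.0, 801.0))  # nan: exp(800) overflows a double
print(logit_prob(800.0, 801.0))    # ~0.731, still fine
```

This is exactly why `bernoulli_logit` takes the difference of the utilities rather than the already-exponentiated probabilities.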

Also I think `lambda_1` needs a zero lower bound. If `lambda_1` is negative it’s not a decay rate, it’s a growth rate.

Hi nhurre,

Thank you for the suggestion. If I implement it the way you suggest I get the following error:

`Exception: bernoulli_logit_lpmf: Logit transformed probability parameter is nan, but must not be nan! (in 'model13b1660b69e39_StanR_1withdecay' at line 100)`

Any idea what that means?

Thanks again.

It’s possible that you still have overflow problems.
You can try subtracting the minimum ev from both evs before calculating the choice probabilities. For example, if `current[1] = 100` and `current[2] = 101`, use temporary values 0 and 1 instead.
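A standalone sketch of that trick in Python (names are mine): the softmax is invariant to adding a constant to both utilities, so shifting them toward zero leaves the probability unchanged.

```python
import math

def choice_prob(u1, u2):
    # Subtract the smaller utility from both before exponentiating;
    # softmax is shift-invariant, so the probability is unchanged
    m = min(u1, u2)
    e1, e2 = math.exp(u1 - m), math.exp(u2 - m)
    return e2 / (e1 + e2)

print(choice_prob(100.0, 101.0))  # same as for utilities (0, 1): ~0.731
```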

Hi Guido,

Thanks for the suggestion. Even after implementing it, there still seems to be an issue with the initial values:

``````
Chain 4: Rejecting initial value:
Chain 4:   Error evaluating the log probability at the initial value.
Chain 4: Exception: bernoulli_logit_lpmf: Logit transformed probability parameter is nan, but must not be nan!  (in 'model13b16165fc9ee_StanR_1withdecay' at line 107)
``````

I’ve tried setting multiple initial `ev` values and playing around with the distribution of the decay rate and constraining its range, but nothing has helped.

Any ideas?

Thank you again.

So you have the constraint `real<lower=0> lambda_1;` in the parameters block? I don’t see any obvious problems with your model. How many trials do you have? You could try fitting a subset of your data and see if that works.

The error message says that the overflow happens at line 100 but your original code has only about 80 lines. Is there more to the model?

I specifically wanted to allow for a negative decay rate (essentially consolidation), which is why I’ve allowed lambda to be negative as well. When I tried constraining it to be positive I still received the same error.

Currently, my full code is as follows:

``````
data {
  int Trial;
  int outcome[Trial];
  int choice[Trial];
  int nonc[Trial];
  int stim1[Trial];
  int stim2[Trial];
  real times[Trial];
  int<lower=0, upper=1> current_choice[Trial];
  vector[Trial] reward;
}

transformed data {
  vector<lower=0, upper=2>[99] initV;  // initial value for EV
  vector<lower=0, upper=2>[2] initVcur;
  initV = rep_vector(0.1, 99);
  initVcur = rep_vector(0.1, 2);
}

parameters {
  real<lower=0, upper=1> alpha_1;   // the learning rate
  real<lower=0, upper=10> beta_1;   // the temperature
  real lambda_1;
}

model {
  real PE;
  // initial value for EV
  vector[99] ev;
  vector[2] current;
  current = initVcur;
  ev = initV;

  // the priors
  alpha_1 ~ beta(1, 9);
  beta_1 ~ gamma(5, 1);
  lambda_1 ~ normal(0, 1);

  for (t in 1:1) {
    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // shift both values so the smaller one is zero
    if (current[1] > current[2]) {
      current[1] -= current[2];
      current[2] = 0;
    } else {
      current[2] -= current[1];
      current[1] = 0;
    }

    current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }

  for (t in 2:Trial) {
    // decay rate with t in days
    ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));

    current[1] = ev[stim1[t]];
    current[2] = ev[stim2[t]];
    // shift both values so the smaller one is zero
    if (current[1] > current[2]) {
      current[1] -= current[2];
      current[2] = 0;
    } else {
      current[2] -= current[1];
      current[1] = 0;
    }
    current_choice[t] ~ bernoulli_logit(beta_1*current[2] - beta_1*current[1]);

    // prediction error
    PE = outcome[t] - ev[choice[t]];

    // value updating (learning)
    ev[choice[t]] += alpha_1 * PE;
  }
}
``````

The reason I suggested constraining `lambda_1` is that a negative value causes exponential growth in `ev` and I don’t see how else `current` could overflow. I assume `times` is strictly increasing and `outcome` is always either 0 or 1.

Looks like the initial value problem only affected one chain. You can set the `init=0` argument when you call `stan()` from R to turn off random initialization.

Setting init=0 seems to help. Thank you!!

Setting `init=0` solves the immediate problem, but has the disadvantage of making the interpretation of the Rhat diagnostic difficult.
What is the range of values for your outcome and time variables? If those variables have large values, rescaling them could also help.
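For example, if `times` is recorded in milliseconds (which is what the `1000*3600*24` divisor in the model suggests), the raw values can be enormous even though the gaps in days are small. A quick Python check with hypothetical timestamps:

```python
# Hypothetical timestamps in milliseconds; the model divides by 1000*3600*24
MS_PER_DAY = 1000 * 3600 * 24

times_ms = [0, 3_600_000, 90_000_000]  # 0, +1 hour, +1 further day
gaps_days = [(b - a) / MS_PER_DAY for a, b in zip(times_ms, times_ms[1:])]
print(gaps_days)  # [1/24 of a day, exactly 1.0 day]
```

Converting once to days (or rescaling to unit variance) before passing the data to Stan keeps everything in a numerically friendly range.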

Of course, the best solution would be to do a prior predictive check, to see if the priors you specified generate plausible choice data given your other data (outcomes, times).

To do a prior predictive check you can simply run a modified model in which you comment out the line with the `bernoulli_logit` likelihood and instead generate choices in the generated quantities block with `bernoulli_logit_rng`. If your priors are good, the sampled choices should be broadly similar to your observed choice data.
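The logic of a prior predictive check can also be sketched outside Stan. Below is a rough Python illustration (all names, and the placeholder reward schedule, are mine): draw parameters from priors shaped like the model’s, simulate choices forward through the decayed value-learning rule, and compare the simulated choice frequencies with the real data.

```python
import math
import random

def prior_predictive(times_days, n_draws=500, seed=0):
    """Draw (alpha, beta, lambda) from priors like the model's and
    simulate choices forward; returns one choice sequence per draw."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_draws):
        alpha = rng.betavariate(1, 9)   # learning rate ~ beta(1, 9)
        beta = rng.gammavariate(5, 1)   # temperature ~ gamma(5, 1)
        lam = rng.gauss(0, 1)           # decay rate ~ normal(0, 1)
        ev = [0.5, 0.5]
        choices = []
        for t in range(1, len(times_days)):
            # decay the expected values by the elapsed time in days
            decay = math.exp(-lam * (times_days[t] - times_days[t - 1]))
            ev = [v * decay for v in ev]
            # clamp the logit so exp() cannot overflow
            logit = max(-30.0, min(30.0, beta * (ev[1] - ev[0])))
            p = 1.0 / (1.0 + math.exp(-logit))
            c = 1 if rng.random() < p else 0
            outcome = 1 if rng.random() < 0.5 else 0  # placeholder rewards
            ev[c] += alpha * (outcome - ev[c])
            choices.append(c)
        sims.append(choices)
    return sims

sims = prior_predictive([0.0, 0.1, 0.2, 0.3, 0.4])
rate = sum(map(sum, sims)) / sum(map(len, sims))
print(rate)  # overall simulated rate of choosing stimulus 2
```

If the simulated rate (and its trial-by-trial pattern) looks nothing like the observed choices, the priors are generating implausible behavior.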


Hi Guido, somehow I missed your message previously.

The `times` variable does have a larger range, so I’ve tried to address it by normalizing it.
I don’t get any divergent chains, but without `init=0` I still get the error:

``````
Chain 2: Rejecting initial value:
Chain 2:   Log probability evaluates to log(0), i.e. negative infinity.
``````

I still cannot understand why this would happen, though I saw that when I ran the model without any lambda priors, some chains generated lambdas in the 500 range, which would bring `ev` very close to 0.

In terms of testing the prior, I did have a generated quantities section I didn’t include above, where I had

`sim_choice_bin[t] = bernoulli_logit_rng(beta_1*cur_expected_value[2] - beta_1*cur_expected_value[1]);`

Is this what you had in mind?
The data generated when lambda was normal did seem to correspond fairly well with the observed data.

Thanks again for your help!

Hello,

Is there a way to track what a Stan model is doing while it’s running? My RL model (modified from the above version) has been running for over 48 hours and is definitely stuck somewhere, since it still hasn’t even started one of its chains.

Now, I generally know that it’s getting stuck because I feed it very wide ranges for the decay rates. But I do want to be able to explore those values as well.

I am attaching my initial command:
`f2X3 <- stan("2mem_2decay.stan", data = dataList1, warmup = 500, iter = 10000, chains = 4, cores = 2, thin = 1, control = list(adapt_delta = 0.9))`

and for each chain the values look like this:

``````
Chain 3:  Elapsed Time: 4355.4 seconds (Warm-up)
Chain 3:                25143.4 seconds (Sampling)
Chain 3:                29498.8 seconds (Total)
Chain 3:
``````

I am also getting a rejection-of-initial-value error:

``````
Chain 4: Rejecting initial value:
Chain 4:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 4:   Stan can't start sampling from this initial value.
Chain 4:
``````

But I thought Stan could still run even if the initial value is rejected.

Thanks!

Stan cannot start sampling if the initial value is rejected. What Stan can do – if initial value randomization is enabled – is generate another initial value and hope that it won’t be rejected. Stan gives up if it can’t find a good initial value after several tries.

So does the model work if you restrict it to a narrower range? What is the range where it starts failing?

During the run you should see messages like

``````
Chain 1: Gradient evaluation took 6.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.63 seconds.
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
``````

You can change how often progress is reported by setting the `refresh=N` argument in your `stan(...)` call. I don’t think there’s anything else.
If your model doesn’t print anything for 48 hours, there might be some problem when moving data from R to Stan. Can you run it with `algorithm='Fixed_param'`? That won’t produce a useful fit but would confirm whether the problem is in Stan or in R.

You could add `print("here log prob is ", target());` statements to your model to see when the log probability becomes infinite. It’s going to print a lot of values, so do an `iter=1` run if it’s just the initialization that fails.
In any case `ev = 0` should not be a problem; it just means that `current_choice` will have a `bernoulli(0.5)` distribution. What I think could be a problem is the exponential growth implied by `lambda_1 < 0`, as it could lead to infinite `ev`. I don’t think it makes much sense for `ev` to go above 10 (given your prior on `beta_1`), so right after the line

``````
ev *= exp(-lambda_1*((times[t]-times[t-1])/(1000*3600*24)));
``````

you could regularize it like this

``````
ev = -10*expm1(-0.1*ev);
``````

which has little effect when `ev` is small but keeps it below 10 when the value gets too large.
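A quick numerical check of that squashing function (Python, purely for illustration): near zero it is approximately the identity, and for large inputs it saturates at 10.

```python
import math

def squash(ev):
    # -10*expm1(-0.1*ev) is ~identity for small ev and asymptotes to 10
    return -10.0 * math.expm1(-0.1 * ev)

print(squash(0.5))     # ~0.488, close to the input
print(squash(1000.0))  # ~10.0, capped
```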