I am trying a new reinforcement learning model in which the learning rate changes with the reward the agent received on trial t. The update rule is a(t) = gamma*(reward(t) - sum(v)) + (1 - gamma)*a(t-1), where gamma is a weight factor that determines how much external feedback the agent takes into account in the learning process.
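In the Stan code the same rule is applied to the learning rate lr, except that the prospect-theory utility of the chosen option replaces the raw reward and the feedback term is taken in absolute value (this is my restatement of the update line lr = gamma[s]*fabs(lamda) + (1 - gamma[s])*lr in the model block below):

$$ \mathrm{lr}_t \;=\; \gamma\,\bigl|\,U_{c_t} - \textstyle\sum_k v_k\,\bigr| \;+\; (1-\gamma)\,\mathrm{lr}_{t-1} $$

The full Stan code is below: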
data {
  int<lower=1> nSubjects;
  int<lower=1> nTrials;
  int<lower=1,upper=4> choice[nSubjects, nTrials];            // chosen option (1-4) per subject and trial
  real<lower=-1150, upper=100> reward[nSubjects, nTrials];    // received reward per subject and trial
  real<lower=-1150, upper=100> y_reward[4, nTrials];          // payoff schedule per option, indexed 1-4 in generated quantities
}
transformed data {
  vector[4] initV; // initial values for V
  vector[4] initu;
  real initlr;
  initV = rep_vector(0.0, 4);
  initu = rep_vector(0.0, 4);
  initlr = 0.0;
}
parameters {
  real<lower=0,upper=5> c[nSubjects];              // consistency (softmax sensitivity) parameter
  real<lower=0,upper=5> loss_aversion[nSubjects];  // loss aversion
  real<lower=0,upper=1> A[nSubjects];              // outcome sensitivity (utility shape)
  real<lower=0,upper=1> gamma[nSubjects];          // weight on external feedback in the learning-rate update
}
model {
  for (s in 1:nSubjects) {
    vector[4] v;      // expected value of each option
    vector[4] U;      // subjective utility of the received reward
    real lr;          // trial-wise learning rate
    real pe;          // prediction error
    real theta;       // softmax sensitivity
    real lamda;       // feedback term that drives the learning rate
    v = initV;
    U = initu;
    lr = initlr;
    theta = pow(3, c[s]) - 1;
    for (t in 1:nTrials) {
      // prospect-theory utility of the obtained reward
      if (reward[s,t] >= 0) {
        U[choice[s,t]] = pow(reward[s,t], A[s]);
      } else {
        U[choice[s,t]] = -loss_aversion[s] * pow(-reward[s,t], A[s]);
      }
      // delta-rule value update with the current learning rate
      pe = U[choice[s,t]] - v[choice[s,t]];
      v[choice[s,t]] = v[choice[s,t]] + lr * pe;
      // feedback-dependent learning-rate update
      lamda = U[choice[s,t]] - sum(v);
      lr = gamma[s] * fabs(lamda) + (1 - gamma[s]) * lr;
      // choice likelihood
      choice[s,t] ~ categorical_logit(theta * v);
    }
  }
}
generated quantities {
  int<lower=1,upper=4> y_pred[nSubjects, nTrials];           // posterior-predictive choices
  int<lower=0> dA[nSubjects];                                // counts of predicted choices of each option
  int<lower=0> dB[nSubjects];
  int<lower=0> dC[nSubjects];
  int<lower=0> dD[nSubjects];
  real log_lik[nSubjects];                                   // per-subject log likelihood
  real<lower=-1150,upper=100> p_reward[nSubjects, nTrials];  // rewards assigned to the predicted choices
  vector[4] initev;
  vector[4] initeu;
  real initelr;
  initev = rep_vector(0.0, 4);
  initeu = rep_vector(0.0, 4);
  initelr = 0.0;
  for (s in 1:nSubjects) {
    vector[4] v;
    vector[4] ev;
    vector[4] U;
    vector[4] eu;
    real y_pe;
    real pe;
    real theta;
    real lamda;
    real y_lamda;
    real lr;
    real elr;
    v = initV;
    ev = initev;
    U = initu;
    eu = initeu;
    lr = initlr;
    elr = initelr;
    dA[s] = 0;
    dB[s] = 0;
    dC[s] = 0;
    dD[s] = 0;
    log_lik[s] = 0;
    theta = pow(3, c[s]) - 1;
    for (t in 1:nTrials) {
      // draw a predicted choice and look up its reward in the payoff schedule
      y_pred[s,t] = categorical_logit_rng(theta * ev);
      if (y_pred[s,t] == 1) {
        dA[s] = dA[s] + 1;
        p_reward[s,t] = y_reward[1, dA[s]];
      } else if (y_pred[s,t] == 2) {
        dB[s] = dB[s] + 1;
        p_reward[s,t] = y_reward[2, dB[s]];
      } else if (y_pred[s,t] == 3) {
        dC[s] = dC[s] + 1;
        p_reward[s,t] = y_reward[3, dC[s]];
      } else {
        dD[s] = dD[s] + 1;
        p_reward[s,t] = y_reward[4, dD[s]];
      }
      // same updates as in the model block, driven by the actual choice and reward
      if (reward[s,t] >= 0) {
        U[choice[s,t]] = pow(reward[s,t], A[s]);
      } else {
        U[choice[s,t]] = -loss_aversion[s] * pow(-reward[s,t], A[s]);
      }
      pe = U[choice[s,t]] - v[choice[s,t]];
      v[choice[s,t]] = v[choice[s,t]] + lr * pe;
      lamda = U[choice[s,t]] - sum(v);
      lr = gamma[s] * fabs(lamda) + (1 - gamma[s]) * lr;
      log_lik[s] = log_lik[s] + categorical_logit_lpmf(choice[s,t] | theta * v);
      // parallel updates driven by the predicted choice and its reward
      if (p_reward[s,t] >= 0) {
        eu[y_pred[s,t]] = pow(p_reward[s,t], A[s]);
      } else {
        eu[y_pred[s,t]] = -loss_aversion[s] * pow(-p_reward[s,t], A[s]);
      }
      y_pe = eu[y_pred[s,t]] - ev[y_pred[s,t]];
      ev[y_pred[s,t]] = ev[y_pred[s,t]] + elr * y_pe;
      y_lamda = eu[y_pred[s,t]] - sum(ev);
      elr = gamma[s] * fabs(y_lamda) + (1 - gamma[s]) * elr;
    }
  }
}
But I run into this error when sampling:
Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_logit_lpmf: log odds parameter[2] is -inf, but must be finite! (in ‘modele87010bc1312_PVL_Delta_flexible_learning_Model’ at line 50)
The error points to this statement:
choice[s,t]~categorical_logit(theta*v);
I don’t know what is wrong with this expression. Can someone help me figure it out? Thanks!
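In case it helps with diagnosing, one thing I could do is add a print() statement right before the sampling statement in the model block, so the console shows on which trial theta * v stops being finite (this is only a debugging sketch, not part of the model itself):

      // debugging sketch: print the quantities entering the likelihood on every trial
      print("s=", s, " t=", t, " lr=", lr, " theta=", theta, " v=", v);
      choice[s,t] ~ categorical_logit(theta * v);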