# Bugs from a dynamic reinforcement learning model

I am trying a new reinforcement learning model in which the learning rate changes with the reward the agent receives at trial t. The update rule is a(t) = gamma * (reward(t) - sum(v)) + (1 - gamma) * a(t-1), where gamma is a weighting factor that determines how much external feedback the agent takes into account in the learning process. The Stan code is below:

```
data {
  int<lower=1> nSubjects;
  int<lower=1> nTrials;
  int<lower=1,upper=4> choice[nSubjects, nTrials];
  real<lower=-1150, upper=100> reward[nSubjects, nTrials];
  real<lower=-1150, upper=100> y_reward[choice[nSubjects,nTrials], nTrials];
}

transformed data {
  vector[4] initV; // initial values for V
  vector[4] initu;
  real initlr;
  initV = rep_vector(0.0, 4);
  initu = rep_vector(0.0, 4);
  initlr = 0.0;
}

parameters {
  real<lower=0,upper=5> c[nSubjects];
  real<lower=0,upper=5> loss_aversion[nSubjects];
  real<lower=0,upper=1> A[nSubjects];
  real<lower=0,upper=1> gamma[nSubjects];
}

model {
  for (s in 1:nSubjects) {
    vector[4] v;
    vector[4] U;
    real lr;
    real pe;
    real theta;
    real lamda;
    v = initV;
    U = initu;
    lr = initlr;
    theta = pow(3, c[s]) - 1;
    for (t in 1:nTrials) {
      if (reward[s,t] >= 0) {
        U[choice[s,t]] = pow(reward[s,t], A[s]);
      } else {
        U[choice[s,t]] = -loss_aversion[s] * pow(-reward[s,t], A[s]);
      }
      pe = U[choice[s,t]] - v[choice[s,t]];
      v[choice[s,t]] = v[choice[s,t]] + lr * pe;
      lamda = U[choice[s,t]] - sum(v);
      lr = gamma[s] * fabs(lamda) + (1 - gamma[s]) * lr;
      choice[s,t] ~ categorical_logit(theta * v);
    }
  }
}
```

```
generated quantities {
  int<lower=1,upper=4> y_pred[nSubjects, nTrials];
  int<lower=0> dA[nSubjects];
  int<lower=0> dB[nSubjects];
  int<lower=0> dC[nSubjects];
  int<lower=0> dD[nSubjects];
  real log_lik[nSubjects];
  real<lower=-1150,upper=100> p_reward[nSubjects, nTrials];
  vector[4] initev;
  vector[4] initeu;
  real initelr;
  initev = rep_vector(0.0, 4);
  initeu = rep_vector(0.0, 4);
  initelr = 0.0;

  for (s in 1:nSubjects) {
    vector[4] v;
    vector[4] ev;
    vector[4] U;
    vector[4] eu;
    real y_pe;
    real pe;
    real theta;
    real lamda;
    real y_lamda;
    real lr;
    real elr;
    v = initV;
    ev = initev;
    U = initu;
    eu = initeu;
    lr = initlr;
    elr = initelr;
    dA[s] = 0;
    dB[s] = 0;
    dC[s] = 0;
    dD[s] = 0;
    log_lik[s] = 0;
    theta = pow(3, c[s]) - 1;
    for (t in 1:nTrials) {
      y_pred[s,t] = categorical_logit_rng(theta * ev);

      if (y_pred[s,t] == 1) {
        dA[s] = dA[s] + 1;
        p_reward[s,t] = y_reward[1, dA[s]];
      } else if (y_pred[s,t] == 2) {
        dB[s] = dB[s] + 1;
        p_reward[s,t] = y_reward[2, dB[s]];
      } else if (y_pred[s,t] == 3) {
        dC[s] = dC[s] + 1;
        p_reward[s,t] = y_reward[3, dC[s]];
      } else {
        dD[s] = dD[s] + 1;
        p_reward[s,t] = y_reward[4, dD[s]];
      }

      if (reward[s,t] >= 0) {
        U[choice[s,t]] = pow(reward[s,t], A[s]);
      } else {
        U[choice[s,t]] = -loss_aversion[s] * pow(-reward[s,t], A[s]);
      }
      pe = U[choice[s,t]] - v[choice[s,t]];
      v[choice[s,t]] = v[choice[s,t]] + lr * pe;
      lamda = U[choice[s,t]] - sum(v);
      lr = gamma[s] * fabs(lamda) + (1 - gamma[s]) * lr;
      log_lik[s] = log_lik[s] + categorical_logit_lpmf(choice[s,t] | theta * v);

      if (p_reward[s,t] >= 0) {
        eu[y_pred[s,t]] = pow(p_reward[s,t], A[s]);
      } else {
        eu[y_pred[s,t]] = -loss_aversion[s] * pow(-p_reward[s,t], A[s]);
      }
      y_pe = eu[y_pred[s,t]] - ev[y_pred[s,t]];
      ev[y_pred[s,t]] = ev[y_pred[s,t]] + elr * y_pe;
      y_lamda = eu[y_pred[s,t]] - sum(ev);
      elr = gamma[s] * fabs(y_lamda) + (1 - gamma[s]) * elr;
    }
  }
}
```
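For context, the gain/loss utility transform that appears twice in the code can be sketched outside Stan. This Python version mirrors the if/else on the sign of the reward; the parameter values in the example calls are made up for illustration:

```python
def pvl_utility(reward, shape_a, loss_aversion):
    """Prospect-theory style utility: gains are compressed by a power
    function, losses are compressed and then amplified by a
    loss-aversion weight (mirrors the if/else in the Stan model)."""
    if reward >= 0:
        return reward ** shape_a
    return -loss_aversion * (-reward) ** shape_a

# A gain of 100 is compressed; a loss of -100 is compressed and amplified:
print(pvl_utility(100, 0.5, 2.0))   # 10.0
print(pvl_utility(-100, 0.5, 2.0))  # -20.0
```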

But I run into the following error:

```
Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_logit_lpmf: log odds parameter[2] is -inf, but must be finite! (in ‘modele87010bc1312_PVL_Delta_flexible_learning_Model’ at line 50)
```

The error comes from this statement:

`choice[s,t] ~ categorical_logit(theta*v);`

I don’t know what is wrong with this expression. Can someone help me figure it out? Thanks!

The error message tells you that the log odds vector `theta * v` contains a `-inf`. So perhaps the utility of one of the four choices is very low compared to the others and the calculation overflows. You could add `print(theta)` and `print(v)` statements to dig into what is causing the problem.

Thank you for your advice! But if I leave out lamda and the changing lr, the error no longer occurs. So I assume the bug comes from the new pieces I added, not from the values or utilities themselves.

lamda (through lr) and lr itself both affect v. Maybe they are making v very large? I still think your best bet is to use print statements to figure out where the calculations veer off to infinity. There are a lot of intermediate calculations in the code, and it is hard for me to keep track of them all.

Where should I add `print(theta)` and `print(v)`: in the Stan code or in R? The model fails in R, so the calculation never completes…

Sorry, that wasn’t clear. The print statements should be part of the Stan code. You can put them just before the offending statement.

So something like

```
print(theta); print(v);
choice[s,t] ~ categorical_logit(theta*v);
```

I printed theta and v; the output is below:
```
Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_logit_lpmf: log odds parameter[4] is inf, but must be finite! (in ‘modele8702a9e320e_PVL_Delta_flexible_learning_Model’ at line 52)

Chain 1: 0.763081
[0,0,0,0]
0.763081
[0,5.01588,0,0]
0.763081
[4.50528,5.01588,0,0]
0.763081
[4.50528,5.01588,7.56475,0]
0.763081
[-115.887,5.01588,7.56475,0]
0.763081
[-115.887,5.01588,-198.368,0]
0.763081
[12437.2,5.01588,-198.368,0]
0.763081
[12437.2,5.01588,691631,0]
0.763081
[12437.2,-143849,691631,0]
0.763081
[12437.2,4.24751e+010,691631,0]
0.763081
[12437.2,4.24751e+010,691631,4.00019e+010]
0.763081
[12437.2,4.24751e+010,-2.14665e+016,4.00019e+010]
0.763081
[-7.30999e+019,4.24751e+010,-2.14665e+016,4.00019e+010]
0.763081
[-7.30999e+019,-8.50847e+029,-2.14665e+016,4.00019e+010]
0.763081
[-7.30999e+019,-8.50847e+029,5.00257e+045,4.00019e+010]
0.763081
[-7.30999e+019,-8.50847e+029,5.00257e+045,-5.48092e+055]
0.763081
[-7.30999e+019,-8.50847e+029,5.00257e+045,8.22787e+110]
0.763081
[-7.30999e+019,-8.50847e+029,5.00257e+045,-1.85419e+221]
0.763081
[-7.30999e+019,4.32101e+250,5.00257e+045,-1.85419e+221]
0.763081
[-7.30999e+019,4.32101e+250,5.00257e+045,inf]

Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_logit_lpmf: log odds parameter[4] is inf, but must be finite! (in ‘modele8702a9e320e_PVL_Delta_flexible_learning_Model’ at line 52)

Chain 1: 0.179094
[0,0,0,0]
0.179094
[0,309.129,0,0]
0.179094
[1990.62,309.129,0,0]
0.179094
[1990.62,309.129,9262.36,0]
0.179094
[-4.37212e+006,309.129,9262.36,0]
0.179094
[-4.37212e+006,309.129,-5.67346e+009,0]
0.179094
[3.47866e+015,309.129,-5.67346e+009,0]
0.179094
[3.47866e+015,309.129,2.76379e+024,0]
0.179094
[3.47866e+015,-1.01459e+026,2.76379e+024,0]
0.179094
[3.47866e+015,1.43604e+051,2.76379e+024,0]
0.179094
[3.47866e+015,1.43604e+051,2.76379e+024,5.29303e+051]
0.179094
[3.47866e+015,1.43604e+051,-3.08234e+075,5.29303e+051]
0.179094
[-1.50154e+090,1.43604e+051,-3.08234e+075,5.29303e+051]
0.179094
[-1.50154e+090,-3.01958e+140,-3.08234e+075,5.29303e+051]
0.179094
[-1.50154e+090,-3.01958e+140,1.30338e+215,5.29303e+051]
0.179094
[-1.50154e+090,-3.01958e+140,1.30338e+215,-9.66098e+265]
0.179094
[-1.50154e+090,-3.01958e+140,1.30338e+215,inf]

Chain 1: Rejecting initial value:
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: categorical_logit_lpmf: log odds parameter[4] is inf, but must be finite! (in ‘modele8702a9e320e_PVL_Delta_flexible_learning_Model’ at line 52)
```

I guess it is probably because the lamda and lr equations make the values too large or too small, since they change simultaneously.

```
v[choice[s,t]] = v[choice[s,t]] + lr*pe;
lamda = U[choice[s,t]] - sum(v);
lr = gamma[s]*fabs(lamda) + (1-gamma[s])*lr;
```

My guess would be that the mistake is somewhere in here. lr is defined recursively, and gamma, (1 - gamma), and fabs(lamda) are all positive, so once lr > 0 it cannot decay back to zero. Worse, a larger lr produces larger updates to v, which makes sum(v) and hence fabs(lamda) larger, which in turn makes lr larger: a positive feedback loop that sends v off to infinity within a handful of trials.