Hi all,

I’m trying to fit parameters to a model of a 2-armed bandit task that implements Q-learning along with an additional step that adds some noise to the model (only at the start of the model, when shape values are being updated for the first time).

In this model:

- A pair of shapes (from a set of 12) is displayed to the subject
- The subject chooses the more valuable shape (either the shape on the left or the right) through a softmax selection rule
- If the probability that the subject learns the value of a specific shape (pEncode) is greater than some random value (noise), the selected shape’s value is updated through Q-learning from the current trial onwards
- If the probability is lower than the random noise, no Q-learning happens on the current trial
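For reference, the steps above can be sketched in Python as a single trial of the generative process (a simplified sketch of my own; the function and variable names are illustrative and not from the Stan model):

```python
import math

def simulate_trial(q, left, right, beta, alpha, p_encode, reward, rng):
    # Softmax over two options reduces to a logistic in the value difference:
    # probability of choosing the right-hand shape
    p_right = 1.0 / (1.0 + math.exp(-beta * (q[right] - q[left])))
    chosen = right if rng.random() < p_right else left
    # Encode (Q-learn) the chosen shape's value only when the noise draw
    # falls below the encoding probability
    if rng.random() < p_encode:
        q[chosen] += alpha * (reward - q[chosen])
    return chosen
```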

Why I think the discrete rule used to switch between the last two steps above might be causing issues:

- I have not run into convergence issues with a previous version of this model that implements Q-learning on every trial and doesn’t use a discrete rule to decide whether or not to run Q-learning.
- When I check the log density with print(target()) inside the *if* block (implementing Q-learning) and the *else if* block (resetting to the initial shape value), lp__ is about -400 and -1000 respectively.

I’ve seen some discussions about marginalizing out discrete parameters, but I’m not sure whether that’s something I need to be looking at or whether there are other things I can do to improve the model. I would appreciate any guidance on how to move forward!
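For context on the marginalization idea: the usual fix for a discrete per-trial indicator is to replace the hard if/else with a mixture of the two branch log-likelihoods weighted by pEncode, which is what Stan’s `log_mix` function computes. A minimal Python sketch of that computation (my own illustration, not the model above; note that here the indicator also changes q on future trials, so a faithful marginalization would have to propagate both branches forward rather than mix trial by trial):

```python
import math

def log_mix(theta, lp1, lp2):
    # Stably computes log(theta * exp(lp1) + (1 - theta) * exp(lp2)),
    # mirroring Stan's log_mix(theta, lp1, lp2)
    a = math.log(theta) + lp1      # "encode" branch, weight theta
    b = math.log1p(-theta) + lp2   # "no encode" branch, weight 1 - theta
    m = max(a, b)                  # log-sum-exp trick for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```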

Here’s my Stan code:

```
data {
  int T;                       // number of trials
  int N;                       // number of subjects
  int sub_selectedShape[T, N]; // 0: shape displayed on left selected,
                               // 1: shape displayed on right selected,
                               // 2: invalid response
  real reward[T, N];           // displayed reward (from 1 to 12) for the shape
                               // selected by the subject per trial
  int shapes[T, 2];            // indices of the displayed shape pair, from 1 to 12
  real Q0;                     // initial value of all 12 shapes (between 1 and 12)
}
transformed data {
  real prand[T, N]; // random values between 0 and 1 for each subject and
                    // each trial (meant to add noise)
  for (n in 1:N) {
    for (t in 1:T) {
      prand[t, n] = uniform_rng(0, 1);
    }
  }
}
parameters {
  // population parameters
  real alpha_mu;
  real<lower=0.001> alpha_sd;
  real beta_mu;
  real<lower=0.001> beta_sd;
  real pEncode_mu;
  real<lower=0.001> pEncode_sd;
  // subject parameters
  real alpha_sub[N];
  real beta_sub[N];
  real pEncode_sub[N];
}
model {
  alpha_mu ~ normal(0, 1);
  alpha_sd ~ normal(0, 1);
  beta_mu ~ normal(0, 1);
  beta_sd ~ normal(0, 1);
  pEncode_mu ~ normal(0, 1);
  pEncode_sd ~ normal(0, 1);
  for (n in 1:N) {
    real alpha;       // learning rate (between 0 and 1)
    real beta;        // softmax inverse temperature (between 0 and 3)
    real pEncode;     // probability of updating each shape's value for the first time
    real pflag[12];   // 1 until a shape's value has been updated once, then 0
    real q[12];       // current value of each shape on the current trial
    int shape_chosen; // index (1 to 12) of the shape selected by the subject
    alpha_sub[n] ~ normal(alpha_mu, alpha_sd);
    alpha = Phi_approx(alpha_sub[n]);
    beta_sub[n] ~ normal(beta_mu, beta_sd);
    beta = 3 * Phi_approx(beta_sub[n]);
    pEncode_sub[n] ~ normal(pEncode_mu, pEncode_sd);
    pEncode = Phi_approx(pEncode_sub[n]);
    // Initialize shape values
    for (i in 1:12) {
      q[i] = Q0;
      pflag[i] = 1;
    }
    for (t in 1:T) {
      if (sub_selectedShape[t, n] != 2) {
        sub_selectedShape[t, n] ~ bernoulli_logit(beta * (q[shapes[t, 2]] - q[shapes[t, 1]]));
        shape_chosen = shapes[t, sub_selectedShape[t, n] + 1];
        if (pflag[shape_chosen] * prand[t, n] < pEncode) {
          // the probability of updating the chosen shape's value (pEncode)
          // is greater than the noise, so Q-learning is carried out
          pflag[shape_chosen] = 0;
          q[shape_chosen] += alpha * (reward[t, n] - q[shape_chosen]);
        } else if (pflag[shape_chosen] * prand[t, n] > pEncode) {
          // otherwise the chosen shape's value stays at its initial value
          q[shape_chosen] = Q0;
        }
      }
    }
  }
}
```

For data:

```
T = 1584          // number of trials
N = 1             // number of subjects
sub_selectedShape // 1584 x 1 array of 0s and 1s
reward            // 1584 x 1 array of values between 1 and 12
shapes            // 1584 x 2 array of displayed shape indices between 1 and 12
Q0 = 6            // initial value of all shapes
```

Thank you!