Convergence issue due to discrete step in Q-learning model

Hi all,

I’m trying to fit the parameters of a 2-armed bandit task model that implements Q-learning, plus an additional step that adds some noise to the model (only at the beginning, while each shape’s value is being updated for the first time).

In this model:

  1. A pair of shapes (from a set of 12) is displayed to a subject
  2. The subject chooses the most valuable shape (either the shape on the left or right) through a softmax selection rule
  3. If the probability for a subject to learn the value of a specific shape is greater than some random value (noise), the selected shape’s value will be updated through Q-learning from the current trial onwards.
  4. If the probability is lower than random noise, no q-learning happens for the current trial

Why I think the discrete rule used to switch between steps 3 and 4 might be causing issues:

  1. I have not run into convergence issues with a previous version of this model that implements q-learning every trial and doesn’t use a discrete rule to switch between whether or not to run q-learning.
  2. When I check the accumulated log density with print(target()) inside the if block (implementing Q-learning) and inside the else if block (resetting to the initial shape value), the value is about -400 and -1000 respectively.

I’ve seen some discussions about marginalizing out discrete parameters but I’m not sure if that’s something I need to be looking at or if there are other things I can do to improve the model. I would appreciate any guidance on how to move forwards!

Here’s my Stan code:

 data {
     int T;                        // number of trials
     int N;                        // number of subjects
     int sub_selectedShape[T, N];  // 0: shape displayed on left selected, 1: shape
                                   // displayed on right selected, 2: invalid response
     real reward[T, N];            // displayed reward (from 1 to 12) for shape selected
                                   // by subject per trial
     int shapes[T, 2];             // values of displayed shape pair, from 1 to 12
     real Q0;                      // initial value of all 12 shapes (value between 1 to 12)
 }

 transformed data {
     real prand[T, N];             // random values between 0 and 1 for each subject and
                                   // each trial (meant to add noise)
     for (n in 1:N) {
         for (t in 1:T) {
             prand[t, n] = uniform_rng(0, 1);
         }
     }
 }

 parameters {
     // population parameters
     real alpha_mu;
     real<lower=0.001> alpha_sd;
     real beta_mu;
     real<lower=0.001> beta_sd;
     real pEncode_mu;
     real<lower=0.001> pEncode_sd;
     // subject parameters
     real alpha_sub[N];
     real beta_sub[N];
     real pEncode_sub[N];
 }

 model {
     alpha_mu ~ normal(0, 1);
     alpha_sd ~ normal(0, 1);
     beta_mu ~ normal(0, 1);
     beta_sd ~ normal(0, 1);
     pEncode_mu ~ normal(0, 1);
     pEncode_sd ~ normal(0, 1);
     for (n in 1:N) {
         real alpha;        // learning rate (between 0 and 1)
         real beta;         // softmax temperature (between 0 and 3)
         real pEncode;      // probability of updating a shape's value for the first time
         real pflag[12];    // 1 until a shape's value has first been updated, then 0
         real q[12];        // current value of each shape
         int shape_chosen;  // index (1 to 12) of the shape selected by the subject
         alpha_sub[n] ~ normal(alpha_mu, alpha_sd);
         alpha = Phi_approx(alpha_sub[n]);
         beta_sub[n] ~ normal(beta_mu, beta_sd);
         beta = 3 * Phi_approx(beta_sub[n]);
         pEncode_sub[n] ~ normal(pEncode_mu, pEncode_sd);
         pEncode = Phi_approx(pEncode_sub[n]);
         // Initialize shape values
         for (i in 1:12) {
             q[i] = Q0;
             pflag[i] = 1;
         }
         for (t in 1:T) {
             if (sub_selectedShape[t, n] != 2) {
                 // choice likelihood: driven by q(right shape) - q(left shape)
                 sub_selectedShape[t, n] ~ bernoulli_logit(beta * (q[shapes[t, 2]] - q[shapes[t, 1]]));
                 shape_chosen = shapes[t, sub_selectedShape[t, n] + 1];
                 if (pflag[shape_chosen] * prand[t, n] < pEncode) {
                     // if the probability of updating the chosen shape's value (pEncode) is
                     // greater than the noise, q-learning is carried out
                     pflag[shape_chosen] = 0;
                     q[shape_chosen] += alpha * (reward[t, n] - q[shape_chosen]);
                 } else if (pflag[shape_chosen] * prand[t, n] > pEncode) {
                     // otherwise, the chosen shape's value stays at its initial value
                     q[shape_chosen] = Q0;
                 }
             }
         }
     }
 }

For data:

 T = 1584   // Number of trials
 N = 1        // Number of subjects
 sub_selectedShape  // 1584 x 1 array of 0s or 1s
 reward      // 1584 x 1 array of a value between 1 to 12
 shapes      // 1584 x 2 array of the displayed shape values between 1 to 12
 Q0 = 6      // initial value of all shapes 

Thank you!

Welcome to the Stan forum.

One alternative way to include this noise is to multiply the prediction error by pEncode. Then you still have a random influence on learning, but it will be smooth.
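A minimal sketch of that idea, written as a user-defined function so it does not depend on the rest of the program. The function name smooth_q_update and its argument names are hypothetical; only alpha, pEncode, q, reward and shape_chosen come from the posted model.

```stan
functions {
  // Smooth Q-update: the prediction error (r - q_old) is scaled by a weight
  // w in [0, 1] (pEncode would be passed here) instead of being switched
  // on or off by comparing pEncode against a random threshold.
  real smooth_q_update(real q_old, real r, real alpha, real w) {
    return q_old + w * alpha * (r - q_old);
  }
}
```

Inside the trial loop, the if / else if branch would then presumably be replaced by a single line such as `q[shape_chosen] = smooth_q_update(q[shape_chosen], reward[t, n], alpha, pEncode);`. One thing to keep in mind with this exact form: alpha and pEncode only enter as a product, so the data cannot separate them unless the weight is made to vary across trials.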

Thank you for the suggestion! I realize I left out one detail: I only want to apply random noise at the beginning of the experiment, i.e. it takes some attempts for a subject to make the first value update for each shape.

After the initial update for each shape, I was hoping to run q-learning as usual.

In that case you could just scale the noise by something like 1/trial_number.
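One way to read that suggestion, as a rough sketch only: let the weight on the update start at pEncode and move towards 1 as the counter grows, so later updates are ordinary Q-learning. The function name encode_weight and the exact formula are assumptions for illustration, and k could be the trial number or (my assumption, given the goal of noise before each shape's first update) the number of times the chosen shape has been selected so far.

```stan
functions {
  // Hypothetical weight on the Q-update at counter value k (k >= 1).
  // k = 1 returns pEncode; as k grows the (1 - pEncode) penalty shrinks
  // like 1 / k and the weight approaches 1, i.e. plain Q-learning.
  real encode_weight(real pEncode, int k) {
    return 1 - (1 - pEncode) / k;
  }
}
```

The result would multiply the prediction error in the update, e.g. `q[shape_chosen] += encode_weight(pEncode, k) * alpha * (reward[t, n] - q[shape_chosen]);`, where k has to be tracked in the trial loop.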

An alternative (and I think better) way is to directly model the learning rate as a function of time and of the uncertainty of the learned Q-values. (An adaptive learner should value information when uncertainty is high, i.e. also at the start of the experiment. It’s not clear to me why people should wait to start learning. What they should be doing is to weight exploration over exploitation at the start of the experiment, but this would be modeled by making the dependency of choices on learned values, i.e. the temperature, choice dependent.)

More generally: The tricky thing is to parameterize the model correctly and to choose sensible priors that result in learning processes that are reasonable. Hence I recommend doing prior predictive checks on the model.
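As a starting point for such a check, here is a small standalone program that only draws from the priors and transformations in the posted model, for one simulated subject. It is a sketch under a few assumptions: the lower bound of 0.001 on the sd parameters is ignored, and p_right_diff1 is a quantity I made up to make the implied choice behaviour visible. Since it has no parameters block, it would be run with the fixed_param sampler.

```stan
generated quantities {
  // Population-level draws from the priors in the posted model.
  // sqrt(square(.)) takes the absolute value, giving a half-normal draw.
  real alpha_mu = normal_rng(0, 1);
  real alpha_sd = sqrt(square(normal_rng(0, 1)));
  real beta_mu = normal_rng(0, 1);
  real beta_sd = sqrt(square(normal_rng(0, 1)));
  real pEncode_mu = normal_rng(0, 1);
  real pEncode_sd = sqrt(square(normal_rng(0, 1)));

  // One simulated subject on the natural scale, using the same
  // transformations as the model block.
  real alpha = Phi_approx(normal_rng(alpha_mu, alpha_sd));
  real beta = 3 * Phi_approx(normal_rng(beta_mu, beta_sd));
  real pEncode = Phi_approx(normal_rng(pEncode_mu, pEncode_sd));

  // Implied probability of choosing the right-hand shape when its value
  // is 1 point higher than the left-hand shape's value.
  real p_right_diff1 = inv_logit(beta * 1.0);
}
```

Histograms of alpha, beta, pEncode and p_right_diff1 across draws show what the priors imply before any data are seen. A fuller check would also simulate choices and Q-value trajectories over the 1584 trials.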

Thank you for all the advice! Trying out different smoother functions resolved the convergence issues I was running into and I’ll think more about the alternatives you suggested.
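For anyone who finds this later: the thread does not record which functions were tried, so the following is only one common option, shown for illustration. A logistic curve can stand in for the hard comparison of the noise against pEncode. The function name soft_gate and the sharpness argument tau are hypothetical.

```stan
functions {
  // Smooth stand-in for the hard test (u < pEncode): close to 1 when u is
  // well below pEncode, close to 0 when u is well above it, and exactly
  // 0.5 at u == pEncode. Smaller tau > 0 gives a sharper transition.
  real soft_gate(real u, real pEncode, real tau) {
    return inv_logit((pEncode - u) / tau);
  }
}
```

The gate value would multiply the prediction error instead of selecting a branch, so the log density changes continuously with pEncode. How to carry the "already updated once" state (pflag in the model above) forward without reintroducing a discrete switch is left open in this sketch.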