Simple reinforcement learning model fails to initialize - Reproducible Example

We have a dataset in which participants completed a learning and decision-making task: on each trial they chose between symbol options that differed in reward and effort. We're trying to characterise their behaviour with several parameters, such as the learning rate.

I've attached an example dataset below with the first 10 values of each variable for the first participant: master_result - Copy.csv (981 Bytes)

We're attempting to fit the following reinforcement learning model using rstan.


data {
  int ntr;          // number of trials
  int nsub;         // total number of participants 
  int ParticipantID[ntr];  // which subject (schedule) each data point belongs to
  real Effort_left[ntr]; //z-score all real variables
  real Effort_right[ntr];
  real Prob_left[ntr];
  real Prob_right[ntr];
  real Rew_left[ntr];
  real Rew_right[ntr];
  int Symbol_L[ntr];
  int Symbol_R[ntr];
  int Button[ntr];
  int Trial[ntr];//trial number
}

// The parameters accepted by the model.
parameters {
  real<lower=0> invTemp[nsub]; //inverse temperature describing noise for each participant
  real<lower=0,upper=1> b_learningRate_reward[nsub];
  real<lower=0,upper=1> b_learningRate_effort[nsub];
  
  // Transformed version of data variables; we want to know how much 
  // each individual takes into account each of the above variables relative
  // to one of the variables (so we don't get confounded by noise)
  // Will help us identify individual differences:
  real b_rewMag_pr[nsub];
  real b_effortMag_pr[nsub];
}

transformed parameters {
  real k[nsub]; // sum of all absolute regression weights
  real b_rewMag[nsub];
  real b_effortMag[nsub];
  real b_probability[nsub]; //this is our reference variable against which other variables are transformed/weighed

  for (is in 1:nsub){
    k[is] = fabs(1) + fabs(b_rewMag_pr[is])+fabs(b_effortMag_pr[is]);
    b_probability[is]          = -1/k[is]; 
    b_rewMag[is]               = b_rewMag_pr[is]/k[is];
    b_effortMag[is]            = b_effortMag_pr[is]/k[is];
  }
}

// The model to be estimated. We model the output
// 'y' to be normally distributed with mean 'mu'
// and standard deviation 'sigma'.
model {
  //Here the model will learn: reward of symbol1, reward of symbol2, effort of symbol1, effort of symbol2
  real predicted_reward_perSymbol[ntr,2];//matrix of size [ntr,c(symbol1, symbol2)]
  real predicted_effort_perSymbol[ntr,2];//matrix of size [ntr,c(symbol1, symbol2)]
  real prediction_error; //temporary variable we'll keep overwriting
  real util_L[ntr];
  real util_R[ntr];
  real invTxUtil[ntr];//taking into account the invTemp
  
  // Priors
  invTemp ~ cauchy(0,10);//cauchy(0,2)
  // b_learningRate_reward ~ cauchy(0.5,0.3);//center on 0.5 or don't give prior bc range already very constrained
  // b_learningRate_effort ~ cauchy(0.5,0.3);//center on 0.5 or don't give prior bc range already very constrained
  b_rewMag_pr ~ cauchy(0,2);
  b_effortMag_pr ~ cauchy(0,2);

  // Learning loop
  //Calculate prediction error for Reward and Effort
  for (itr in 1:ntr) {
    if (Trial[itr]<max(Trial)) { // only update predictions if this is not the last trial

    if (Trial[itr]==1) { //if this is the first trial, reset the predicted reward
      predicted_reward_perSymbol[itr,1] = 0; //set first reward prediction to 0, which is the mean
      predicted_reward_perSymbol[itr,2] = 0; //set first reward prediction to 0, which is the mean
      predicted_effort_perSymbol[itr,1] = 0; //set first effort prediction to 0, which is the mean
      predicted_effort_perSymbol[itr,2] = 0; //set first effort prediction to 0, which is the mean
    } 
    //Compute prediction error for Left Option
    if (Symbol_L[itr]<3) { //only do if symbol_L is 1 or 2, ignore if 3 or 4
        //Reward
        prediction_error = Rew_left[itr] - predicted_reward_perSymbol[itr,Symbol_L[itr]]; 
        predicted_reward_perSymbol[itr+1,Symbol_L[itr]] = predicted_reward_perSymbol[itr,Symbol_L[itr]]+b_learningRate_reward[ParticipantID[itr]]*prediction_error;
        //Effort
        prediction_error = Effort_left[itr] - predicted_effort_perSymbol[itr,Symbol_L[itr]]; 
        predicted_effort_perSymbol[itr+1,Symbol_L[itr]] = predicted_effort_perSymbol[itr,Symbol_L[itr]]+b_learningRate_effort[ParticipantID[itr]]*prediction_error;
      } else { //if Symbol_L[itr] is 3 or 4
        predicted_reward_perSymbol[itr+1,(3-Symbol_R[itr])] = predicted_reward_perSymbol[itr,(3-Symbol_R[itr])]; //no update
        predicted_effort_perSymbol[itr+1,(3-Symbol_R[itr])] = predicted_effort_perSymbol[itr,(3-Symbol_R[itr])]; //no update  
      }
     
      //Compute prediction error for right option
      if (Symbol_R[itr]<3){ //only do if symbol_R is 1 or 2, ignore if 3 or 4
        //Reward
        prediction_error = Rew_right[itr] - predicted_reward_perSymbol[itr,Symbol_R[itr]]; 
        predicted_reward_perSymbol[itr+1,Symbol_R[itr]] = predicted_reward_perSymbol[itr,Symbol_R[itr]]+b_learningRate_reward[ParticipantID[itr]]*prediction_error;
        //Effort
        prediction_error = Effort_right[itr] - predicted_effort_perSymbol[itr,Symbol_R[itr]]; 
        predicted_effort_perSymbol[itr+1,Symbol_R[itr]] = predicted_effort_perSymbol[itr,Symbol_R[itr]]+b_learningRate_effort[ParticipantID[itr]]*prediction_error;
      } else { //if Symbol_R[itr] is 3 or 4
        predicted_reward_perSymbol[itr+1,(3-Symbol_L[itr])] =  predicted_reward_perSymbol[itr,(3-Symbol_L[itr])]; //no update
        predicted_effort_perSymbol[itr+1,(3-Symbol_L[itr])] = predicted_effort_perSymbol[itr,(3-Symbol_L[itr])]; //no update 
      } 
    }
  }

  //Decision loop
  // compute utilities for choice_l and choice_r (remember special cases for options 3 and 4)
  for (itr in 1:ntr) {
    // Compute utility for left option
    if (Symbol_L[itr]<3){ //only do if symbol_L is 1 or 2
      util_L[itr] = b_rewMag[ParticipantID[itr]]*predicted_reward_perSymbol[itr,Symbol_L[itr]] + b_effortMag[ParticipantID[itr]]*predicted_effort_perSymbol[itr,Symbol_L[itr]]
                  + b_probability[ParticipantID[itr]]*Prob_left[itr];
    } else { //if symbol is 3 or 4
      util_L[itr] = b_rewMag[ParticipantID[itr]]*Rew_left[itr] + b_effortMag[ParticipantID[itr]]*Effort_left[itr]
                  + b_probability[ParticipantID[itr]]*Prob_left[itr];
    }
    
    //Compute utility for right option
    if (Symbol_R[itr]<3) { //only do if symbol_R is 1 or 2
      util_R[itr] = b_rewMag[ParticipantID[itr]]*predicted_reward_perSymbol[itr,Symbol_R[itr]] + b_effortMag[ParticipantID[itr]]*predicted_effort_perSymbol[itr,Symbol_R[itr]]
        + b_probability[ParticipantID[itr]]*Prob_right[itr];
    } else { //if symbol is 3 or 4
      util_R[itr] = b_rewMag[ParticipantID[itr]]*Rew_right[itr] + b_effortMag[ParticipantID[itr]]*Effort_right[itr]
                  + b_probability[ParticipantID[itr]]*Prob_right[itr];
    }
    invTxUtil[itr] = invTemp[ParticipantID[itr]]*(util_R[itr]-util_L[itr]); //(or maybe L-R)
  }

  // link utilities to actual choices
  Button ~ bernoulli_logit(invTxUtil);
}//end model section
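
For completeness, this is roughly how we call the model from R. The file name matches the model name in the error below, but the data frame df and the way we build the data list are just illustrative:

library(rstan)

# Illustrative sketch: assumes a data frame df with columns named like the
# variables in the data block above.
stan_data <- list(
  ntr           = nrow(df),
  nsub          = length(unique(df$ParticipantID)),
  ParticipantID = df$ParticipantID,
  Effort_left   = df$Effort_left,
  Effort_right  = df$Effort_right,
  Prob_left     = df$Prob_left,
  Prob_right    = df$Prob_right,
  Rew_left      = df$Rew_left,
  Rew_right     = df$Rew_right,
  Symbol_L      = df$Symbol_L,
  Symbol_R      = df$Symbol_R,
  Button        = df$Button,
  Trial         = df$Trial
)

fit <- stan(
  file   = "RL_model1_version2.stan",
  data   = stan_data,
  chains = 4,
  iter   = 2000
)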

The initialization errors:

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: bernoulli_logit_lpmf: n[1] is 2, but must be in the interval [0, 1]  (in 'model23f841761eee_RL_model1_version2' at line 185)

Chain 1: 
Chain 1: Initialization between (-2, 2) failed after 100 attempts. 
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Error in sampler$call_sampler(c(args, dotlist)) : Initialization failed.

We have:

  • checked all the values that go into the model (roughly as sketched below); there are no NaN values
  • defined all our variables and added priors and constraints to them
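
For instance, the NaN check is along these lines (assuming the data list passed to stan() is called stan_data; the name is illustrative):

# Sanity checks on the data list
sapply(stan_data, function(x) any(is.na(x)))       # should all be FALSE
sapply(stan_data, function(x) any(!is.finite(x)))  # also catches Inf/-Inf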

It looks like the model is computing the way it should, but for some reason the bernoulli_logit statement in the final line won't accept the vector of values. We want to know why initialization fails, even though we've properly defined and constrained all of our parameters.

Any help would be greatly appreciated, thank you!


It looks like 'Button' is coded as 1 or 2, but bernoulli_logit needs outcomes coded as 0 or 1. You'll want to recode that variable into that range (I think so that the right choice is 1 and the left choice is 0, since your utility difference is util_R - util_L).
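
Assuming your data list in R is called stan_data (a guess on my part), something like this before calling stan() should do it:

# Recode Button from {1, 2} to {0, 1}; assuming 1 = left and 2 = right,
# so that 1 now means the right option was chosen (matching util_R - util_L).
stan_data$Button <- stan_data$Button - 1
table(stan_data$Button)  # should now contain only 0 and 1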

I am also a bit confused by what’s happening in the transformed parameters block, but the above suggestion should fix your posted issue.
