Convergence issues in multilevel Q-learning model (epsilon-greedy)

Dear all
I am analyzing data from a study where two groups of children performed a short 2-armed bandit task game. I am analyzing the data with a Q-learning model and I am using a multilevel model to pool together data from different children.

In brief, the model (for a single individual) work as follow: at each trial t the subjects make a choice between two options c_t \in \left\{1,2 \right\} and observe a “reward” r_t \in \left\{-1,0,1 \right\} (i.e. they could either win or lose one point). Gain and losses are obtained in different trials, signalled by a contextual cue (e.g. in “loss” trials the possible ‘rewards’ ares r_t^- \in \left\{-1,0 \right\}; in gain trials the possible reward are r_t^+ \in \left\{0,1 \right\}). The model assumes that subjects maintain and update estimate of the value (the estimated expected reward) of each choice option (the so-called Q-values). The Q-values are updated according to

Q_{t+1} \left( c \right) = Q_{t} \left( c \right) + \eta \cdot \delta_t

where \eta is the learning rate and \delta_t is the reward prediction error, calculated as

\delta_t = r_t - Q_{t} \left( c \right)

The updating is done separately for positive and negative feedbacks with possibly different learning rates \eta^+ and \eta^-.
In a first version of the model a logistic sigmoid (softmax) is used to values to choice probabilities

p \left( c_t=1 \right) = \frac{1}{1+e^{-\beta \left[Q_t\left( 1\right) -Q_t\left( 2\right)\right]}}

where \beta is an “inverse temperature” parameter that control the randomness of the choices.
Thus, leaving the hyperparameters aside, the model has 4 free parameters per subject, \eta^+, \eta^-, \beta^+, \beta^- (two learning rates and two inverse temperatures).

I programmed a hierarchical version of this softmax model, which worked just fine and converged nicely (no divergent transitions, all \hat R resulted essentially 1). The model predictions match well the data, and the parameters are in agreement with what found in the literature.

I want now to test an alternative model, which differs only in the function that gives the choice probabilities: the softmax should be replaced with an \epsilon-greedy function:

p \left( c_t=1 \right) = \begin{cases} 1-\epsilon,& \text{if } Q_t\left( 1\right) -Q_t\left( 2\right)\geq 0\\ \epsilon, & \text{otherwise} \end{cases}

This model has also 4 parameters per subject \eta^+, \eta^-, \epsilon^+, \epsilon^- (learning rates and “randomness” parameters).
I have only changed few lines in the code for the softmax model, but I run into serious convergence issues: the chains do not converge, as indicated by the very large \hat R values. I don’t understand what I might be doing wrong. I paste the code below; the lines that I have changed with respect to the softmax model are marked with [###] in the comments. Please let me know if you spot any mistake or have any suggestions for how to fix this.



functions {
  real signum(real x) {
    return x < 0 ? -1 : x > 0;
data {
  int<lower=1> J;     //subjects
  int<lower=1> N[J];  // trials x subject
  int<lower=0,upper=1> group[J]; // group
  int<lower=1> maxN;    // max number of trials 
                        // (variable, some subject who did less trials
                        // hence the -99 values in the matrices below, indicating missing data)
  int<lower=-99,upper=2> type[maxN,J];   // context (1=win, 2=loss)
  int<lower=-99,upper=1> C[maxN,J];       // choice (of option with high reward prob)
  int<lower=-99,upper=1> R[maxN,J];       // feedback (positive or negative)

transformed data {
  real initQ[2,2];  // initial Q values for the 4 options
  int Nobs; // number of total observations
  initQ = rep_array(0, 2, 2);
  Nobs = sum(N);

parameters {
  vector[4] mu_p;     // mean parameters baseline group [eta+, eta-, epsilon+, epsilon-]
  vector[4] D_group; // group-differences [eta+, eta-, epsilon+, epsilon-]
  vector<lower=0>[4] sigma_u;   // random effects standard deviations
  cholesky_factor_corr[4] L_u;  // Choleski factor of the correlation matrix
  matrix[4,J] z_u;              // random effect matrix

transformed parameters {
  // calculate subject-level parameters
  matrix[4,J] u;
  matrix[4,J] par_i;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;
  for (i in 1:J) {
    par_i[1,i] = Phi_approx(mu_p[1] + group[i]*D_group[1] + u[1,i]);
    par_i[2,i] = Phi_approx(mu_p[2] + group[i]*D_group[2] + u[2,i]);

    par_i[3,i] = Phi_approx(mu_p[3] + group[i]*D_group[3] + u[3,i]); // [###] THIS TWO lines changed
    par_i[4,i] = Phi_approx(mu_p[4] + group[i]*D_group[4] + u[4,i]); // [###] (contrary to the beta of the softmax the epsilon are bounded between 0 and 1)

model {
  // hyper parameters
  L_u ~ lkj_corr_cholesky(2); // LKJ prior for the correlation matrix
  mu_p[1:2]  ~ normal(0, 2);
  mu_p[3:4]  ~ normal(-2, 2); // [###]  updated prior for the epsilon parameters 
 sigma_u ~ cauchy(0, 1);     // SD of random effects (vectorized)
  // individual parameters
  to_vector(z_u) ~ normal(0,1); // before Cholesky, random effects are normal variates with SD=1
  // between-group differences
  D_group  ~ normal(0, 2);

  // subject loop and trial loop
  for (i in 1:J) {
    real Q[2,2];  // Q values (expected value)
    real VD;      // value difference
    real PE;      // prediction error
    Q = initQ;
    for (t in 1:N[i]) {    
      VD = Q[type[t,i],2] - Q[type[t,i],1];
      // prediction error
      PE = R[t, i] - Q[type[t,i], C[t,i]+1];
      // [###] choice probability (epsilon-greedy)
      C[t,i] ~ bernoulli( (signum(VD)+1)/2 -signum(VD)*par_i[type[t,i]+2,i]  );  // [###] 
      // I also tried a different implementation with the conditional operator: 
      //C[t,i] ~ bernoulli(VD>0 ? 1-par_i[type[t,i]+2,i] : par_i[type[t,i]+2,i] );
      // Q value updating 
      Q[type[t,i], C[t,i]+1] = Q[type[t,i], C[t,i]+1] + par_i[type[t,i],i] * PE; 

I think the problem is that the greedy function introduces a discrete jump in the log likelihood and Stan’s algorithm expects a smooth log likelihood. The discrete jump is around Q_t(1) - Q_t(2) = 0 where the bernouilli probability for p(c_t = 1) jumps from \epsilon to 1-\epsilon. Ideally, you would marginalize out this jump with a finite mixture model. See the user guide here and here.

I am currently not sure how to implement this because the mixture probability would have to be a smooth function of Q_t(1) - Q_t(2). A sigmoid with \beta >>>0 would be a continuous approximation of the discrete jump but then you are basically back to the original model.


Many thanks for the reply! That was also my suspicion, that the problem was related to the ‘discreteness’ of the greedy function. I’ll read through the links.

I also like the idea of using a fixed, very large value of \beta - if that works. It would not be exactly the same model as the original one because with the greedy function the probability of “exploratory” choices would remain constant, whereas it tends to decreases during learning with the softmax.

1 Like