Simple reinforcement learning model with a dynamic learning rate

Hi,

I’m new to Stan (and modeling in general) and managed to build a simple reinforcement learning model with a learning rate and an inverse temperature. Now I’m trying to make the learning rate dynamic based on some literature, but I don’t think I implemented it correctly. Could someone look at my code and tell me what I need to change? I want eta (the learning rate) as an output so I can get a posterior for it, but I don’t know how to do that, because right now I’m only computing it rather than drawing it from any distribution. Basically, I want the learning rate to be computed as eta = 1 / (epsilon + Nobservedoutcomes[t,stim]), where epsilon is a free parameter that indicates the initial learning rate and Nobservedoutcomes counts how many times an outcome of that stimulus was observed on previous trials.
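To spell out what I mean (this is just the rule from the code below, written as equations): on trial $t$, with chosen stimulus $s$ and outcome $r_t$,

$$
\eta_t = \frac{1}{\epsilon + N_t(s)}, \qquad
v_{t+1}(s) = v_t(s) + \eta_t \, \big(r_t - v_t(s)\big),
$$

so $\eta_t$ equals $1/\epsilon$ when no outcomes have been observed yet and shrinks as more outcomes of that stimulus are observed.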

Also, I get warning messages:
1: There were 2913 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. [I’m currently rerunning the model with adapt_delta = 0.99, but it’s taking ages.]
See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems
3: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess


data {
  int<lower=0> t;                           // number of trials
  int<lower=0> N;                           // number of stimuli
  int<lower=0, upper=1> outcome[t];         // win or loss
  int<lower=0, upper=1> choice[t];          // choice data
  real<lower=0, upper=1> feedback[t];       // feedback or no feedback
  int stim1[t];                             // left stimuli
  int stim2[t];                             // right stimuli
  int stimchosen[t];                        // chosen stimuli
  int Nobservedoutcomes[t,N];               // number of previously observed outcomes for each stimulus
}
parameters {
  real betatransformed;
  real<lower=0> epsilon;                    // initial value of learning rate; must be positive to match the gamma(1,1) prior
}
transformed parameters {
  real<lower=0> beta;
  beta = exp(betatransformed);              // inverse temperature
}
model {

 // int y[t] ;                              // not needed: choice is the data (y) here
  real theta ;                              // probability of choosing the right stimulus
  real PE ;                                 // prediction error
  int stim ;                                // index of the chosen stimulus
  real v[N] ;                               // stimulus values
  real eta ;                                // dynamic learning rate

  for (i in 1:N) {
    v[i] = 0.5 ;                            // initial stimulus value
  }

  // prior distribution
  betatransformed ~ normal(0,2) ;
  epsilon ~ gamma(1,1) ;

  // trial loop
  for (i in 1:t) {
    // decision probability: theta = prob of choosing the right (stim2) stimulus
    // (logistic form of the two-option softmax; equivalent to exp()/sum(exp()) but numerically more stable)
    theta = inv_logit(beta * (v[stim2[i]] - v[stim1[i]])) ;
    choice[i] ~ bernoulli(theta) ;

    stim  = stimchosen[i] ;

    // only update model after feedback trials
    if (feedback[i] == 1) {
      // prediction error
      PE = outcome[i] - v[stim] ;

      // value updating (learning)
      eta = 1 / (epsilon + Nobservedoutcomes[i,stim]) ;  // dynamic learning rate, indexed by the current trial i (not t)
      v[stim] = v[stim] + eta * PE ;
    }
  }
}
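To make the second part of my question concrete: is something like a generated quantities block the right way to get a posterior for the per-trial learning rate? This is only a rough sketch of what I imagine (eta_out is just a name I made up), so I’m not sure it’s right:

generated quantities {
  real eta_out[t];                          // per-trial learning rate, saved with each posterior draw
  for (i in 1:t)
    // same formula as in the model block, indexed by the current trial i
    eta_out[i] = 1 / (epsilon + Nobservedoutcomes[i, stimchosen[i]]);
}

Since eta only depends on epsilon and the fixed counts in Nobservedoutcomes, I guess its posterior would really just be a transformation of the posterior for epsilon.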

Sorry, I’m short on time. Maybe @imadmali is not busy and can answer?

You can close the topic, I solved it. I didn’t know how to delete it. Thanks!

It’d be helpful if you could post the solution you found.


Hey Zora,

I have a similar problem and I’m interested in your solution, too. If you read this, it would be nice if you could add it to this post :)

Thank you!
