# Simple reinforcement learning with dynamic learning

Hi,

I’m new to Stan (and modeling in general). I managed to build a simple reinforcement learning model with a learning rate and an inverse temperature, and now I’m trying to make the learning rate dynamic, following some literature. I think I didn’t implement it correctly; could someone look at my code and tell me what I need to change? I wanted to have eta (the learning rate) as an output and get a posterior for it, but I don’t know how to do that, because right now I’m just computing it rather than drawing it from any distribution. Basically, I want the learning rate to be computed as `eta = 1 / (epsilon + Nobservedoutcomes[t,stim]);`, where `epsilon` is a free parameter that sets the initial learning rate and `N` is how many times an outcome of the stimulus was observed on previous trials.
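For intuition, here is a minimal Python sketch of that decay rule (not the Stan model itself; the parameter values are made up for illustration). The rate starts at `1 / epsilon` before any outcome has been observed and shrinks as evidence accumulates, so early trials move the value estimate more than later ones:

```python
def eta(epsilon, n_observed):
    """Dynamic learning rate after n_observed previously seen outcomes.

    epsilon is a free positive parameter controlling the initial rate:
    smaller epsilon means a larger learning rate on the first trials.
    """
    return 1.0 / (epsilon + n_observed)

# With epsilon = 1 the rate starts at 1.0 and decays: 1.0, 0.5, 1/3, 0.25
rates = [eta(1.0, n) for n in range(4)]
```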

Also, I get warning messages:
1: There were 2913 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. (I’m currently rerunning with adapt_delta = 0.99, but it’s taking ages.)
See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: Examine the pairs() plot to diagnose sampling problems
3: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess

```stan
data {
  int<lower=0> t;                           // number of trials
  int<lower=0> N;                           // number of stimuli
  int<lower=0, upper=1> outcome[t];         // win or loss
  int<lower=0, upper=1> choice[t];          // choice data
  int<lower=0, upper=1> feedback[t];        // feedback or no feedback (0/1 indicator)
  int stim1[t];                             // left stimuli
  int stim2[t];                             // right stimuli
  int stimchosen[t];                        // chosen stimuli
  int Nobservedoutcomes[t, N];              // number of previously observed outcomes per trial and stimulus
}
parameters {
  real betatransformed;
  real<lower=0> epsilon;                    // initial value of learning rate; must be
                                            // constrained positive to match the gamma prior
}
transformed parameters {
  real<lower=0> beta;
  beta = exp(betatransformed);              // inverse temperature
}
model {
  // choice is the data (y) here, so no separate y declaration is needed
  real theta;                               // probability of choosing the right stimulus
  real PE;                                  // prediction error
  int stim;                                 // index of the chosen stimulus
  real v[N];                                // stimulus values
  real eta;                                 // dynamic learning rate

  for (i in 1:N) {
    v[i] = 0.5;                             // initial stimulus value
  }

  // prior distributions
  betatransformed ~ normal(0, 2);
  epsilon ~ gamma(1, 1);

  // trial loop
  for (i in 1:t) {
    // decision probability: theta = prob of choosing the right stimulus
    theta = exp(beta * v[stim2[i]]) / (exp(beta * v[stim2[i]]) + exp(beta * v[stim1[i]]));
    choice[i] ~ bernoulli(theta);

    stim = stimchosen[i];

    // only update the model after feedback trials
    if (feedback[i] == 1) {
      // prediction error
      PE = outcome[i] - v[stim];

      // dynamic learning rate; note the per-trial index i, not the total t
      eta = 1 / (epsilon + Nobservedoutcomes[i, stim]);

      // value updating (learning)
      v[stim] = v[stim] + eta * PE;
    }
  }
}
```

Sorry, I’m short on time. Maybe @imadmali isn’t busy and can answer?

You can close the topic, I solved it. I didn’t know how to delete it. Thanks!

It’d be helpful if you could post the solution you found.


Hey Zora,

I have a similar problem and I’m interested in your solution, too. If you read this, it would be nice if you could add it to this post :)

Thank you!
