# How to avoid overestimating a parameter bounded below by 0?

Hello, I am trying to fit a parameter called the learning rate of a reinforcement learning model (Q-learning). The learning rate is bounded between 0 and 1.

I am using data from rats' choices and replays (a brain phenomenon whose definition I hope I can skip for the purposes of this question) to fit the model parameter and check whether the learning rate for replay is different from zero.
fit_replays_1subj_simple.stan (1.8 KB)

The problem is that when I check whether parameter recovery works, by generating data in Python and then estimating the parameter with Stan, I see a consistent overestimation of the learning rate (alphaR), especially when the true learning rate used to generate the synthetic data is zero.

Am I doing something wrong? Please let me know if I should provide any further information.

I have tried both a symmetric flat prior on [-0.5, 0.5] and an asymmetric one on [0, 1], but both still overestimated the parameter.

A negative learning rate may be problematic in Q-learning because it makes Q grow exponentially, so I am not sure the symmetric option is a good one.
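To make the concern about negative values concrete, here is a minimal Python sketch. It is not taken from the attached model: the function names `iterate_q` and `closed_form` are hypothetical, and the reward is held constant purely for illustration. Under that assumption the update `Q += alpha * (r - Q)` has the closed form `Q_t = r + (1 - alpha)^t * (Q_0 - r)`, so for `alpha < 0` the factor `1 - alpha` exceeds 1 and the distance from `r` grows geometrically.

```python
def iterate_q(alpha, reward, q0, n_steps):
    """Apply Q += alpha * (reward - Q) repeatedly with a constant reward."""
    q = q0
    trajectory = [q]
    for _ in range(n_steps):
        q += alpha * (reward - q)
        trajectory.append(q)
    return trajectory


def closed_form(alpha, reward, q0, t):
    """Closed-form value of Q after t updates with a constant reward."""
    return reward + (1.0 - alpha) ** t * (q0 - reward)


# Illustrative values only: a negative and a positive learning rate.
diverging = iterate_q(alpha=-0.2, reward=1.0, q0=0.0, n_steps=20)
converging = iterate_q(alpha=0.2, reward=1.0, q0=0.0, n_steps=20)
print("alpha = -0.2, final Q:", diverging[-1])
print("alpha = +0.2, final Q:", converging[-1])
```

With a positive learning rate the same recursion shrinks the distance to `r` at every step, which is why only the negative half of a symmetric range is a concern.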

Another idea is to have a mass of 50% at zero, and the rest from 0 to 1. Would that work better? If yes, how would I do it?
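For reference, here is a rough Python sketch of how a prior with a point mass at zero could be evaluated. Stan cannot sample a discrete "is it exactly zero" indicator, so such a mixture is normally marginalized: the posterior probability of the point mass compares the likelihood at 0 with the likelihood averaged over the rest of the range. Everything here is an assumption for illustration: the Uniform(0, 1) "slab", the grid integration, the function names, and especially `toy_log_lik`, which is a made-up stand-in for the real Q-learning likelihood.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def spike_posterior_prob(log_lik, prior_spike=0.5, n_grid=1000):
    """Posterior probability that alpha == 0 under a spike-and-slab prior.

    Prior: mass `prior_spike` at alpha = 0, the rest spread as Uniform(0, 1).
    The slab evidence (integral of the likelihood over (0, 1)) is
    approximated with the midpoint rule on `n_grid` cells.
    """
    log_spike = math.log(prior_spike) + log_lik(0.0)
    midpoints = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_slab_evidence = log_sum_exp([log_lik(a) for a in midpoints]) - math.log(n_grid)
    log_slab = math.log(1.0 - prior_spike) + log_slab_evidence
    return math.exp(log_spike - log_sum_exp([log_spike, log_slab]))


def toy_log_lik(alpha, successes=50, failures=50):
    """HYPOTHETICAL likelihood: Bernoulli data with success prob 0.5 + 0.4 * alpha."""
    p = 0.5 + 0.4 * alpha
    return successes * math.log(p) + failures * math.log(1.0 - p)


# Data consistent with alpha = 0 versus data consistent with a large alpha.
print(spike_posterior_prob(toy_log_lik))
print(spike_posterior_prob(lambda a: toy_log_lik(a, successes=90, failures=10)))
```

The result is a posterior probability that the parameter is exactly zero, rather than a point estimate of the parameter. That may be closer to the actual question of whether the replay learning rate differs from zero.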

Thank you for your time, attention and energy.

Gratefully,
Homero

In general, regardless of the reinforcement learning framework here, how do we avoid overestimation of a parameter when its true value lies at the border of its range?

Hi Homero,
The learning rate is highly correlated with the inverse-temperature parameter (the regression coefficient in logistic regression) in a standard reinforcement learning model if the reward probability in your task is fixed. Is the inverse-temperature parameter underestimated? If the reward probability reverses at some point in your design, it becomes much easier to distinguish these two parameters (you can verify this by varying the two parameters: the simulated data will differ much more).
Reward probability in rat studies is usually quite high and kept fixed for the whole study, and the number of trials is much larger than in human studies. In that case you barely see any learning effect (learning only happens at the beginning of training or of the task), which decreases the accuracy of the parameter estimation.
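A minimal Python sketch of this point, under stated assumptions: a two-armed bandit with a softmax choice rule, illustrative values for the learning rate, inverse temperature and reward probabilities, and hypothetical function names (`softmax_choice`, `simulate_bandit`). It records the absolute prediction error on each trial, so the fixed design can be compared with a design that reverses the reward probabilities once.

```python
import math
import random


def softmax_choice(q, beta, rng):
    """Sample an arm index with probability softmax(beta * q)."""
    m = max(q)
    weights = [math.exp(beta * (v - m)) for v in q]
    return rng.choices(range(len(q)), weights=weights)[0]


def simulate_bandit(alpha, beta, reward_probs, n_trials, reversal_at=None, seed=0):
    """Simulate Q-learning; optionally reverse the reward probabilities once."""
    rng = random.Random(seed)
    probs = list(reward_probs)
    q = [0.0] * len(probs)
    choices, rewards, abs_pe = [], [], []
    for t in range(n_trials):
        if reversal_at is not None and t == reversal_at:
            probs.reverse()  # contingencies flip from this trial onward
        c = softmax_choice(q, beta, rng)
        r = 1 if rng.random() < probs[c] else 0
        pe = r - q[c]  # prediction error drives the update
        q[c] += alpha * pe
        choices.append(c)
        rewards.append(r)
        abs_pe.append(abs(pe))
    return choices, rewards, abs_pe


# Illustrative comparison of late-task prediction errors in the two designs.
_, _, pe_fixed = simulate_bandit(0.3, 3.0, [0.9, 0.1], 400)
_, _, pe_reversal = simulate_bandit(0.3, 3.0, [0.9, 0.1], 400, reversal_at=300)
print("fixed    :", sum(pe_fixed[300:]) / 100)
print("reversal :", sum(pe_reversal[300:]) / 100)
```

Rerunning the simulation over a grid of learning rates and inverse temperatures, then comparing the resulting choice sequences, is one way to see how distinguishable the two parameters are in each design.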

Welcome to the community!

To accompany @mingqian.guo's more technical feedback, here are a few practical suggestions from someone who is not familiar with these sorts of models.

First, could you share the data generation code? That might reveal some discrepancies between the data generation and model.
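For comparison, here is a hedged sketch of what a matching generator might look like in Python. It is not the original poster's code. It assumes the event order used in the Stan model below (choice, then replay updates, then the direct update), a fixed inverse temperature of 3 and a direct learning rate of 0.5. The rest is made up for illustration: replayed arms are drawn uniformly, replay rewards are Bernoulli draws, the reward probabilities are arbitrary, and unused replay slots are padded with arm 1 and reward 0. The dictionary keys follow the names in the Stan `data` block.

```python
import math
import random


def generate_replay_data(alpha_r, n_trials=200, n_actions=8, max_nr=5,
                         beta=3.0, alpha_d=0.5, reward_probs=None, seed=0):
    """Simulate one subject and one session; return a dict shaped like the Stan data."""
    rng = random.Random(seed)
    if reward_probs is None:
        # HYPOTHETICAL task: one good arm, the others poor.
        reward_probs = [0.8] + [0.2] * (n_actions - 1)
    q = [0.0] * n_actions
    data = {"NT": n_trials, "max_NR": max_nr, "N_ACTIONS": n_actions,
            "NR": [], "rp_arms": [], "rp_rwd": [], "c": [], "r": []}
    for _ in range(n_trials):
        # 1. Choice from softmax(beta * Q), before any update on this trial.
        m = max(q)
        weights = [math.exp(beta * (v - m)) for v in q]
        c = rng.choices(range(n_actions), weights=weights)[0]
        r = 1 if rng.random() < reward_probs[c] else 0
        # 2. Replay updates with learning rate alpha_r (assumed replay content).
        nr = rng.randint(1, max_nr)
        arms = [rng.randrange(n_actions) for _ in range(nr)]
        rwds = [1 if rng.random() < reward_probs[a] else 0 for a in arms]
        for a, rw in zip(arms, rwds):
            q[a] += alpha_r * (rw - q[a])
        # 3. Direct Q-learning update for the chosen arm.
        q[c] += alpha_d * (r - q[c])
        # Store with 1-based arm indices; pad unused replay slots.
        pad = max_nr - nr
        data["NR"].append(nr)
        data["rp_arms"].append([a + 1 for a in arms] + [1] * pad)
        data["rp_rwd"].append(rwds + [0] * pad)
        data["c"].append(c + 1)
        data["r"].append(r)
    return data


sim = generate_replay_data(alpha_r=0.0, n_trials=100, seed=42)
print(sim["NT"], len(sim["c"]), sim["NR"][:5])
```

Things to compare against the real generator are the order of the three steps, 0-based versus 1-based arm indices, and whether the values of `beta` and the direct learning rate match the ones fixed in the model.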

Second, I would suggest you start by simplifying the model code as much as possible to make it clearer where any issues may be arising. For example, it looks like the data are…

1. replays…
2. nested in trials…
3. nested in sessions…
4. nested in subjects.

It looks like you are already analyzing only the first subject by hard-coding 1 in several places (e.g. `c[1,ss,t] ~ categorical_logit(3 * Q);`). Could you start with a model that just looks at one subject and one session, iterating over trials and replays? I’ve tried this below, but double-check my work!

Third, the problem might be clearer if you’re able to see how `Q` changes across replays and trials. I’ve tried to do that in the code below (again, check my work).

Good luck!

```stan
data {
  int<lower = 1> NT;                                            // number of trials
  int<lower = 1> max_NR;                                        // max number of replays across all trials
  array[NT] int<lower = 1, upper = max_NR> NR;                  // number of replays on each trial
  int<lower = 1> N_ACTIONS;                                     // number of arms
  array[NT, max_NR] int<lower = 1, upper = N_ACTIONS> rp_arms;  // replayed arms {1,...,N_ACTIONS}
  array[NT, max_NR] int rp_rwd;                                 // rewards assigned to replayed arms
  array[NT] int<lower = 1, upper = N_ACTIONS> c;                // arm choices {1,...,N_ACTIONS}
  array[NT] int<lower = 0, upper = 1> r;                        // reward {0,1}
}
transformed data {
  int NR_total = 0;   // total number of replays
  int NQ;             // number of stored sets of Q

  for (t in 1:NT) {
    NR_total += NR[t];
  }

  // One entry for the initial Q, one per replay, one per trial update
  NQ = NR_total + 1 + NT;
}
parameters {
  real alphaRm;
}
model {
  // Q-values start at zero; sized by N_ACTIONS instead of a hard-coded 8
  vector[N_ACTIONS] Q = rep_vector(0, N_ACTIONS);
  real alphaR = Phi_approx(alphaRm) - 0.5;  // maps alphaRm to (-0.5, 0.5)

  alphaRm ~ normal(0, 1);

  for (t in 1:NT) {       // loop over trials
    // Choice (softmax)
    c[t] ~ categorical_logit(3 * Q);  // fixed beta=3 TO MAKE IT EASIER

    // Replay updates
    for (rp_i in 1:NR[t]) {
      Q[rp_arms[t, rp_i]] += alphaR * (rp_rwd[t, rp_i] - Q[rp_arms[t, rp_i]]);
    }

    // Q-learning
    Q[c[t]] += 0.5 * (r[t] - Q[c[t]]);  // fixed alphaD=0.5 TO MAKE IT EASIER
  }
}
generated quantities {
  real alphaRm_phied = Phi_approx(alphaRm) - 0.5;
  array[NQ] vector[N_ACTIONS] Q_set;  // Q after every update
  array[NQ] vector[N_ACTIONS] P_set;  // probability of selecting each arm

  {
    vector[N_ACTIONS] Q = rep_vector(0, N_ACTIONS);
    int count = 1;
    real alphaR = alphaRm_phied;
    Q_set[count] = Q;  // first entry: initial Q

    for (t in 1:NT) {       // loop over trials
      for (rp_i in 1:NR[t]) {
        Q[rp_arms[t, rp_i]] += alphaR * (rp_rwd[t, rp_i] - Q[rp_arms[t, rp_i]]);
        count += 1;
        Q_set[count] = Q;
      }

      // Q-learning
      Q[c[t]] += 0.5 * (r[t] - Q[c[t]]);  // fixed alphaD=0.5 TO MAKE IT EASIER
      count += 1;
      Q_set[count] = Q;
    }
  }

  for (n in 1:NQ) {
    P_set[n] = softmax(3 * Q_set[n]);  // same fixed beta=3 as in the model block
  }
}
```
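As a cross-check on the `Q_set` bookkeeping, the same sequence of updates can be replayed outside Stan. The Python sketch below is an assumption-laden mirror of the generated quantities block: it follows the same order (initial Q, then each replay update, then the direct update with the learning rate fixed at 0.5). The helper name `q_trajectory` and the tiny hand-made data set are hypothetical.

```python
def q_trajectory(data, alpha_r, alpha_d=0.5):
    """Replay the updates in the same order as the generated quantities block."""
    q = [0.0] * data["N_ACTIONS"]
    q_set = [list(q)]                       # first entry: initial Q
    for t in range(data["NT"]):
        for i in range(data["NR"][t]):      # replay updates come first
            a = data["rp_arms"][t][i] - 1   # Stan indices are 1-based
            q[a] += alpha_r * (data["rp_rwd"][t][i] - q[a])
            q_set.append(list(q))
        a = data["c"][t] - 1                # then the direct update
        q[a] += alpha_d * (data["r"][t] - q[a])
        q_set.append(list(q))
    return q_set


# HYPOTHETICAL toy data: 2 trials, 3 arms, 1 and 2 replays respectively.
toy = {"NT": 2, "N_ACTIONS": 3, "NR": [1, 2],
       "rp_arms": [[2, 1], [3, 3]], "rp_rwd": [[1, 0], [0, 1]],
       "c": [1, 2], "r": [1, 0]}

for step, q in enumerate(q_trajectory(toy, alpha_r=0.0), start=1):
    print(step, q)
```

The returned list should have `NR_total + NT + 1` entries, matching `NQ`. With `alpha_r = 0` the replay steps leave Q unchanged, which gives a quick sanity check for the case where the true replay learning rate is zero.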