# Parameter fitting issues with reinforcement learning modeling

Hi all,

I’m a newcomer to Stan, currently using it for reinforcement learning modeling, and I’ve run into a problem: the fitted inverse temperature (tau) comes out extremely small!

Specifically:

(1) In the softmax choice function, the inverse temperature (tau) represents choice consistency: a higher tau means subjects’ choices depend more strongly on the learned values.

(2) Based on participants’ feedback and a GLMM analysis, it is clear that choices are indeed influenced by value, so I would expect the model to recover a reasonably large tau.

(3) I have set the range for tau to [0, 10]. However, the fitted tau comes out very low, for example around 0.1. I’m quite puzzled and would like to ask where the issue might be.

(4) Current findings and attempts: the group-level mean for tau (tau_mu_raw) is fitted to a negative value. I tried adding “lower=0” to it, but then the model has convergence issues (Rhat > 1.1).

(5) For the sake of readability, below is my code, mainly to illustrate the modeling approach. I would greatly appreciate any other methods or references.

Stan Code:

``````stan
data {
  int<lower=1> nSubjects;                           // number of subjects
  int<lower=1> nTrials;                             // number of trials
  int<lower=1,upper=99> Choice[nSubjects,nTrials];  // 1 = Left Choice, 2 = Right Choice, 99 = Missing
  real<lower=0> Reward_A[nSubjects,nTrials];        // experiment reward feedback A (range: [1,250])
  real<lower=0> Reward_B[nSubjects,nTrials];        // experiment reward feedback B (range: [1,250])
}

transformed data {
  vector[2] RAi = rep_vector(0,2); // initial expected values for Reward_A
  vector[2] RBi = rep_vector(0,2); // initial expected values for Reward_B
  vector[2] EVi = rep_vector(0,2); // initial expected values, where EV = w * Reward_A + (1-w) * Reward_B
}

parameters {
  // group level parameters
  real alphaRA_mu_raw; // learning rate for Reward_A
  real alphaRB_mu_raw; // learning rate for Reward_B
  real w_mu_raw;       // weighting parameter between Reward_A and Reward_B
  real tau_mu_raw;     // inverse temperature in softmax choice function

  real<lower=0> alphaRA_sd_raw;
  real<lower=0> alphaRB_sd_raw;
  real<lower=0> w_sd_raw;
  real<lower=0> tau_sd_raw;

  // individual level parameters - raw space
  vector[nSubjects] alphaRA_raw;
  vector[nSubjects] alphaRB_raw;
  vector[nSubjects] w_raw;
  vector[nSubjects] tau_raw;
}

transformed parameters {
  vector<lower=0,upper=1>[nSubjects] alphaRA;
  vector<lower=0,upper=1>[nSubjects] alphaRB;
  vector<lower=0,upper=1>[nSubjects] w;
  vector<lower=0,upper=10>[nSubjects] tau;

  // convert to the actual parameter space
  for (s in 1:nSubjects) {
    alphaRA[s] = Phi_approx( alphaRA_mu_raw + alphaRA_sd_raw * alphaRA_raw[s] );
    alphaRB[s] = Phi_approx( alphaRB_mu_raw + alphaRB_sd_raw * alphaRB_raw[s] );
    w[s]       = Phi_approx( w_mu_raw       + w_sd_raw       * w_raw[s] );
    tau[s]     = Phi_approx( tau_mu_raw     + tau_sd_raw     * tau_raw[s] ) * 10;
  }
}

model {
  // priors
  alphaRA_mu_raw ~ std_normal();
  alphaRB_mu_raw ~ std_normal();
  w_mu_raw       ~ std_normal();
  tau_mu_raw     ~ std_normal();

  alphaRA_sd_raw ~ cauchy(0,1);
  alphaRB_sd_raw ~ cauchy(0,1);
  w_sd_raw       ~ cauchy(0,1);
  tau_sd_raw     ~ cauchy(0,1);

  alphaRA_raw ~ std_normal();
  alphaRB_raw ~ std_normal();
  w_raw       ~ std_normal();
  tau_raw     ~ std_normal();

  // model
  for (s in 1:nSubjects) {
    vector[2] RA = RAi;
    vector[2] RB = RBi;
    vector[2] EV = EVi;

    for (t in 1:nTrials) {
      if (Choice[s,t] != 99) {
        Choice[s,t] ~ categorical_logit(tau[s] * EV);                          // softmax choice function

        EV[Choice[s,t]] = w[s] * RA[Choice[s,t]] + (1-w[s]) * RB[Choice[s,t]]; // calculate Expected Value

        RA[Choice[s,t]] += alphaRA[s] * (Reward_A[s,t] - RA[Choice[s,t]]);     // Q-learning: Reward_A
        RB[Choice[s,t]] += alphaRB[s] * (Reward_B[s,t] - RB[Choice[s,t]]);     // Q-learning: Reward_B
      } // end of Choice
    } // end of t
  } // end of s
} // end of model

generated quantities {
  // parameters
  real<lower=0,upper=1> alphaRA_mu;
  real<lower=0,upper=1> alphaRB_mu;
  real<lower=0,upper=1> w_mu;
  real<lower=0,upper=10> tau_mu;

  real log_lik[nSubjects];                                            // choice log likelihood
  int y_pred[nSubjects, nTrials] = rep_array(99, nSubjects, nTrials); // choice prediction

  // convert to the actual parameter space
  alphaRA_mu = Phi_approx(alphaRA_mu_raw);
  alphaRB_mu = Phi_approx(alphaRB_mu_raw);
  w_mu       = Phi_approx(w_mu_raw);
  tau_mu     = Phi_approx(tau_mu_raw) * 10;

  // model
  {
    for (s in 1:nSubjects) {
      vector[2] RA = RAi;
      vector[2] RB = RBi;
      vector[2] EV = EVi;
      log_lik[s] = 0;

      for (t in 1:nTrials) {
        if (Choice[s,t] != 99) {
          log_lik[s] += categorical_logit_lpmf(Choice[s,t] | tau[s] * EV);       // choice log likelihood
          y_pred[s,t] = categorical_logit_rng(tau[s] * EV);                      // choice prediction

          EV[Choice[s,t]] = w[s] * RA[Choice[s,t]] + (1-w[s]) * RB[Choice[s,t]]; // calculate Expected Value

          RA[Choice[s,t]] += alphaRA[s] * (Reward_A[s,t] - RA[Choice[s,t]]);     // Q-learning: Reward_A
          RB[Choice[s,t]] += alphaRB[s] * (Reward_B[s,t] - RB[Choice[s,t]]);     // Q-learning: Reward_B
        } // end of Choice
      } // end of t
    } // end of s
  } // end of local block
} // end of generated quantities

``````
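As a sanity check (hypothetical numbers, not my actual data), I also looked at how tau interacts with the scale of EV in categorical_logit(tau * EV). Since the rewards range over [1,250], the value differences between options are large, and multiplying them by even a small tau still produces extreme logits:

```python
import math

def softmax(logits):
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical expected values on the reward scale [1, 250]
EV = [100.0, 150.0]

# A "small" inverse temperature still yields near-deterministic choice,
# because tau multiplies a large value difference (0.1 * 50 = 5 logits).
p_small_tau = softmax([0.1 * v for v in EV])

# The same tau on values with a difference of 0.5 is near-random.
p_scaled = softmax([0.1 * v for v in [1.0, 1.5]])

print(p_small_tau[1])  # ≈ 0.993
print(p_scaled[1])     # ≈ 0.512
```

So a fitted tau of 0.1 on this reward scale can still correspond to fairly deterministic choices; perhaps the raw scale of the rewards is part of why tau looks so small?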

Are you intentionally biasing the posteriors of your standard deviations toward zero? If not, you may want to use lognormal or gamma as your prior on your strictly positive parameters.

A Cauchy centered over zero has a central tendency of approximately zero. Since you are seemingly using uninformative priors, you may want the standard deviations of your normals to be centered over one, with a standard-deviation distribution that has maximum entropy over the supported domain. That’s lognormal(0,1).
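To make the difference concrete, here is a quick comparison of the two densities near zero, using the standard closed-form density formulas (plain Python, no external libraries):

```python
import math

def half_cauchy_pdf(x, scale=1.0):
    # density of |Cauchy(0, scale)|, for x >= 0
    return 2.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

def lognormal_pdf(x, mu=0.0, sigma=1.0):
    # standard lognormal density, for x > 0
    return (1.0 / (x * sigma * math.sqrt(2.0 * math.pi))
            * math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)))

# Near zero, the half-Cauchy is at its mode while the lognormal is ~0.
print(half_cauchy_pdf(0.01))   # ≈ 0.64
print(lognormal_pdf(0.01))     # ≈ 0.001
```

The half-Cauchy has its mode at zero, so it actively favors tiny standard deviations, while lognormal(0,1) puts essentially no mass there and has its median at one.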


Hi @Corey.Plate,
My apologies for the delayed response! Thank you very much for your answer!

As you mentioned, I want to fit the model using uninformative priors. For parameters that must be positive, I’m using ‘<lower=0>’ to constrain them; I’m not quite sure how to use prior distributions as constraints instead.

Following your advice, I set a lognormal(0,1) prior on the standard deviation of each parameter, like this:

``````stan
// group level parameters
real tau_mu_raw;
real<lower=0> tau_sd_raw;

// individual level parameters
vector[nSubjects] tau_raw;

// transformed parameter
vector<lower=0,upper=10>[nSubjects] tau;
tau[s] = Phi_approx( tau_mu_raw + tau_sd_raw * tau_raw[s] ) * 10;

// priors
tau_mu_raw ~ std_normal();
tau_sd_raw ~ lognormal(0,1);
tau_raw ~ std_normal();
``````

but it seems that the program is not responding. Could it be due to my settings? I’m a bit confused. If you could provide the code for the prior settings section, I would greatly appreciate it.
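For reference, while debugging I also checked numerically what the transform on tau does, using the formula Stan documents for Phi_approx (a logistic approximation to the standard normal CDF, inv_logit(0.07056*x^3 + 1.5976*x)):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def phi_approx(x):
    # Stan's logistic approximation to the standard normal CDF
    return inv_logit(0.07056 * x ** 3 + 1.5976 * x)

def tau_of(raw):
    # the transform used in the model: Phi_approx(raw) * 10
    return phi_approx(raw) * 10.0

print(tau_of(0.0))    # 5.0: midpoint of [0, 10]
print(tau_of(-1.0))   # ≈ 1.59
print(tau_of(-2.5))   # ≈ 0.06: strongly negative raw value -> tau near 0
```

So tau_mu_raw = 0 corresponds to tau = 5, and the negative group-level values I observed map to small tau.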

Thank you once again for your assistance!

Are you receiving any error messages from the compiler or any warning messages from the sampler?


Hi @Corey.Plate,
I apologize for my mistake.
Because the parameter lacked a constraint on its sampling range, I received this sampling error:

``````
Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: lognormal_lpdf: Random variable is -0.221171, but must be nonnegative!
``````

Subsequently, I fixed the error, relaunched the program, and this time it ran without any issues and sampled successfully!

Unfortunately, tau was still extremely small, with an average of 0.04. I suspected this might be related to the data, since participants’ choices might tend toward random, but the value was so small that it seemed quite unusual.

I’m beginning to suspect it’s a prior-related issue. Based on previous literature (doi:10.1016/j.neuron.2011.02.027), I’ve tried the following settings:

``````stan
tau ~ gamma(1.2, 5);
alpha ~ beta(1.1, 1.1);
w ~ beta(1.1, 1.1);
``````
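As a quick sanity check on what these priors imply on their own (simple closed-form summaries, assuming Stan’s shape/rate parameterization of the gamma):

```python
# Closed-form summaries of the priors above (shape/rate gamma, as in Stan)
gamma_shape, gamma_rate = 1.2, 5.0
gamma_mean = gamma_shape / gamma_rate           # 0.24
gamma_mode = (gamma_shape - 1.0) / gamma_rate   # 0.04 (valid since shape > 1)

beta_a, beta_b = 1.1, 1.1
beta_mean = beta_a / (beta_a + beta_b)          # 0.5

print(gamma_mean, gamma_mode, beta_mean)
```

Interestingly, the gamma(1.2, 5) mode is 0.04, which matches the small average tau I reported above, so this prior on its own would not pull the estimate away from tiny values.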

To satisfy the support requirements of these distributions, I added range constraints at both the group level and the individual subject level:

``````stan
// group level parameters
real<lower=0,upper=1> w_mu_raw;
real<lower=0> tau_mu_raw;

// individual level parameters
vector<lower=0,upper=1>[nSubjects] w_raw;
vector<lower=0>[nSubjects] tau_raw;

// priors
w_mu_raw ~ beta(1.1,1.1);
tau_mu_raw ~ gamma(1.2,5);

w_sd_raw ~ lognormal(0,1);
tau_sd_raw ~ lognormal(0,1);

w_raw ~ beta(1.1,1.1);
tau_raw ~ gamma(1.2,5);
``````

The boundary constraints resulted in a somewhat worse fit, but that was not the main concern. The primary concern was a strange pattern in the results: all values were concentrated around the center of their ranges:

tau ≈ 5
alpha ≈ 0.5
w ≈ 0.5

Even though my current attempts haven’t been very successful, I believe that, following your suggestions, I can adjust the prior distributions to achieve a better outcome. It would be greatly appreciated if you have any further suggestions for the prior settings.

Once again, thank you for your assistance!

How many subjects are you fitting, and how many observations for each? If you have a lot of subjects with relatively small and noisy data sets per subject, you may want to consider using a non-centered parameterization for your priors so that you can take advantage of both the within-subject and between-subjects variance in the data set. That entails making a guess at the hyperprior over the parameters of the priors and letting those parameters vary for your prior distributions. For example:

``````model{
//hyperpriors
x_mu ~ normal(0,1);
x_sigma ~ lognormal(0,1);
y_mu ~ normal(0,1);
y_sigma ~ lognormal(0,1);

//priors
x ~ normal(x_mu,x_sigma);
y ~ lognormal(y_mu,y_sigma);

//likelihood
z ~ function(x,y);
}
``````

(These are dummy variables and distributions for demonstration purposes)
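(And for completeness, a quick numeric sketch, again with dummy numbers, showing that the centered form x ~ normal(mu, sigma) and the non-centered form x = mu + sigma * z with z ~ normal(0, 1) describe the same distribution:)

```python
import random

random.seed(0)
mu, sigma = -1.0, 0.5   # arbitrary group-level values for illustration
n = 200_000

# Centered form: draw each subject-level value directly
centered = [random.gauss(mu, sigma) for _ in range(n)]

# Non-centered form: draw a standard normal "raw" value and shift/scale it,
# exactly the mu + sd * raw construction used in the model above
noncentered = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

mean_c = sum(centered) / n
mean_nc = sum(noncentered) / n
sd_c = (sum((x - mean_c) ** 2 for x in centered) / n) ** 0.5
sd_nc = (sum((x - mean_nc) ** 2 for x in noncentered) / n) ** 0.5

print(mean_c, sd_c)    # both pairs ≈ -1.0, 0.5
print(mean_nc, sd_nc)
```

Which form samples better depends on how strongly the data inform each subject; the non-centered form typically helps when per-subject data are sparse or noisy.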


Thank you, @Corey.Plate!

The experiment used a within-subject design with 40 participants and 96 trials.
I agree that the non-centered parameterization is promising, as I’ve been gradually approaching the ‘ideal’ parameters through iterative adjustments of the priors.

I apologize for not responding promptly.
Thank you very much for your patience and help!