Parameter fitting issues with reinforcement learning modeling

Hi all,

I’m new to Stan and am currently using it for reinforcement learning modeling. I’ve run into a problem: the fitted inverse temperature (tau) comes out extremely small!

Specifically:

(1) In the softmax choice function, the inverse temperature (tau) represents choice consistency. A higher tau indicates that subjects’ choices rely more strongly on the learned values.

(2) Based on participants’ feedback and a GLMM analysis, choices are clearly influenced by value. I would therefore expect the model to recover a reasonably large tau.

(3) I constrained tau to the range [0, 10], but the fitted tau ends up very small, e.g., around 0.1. I’m quite puzzled and would like your advice on where the issue might be.
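For intuition, note that what matters in the softmax is tau times the value scale. A minimal sketch in plain Python (not part of the model; the example values are hypothetical, chosen to match the task’s reward range):

```python
import math

def softmax_choice_prob(tau, values):
    """P(choosing option 1) under a softmax with inverse temperature tau."""
    logits = [tau * v for v in values]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return exps[0] / sum(exps)

# On a unit value scale, tau = 0.1 looks almost random...
print(round(softmax_choice_prob(0.1, [1.0, 0.0]), 3))      # 0.525
# ...but on this task's reward scale (values up to ~250), the same
# tau can still produce near-deterministic choices:
print(round(softmax_choice_prob(0.1, [150.0, 100.0]), 3))  # 0.993
```

So a tau of 0.1 is not necessarily evidence of random choice when the values being compared are large.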

(4) Current findings and attempts: the group-level mean for tau (tau_mu_raw) is estimated to be negative. I tried adding “lower=0” to it, but then the model has convergence issues (Rhat > 1.1).

(5) My code is below, mainly to illustrate the modeling approach. I would greatly appreciate any other methods or references.

Thank you in advance for your assistance!

(Main reference: lei-zhang (Dr. Lei Zhang) · GitHub)
(Code reference: Testosterone eliminates strategic prosocial behavior through impacting choice consistency in healthy males | Neuropsychopharmacology)

Stan Code:


data {
  int<lower=1> nSubjects;                           // number of subjects
  int<lower=1> nTrials;                             // number of trials
  int<lower=1,upper=99> Choice[nSubjects,nTrials];  // 1 = Left Choice, 2 = Right Choice, 99 = Missing
  real<lower=0> Reward_A[nSubjects,nTrials];        // experiment reward feedback A (RANGE:[1,250])
  real<lower=0> Reward_B[nSubjects,nTrials];        // experiment reward feedback B (RANGE:[1,250])
} 

transformed data {
  vector[2] RAi = rep_vector(0,2); // initial expected values for Reward_A
  vector[2] RBi = rep_vector(0,2); // initial expected values for Reward_B
  vector[2] EVi = rep_vector(0,2); // initial expected values for EV, where EV = w * Reward_A + (1-w) * Reward_B
}

parameters {
  // group level parameters
  real alphaRA_mu_raw; // learning rate for Reward_A
  real alphaRB_mu_raw; // learning rate for Reward_B
  real w_mu_raw;       // a weighting parameter between Reward_A and Reward_B 
  real tau_mu_raw;     // inverse temperature in softmax choice function
  
  real<lower=0> alphaRA_sd_raw;
  real<lower=0> alphaRB_sd_raw;
  real<lower=0> w_sd_raw;
  real<lower=0> tau_sd_raw;

  // individual level parameters - raw space
  vector[nSubjects] alphaRA_raw;
  vector[nSubjects] alphaRB_raw;
  vector[nSubjects] w_raw;
  vector[nSubjects] tau_raw;
}

transformed parameters {
  vector<lower=0,upper=1>[nSubjects] alphaRA;
  vector<lower=0,upper=1>[nSubjects] alphaRB;
  vector<lower=0,upper=1>[nSubjects] w;
  vector<lower=0,upper=10>[nSubjects] tau;
 
  // convert to the actual parameter space
  for (s in 1:nSubjects){
    alphaRA[s]  = Phi_approx( alphaRA_mu_raw  + alphaRA_sd_raw * alphaRA_raw[s] );
    alphaRB[s]  = Phi_approx( alphaRB_mu_raw  + alphaRB_sd_raw * alphaRB_raw[s] );
    w[s]  = Phi_approx( w_mu_raw  + w_sd_raw * w_raw[s] );
    tau[s]  = Phi_approx( tau_mu_raw  + tau_sd_raw * tau_raw[s] )*10; 
  }
}

model {
  // priors
  alphaRA_mu_raw  ~ std_normal();
  alphaRB_mu_raw  ~ std_normal();
  w_mu_raw  ~  std_normal();
  tau_mu_raw ~ std_normal();
  
  alphaRA_sd_raw  ~ cauchy(0,1);
  alphaRB_sd_raw  ~ cauchy(0,1);
  w_sd_raw  ~ cauchy(0,1);
  tau_sd_raw ~ cauchy(0,1);
  
  alphaRA_raw  ~ std_normal();
  alphaRB_raw  ~ std_normal();
  w_raw  ~ std_normal();
  tau_raw  ~ std_normal();

  // model
  for (s in 1:nSubjects){
    vector[2] RA = RAi;
    vector[2] RB = RBi;
    vector[2] EV = EVi;
    
    for (t in 1:nTrials)  {
      if (Choice[s,t] != 99) {
        Choice[s,t] ~ categorical_logit(tau[s] * EV);                          // softmax choice function
        
        EV[Choice[s,t]] = w[s] * RA[Choice[s,t]] + (1-w[s]) * RB[Choice[s,t]]; // calculate Expected Value
        
        RA[Choice[s,t]] += alphaRA[s] * (Reward_A[s,t] - RA[Choice[s,t]]);     // Q-learning: Reward_A
        RB[Choice[s,t]] += alphaRB[s] * (Reward_B[s,t] - RB[Choice[s,t]]);     // Q-learning: Reward_B
      } // end of Choice
    } // end of t
  } // end of s  
} // end of model 

generated quantities {
  // parameters
  real<lower=0,upper=1> alphaRA_mu;
  real<lower=0,upper=1> alphaRB_mu;
  real<lower=0,upper=1> w_mu;
  real<lower=0,upper=10> tau_mu;
  
  real log_lik[nSubjects];                                            //choice log likelihood
  int y_pred[nSubjects, nTrials] = rep_array(99, nSubjects, nTrials); //choice prediction

  // convert to the actual parameter space
  alphaRA_mu  = Phi_approx(alphaRA_mu_raw);
  alphaRB_mu  = Phi_approx(alphaRB_mu_raw);
  w_mu  = Phi_approx(w_mu_raw);
  tau_mu  = Phi_approx(tau_mu_raw)*10;

  // model
  {
    for (s in 1:nSubjects){
      vector[2] RA = RAi;
      vector[2] RB = RBi;
      vector[2] EV = EVi;
      log_lik[s]=0;
      
      for (t in 1:nTrials)  {
        if (Choice[s,t] != 99) {
          log_lik[s] += categorical_logit_lpmf(Choice[s,t] | tau[s] * EV);       //choice log likelihood
          y_pred[s,t] = categorical_logit_rng(tau[s] * EV);                      //choice prediction
          
          EV[Choice[s,t]] = w[s] * RA[Choice[s,t]] + (1-w[s]) * RB[Choice[s,t]]; // calculate Expected Value
        
          RA[Choice[s,t]] += alphaRA[s] * (Reward_A[s,t] - RA[Choice[s,t]]);     // Q-learning: Reward_A
          RB[Choice[s,t]] += alphaRB[s] * (Reward_B[s,t] - RB[Choice[s,t]]);     // Q-learning: Reward_B
      } // end of Choice
    } // end of t
  } // end of s 
 } // end of model

} // end of generated quantities
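For intuition about the tau transform in `transformed parameters`: a quick numeric check in plain Python (using the exact normal CDF where the Stan model uses `Phi_approx`, which is a fast logistic approximation to it) shows how a negative tau_mu_raw maps to a small tau:

```python
import math

def phi(x):
    """Exact standard normal CDF (Stan's Phi_approx closely approximates this)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tau_transform(mu_raw, sd_raw, raw):
    # mirrors: tau[s] = Phi_approx(tau_mu_raw + tau_sd_raw * tau_raw[s]) * 10
    return phi(mu_raw + sd_raw * raw) * 10.0

# A moderately negative raw group mean already maps to a tau near 0.1:
print(round(tau_transform(-2.3, 0.0, 0.0), 3))  # 0.107
```

So a small fitted tau corresponds to a negative tau_mu_raw on the unconstrained scale, which is why a negative estimate there is consistent with the observed tau, not a separate error.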


Are you intentionally biasing the posteriors of your standard deviations toward zero? If not, you may want to use lognormal or gamma as your prior on your strictly positive parameters.

A Cauchy centered at zero has a central tendency of approximately zero. Since you seem to be aiming for uninformative priors, you may want the standard deviations of your normals to be centered over one, with a distribution that has maximum entropy over the supported domain. That’s lognormal(0,1).
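To make the “biasing toward zero” point concrete, here is a quick check of how much probability mass each prior places below 0.1 (plain Python; this treats the `<lower=0>`-constrained cauchy(0,1) as a half-Cauchy):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Probability mass below 0.1 under each prior on a standard deviation:
p_half_cauchy = (2.0 / math.pi) * math.atan(0.1)  # half-Cauchy(0, 1) CDF at 0.1
p_lognormal = phi(math.log(0.1))                  # lognormal(0, 1) CDF at 0.1
print(round(p_half_cauchy, 3))  # 0.063
print(round(p_lognormal, 3))    # 0.011
```

The half-Cauchy puts roughly six times as much mass near zero as lognormal(0,1), which is what pulls the group-level standard deviations down.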


Hi @Corey.Plate,
My apologies for the delayed response, and thank you very much for your answer!

As you mentioned, I want to fit the model using uninformative priors. For parameters that must be positive, I’m using ‘<lower=0>’ to constrain them.

I’m not quite sure how to use prior distributions for constraints.

Following your advice, I set a lognormal(0,1) prior on the standard deviation of each parameter, like this:

// group level parameter
real tau_mu_raw;
real<lower=0> tau_sd_raw; 

// individual level parameter
vector[nSubjects] tau_raw; 

// transformed parameter
vector<lower=0,upper=10>[nSubjects] tau;
tau[s]  = Phi_approx( tau_mu_raw  + tau_sd_raw * tau_raw[s] )*10;

// prior
tau_mu_raw ~ std_normal();
tau_sd_raw ~ lognormal(0,1);
tau_raw  ~ std_normal();

but the program seems unresponsive. Could this be due to my settings? I’m a bit confused. If you could share example code for the prior-setting section, I would greatly appreciate it.

Thank you once again for your assistance!

Are you receiving any error messages from the compiler or any warning messages from the sampler?


Hi @Corey.Plate,
I apologize for my mistake.
Because the parameter’s sampling range was left unconstrained, I received this sampling error:

Chain 1: Error evaluating the log probability at the initial value.
Chain 1: Exception: lognormal_lpdf: Random variable is -0.221171, but must be nonnegative!

After adding the constraint and rerunning, the program executed without any issues and sampling completed successfully!

Unfortunately, tau is still extremely small, averaging 0.04. I suspected this might reflect the data (participants’ choices may tend toward random), but the value is so small that it seems quite unusual.

I’m beginning to suspect it’s a prior-related issue. Based on previous literature (doi: 10.1016/j.neuron.2011.02.027), I’ve tried the following settings:

tau ~ gamma(1.2, 5)
alpha ~ beta(1.1, 1.1)
w ~ beta(1.1, 1.1)
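For reference, assuming Stan’s shape/rate parameterization of the gamma distribution, gamma(1.2, 5) by itself concentrates tau at small values:

```python
# Moments of gamma(shape, rate) = gamma(1.2, 5) in Stan's parameterization:
shape, rate = 1.2, 5.0
mean = shape / rate        # mean = shape / rate = 0.24
sd = shape ** 0.5 / rate   # sd = sqrt(shape) / rate ~ 0.219
print(mean, round(sd, 3))  # 0.24 0.219
```

So this prior already expects tau well below 1, which is worth keeping in mind when interpreting small fitted values.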

To satisfy the support of each sampling distribution,
I added range constraints at both the group level and the individual subject level:

// group level parameters
real<lower=0,upper=1> w_mu_raw;
real<lower=0> tau_mu_raw;

// individual level parameters
vector<lower=0,upper=1>[nSubjects] w_raw;
vector<lower=0>[nSubjects] tau_raw;

// priors
w_mu_raw ~ beta(1.1,1.1);
tau_mu_raw ~ gamma(1.2,5);

w_sd_raw ~ lognormal(0,1);
tau_sd_raw ~ lognormal(0,1);

w_raw ~ beta(1.1,1.1);
tau_raw ~ gamma(1.2,5);

The boundary constraints produced a somewhat worse fit, but that was not the main concern. The primary issue was a strange pattern in the results: all estimates were concentrated around the middle of their ranges:

tau ≈ 5
alpha ≈ 0.5
w ≈ 0.5

Even though my attempts so far haven’t been very successful, I believe that by following your suggestions and adjusting the prior distributions I can achieve a better outcome. Any suggestions for the prior settings would be greatly appreciated.

Once again, thank you for your assistance!

How many subjects are you fitting, and how many observations per subject? If you have many subjects with relatively small, noisy data sets per subject, you may want to consider a hierarchical parameterization for your priors so that you can take advantage of both the within-subject and between-subjects variance in the data set. That entails making a guess at the hyperprior over the parameters of the priors and letting those parameters vary for your prior distributions. For example:

model {
  // hyperpriors
  x_mu ~ normal(0, 1);
  x_sigma ~ lognormal(0, 1);
  y_mu ~ normal(0, 1);
  y_sigma ~ lognormal(0, 1);

  // priors
  x ~ normal(x_mu, x_sigma);
  y ~ lognormal(y_mu, y_sigma);

  // likelihood
  z ~ function(x, y);
}

(These are dummy variables and distributions for demonstration purposes)


Thank you, @Corey.Plate!

The experiment used a within-subject design with 40 participants and 96 trials each.
I agree that the parameterization you suggested is promising; I’ve been gradually approaching the ‘ideal’ parameters through iterative adjustments of the priors.

I apologize for not responding promptly.
Thank you very much for your patience and help!