Reinforcement learning model

Hi Stan

Based on the excellent hBayesDM package, I'm trying to set
up a Rescorla-Wagner learning model together with a linear
observation model, i.e., I'm trying to assess how skin conductance
data recorded during a Pavlovian fear-conditioning task are modulated
by punishment expectancy. There is one free parameter in my
model, the learning rate, which is often estimated with MLE
and a nonlinear optimizer such as fmincon in Matlab. I'd like to
switch to hierarchical Bayes because the MLE estimates I
get appear to be too noisy.

I started with simulated data, but for some reason the 4 chains of
my Stan model won't mix. I attached the traceplots. I was
wondering if there is an obvious mistake in my model.

Thank you.

data {
  int<lower=1> N;                       // number of subjects
  int<lower=1> T;                       // maximum number of trials
  int<lower=1, upper=T> Tsubj[N];       // number of trials per subject
  real<lower=0> scr[T, N];              // skin conductance responses
  int<lower=0, upper=1> lambda[T, N];   // reinforcement (0/1) on each trial
  int<lower=1, upper=2> stim[T, N];     // which of the two stimuli was shown
}

transformed data {
  vector<lower=0, upper=1>[2] initV;  // initial value for EV
  initV = rep_vector(0.5, 2);
}

parameters {
// Declare all parameters as vectors for vectorizing
  // Hyper(group)-parameters  
  vector[4] mu_p;           
  // increase from 3 --> 4 to add "sigma_er mean"
  vector<lower=0>[4] sigma;
    
  // Subject-level raw parameters (for Matt trick)
  vector[N] A_pr;    // learning rate
  vector[N] alpha_pr;
  vector[N] beta_pr;

  // sigma_er will be the sd of the normal distribution. It is like an
  // inverse temperature --> higher = actual y values are "consistent"
  // with the y-hat values
  vector[N] sigma_er_pr;  
}

transformed parameters {
  // subject-level parameters
  vector<lower=0,upper=1>[N] A;
  vector[N] alpha;
  vector[N] beta;
  vector<lower=0>[N] sigma_er;
  
  for (i in 1:N) {
    A[i] = Phi_approx( mu_p[1]  + sigma[1]  * A_pr[i] );
  }
  // Using non-centered parameterization
  alpha = mu_p[2] + sigma[2] * alpha_pr;  
  beta  = mu_p[3] + sigma[3] * beta_pr;  
  sigma_er = exp(mu_p[4] + sigma[4] * sigma_er_pr);
}

model {
  // Hyperparameters
  mu_p  ~ normal(0, 1); 
  sigma ~ cauchy(0, 5);  
  
  // individual parameters
  A_pr     ~ normal(0, 1);
  alpha_pr ~ normal(0, 1);
  beta_pr  ~ normal(0, 1);
  sigma_er_pr ~ normal(0, 1);

  // subject loop and trial loop
  for (i in 1:N) {
    vector[2] ev; // expected value
    real PE;      // prediction error

    ev = initV;

    for (t in 1:(Tsubj[i])) {        
      scr[t, i] ~ normal(alpha[i] + beta[i] * ev[stim[t,i]], sigma_er[i]);

      // prediction error 
      PE = lambda[t, i] - ev[stim[t,i]];

      // value updating (learning) 
      ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_A;
  real mu_alpha;
  real mu_beta;
  real<lower=0> mu_sigma_er;
  
  // For log likelihood calculation
  real log_lik[N]; 

  mu_A = Phi_approx(mu_p[1]);
  mu_alpha = mu_p[2];
  mu_beta  = mu_p[3];
  mu_sigma_er = exp(mu_p[4]);

  { // local section, this saves time and space
    for (i in 1:N) {
      vector[2] ev; // expected value
      real PE;      // prediction error

      // Initialize values
      ev = initV;
      
      log_lik[i] = 0;
      
      for (t in 1:(Tsubj[i])) {
        // accumulate the log likelihood of the observed SCR
        log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);
        
        // prediction error 
        PE = lambda[t, i] - ev[stim[t, i]];

        // value updating (learning) 
        ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
      }
    }   
  }
}
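
(For reference, a model like this is typically fitted from R along these lines; the file name, the data list, and the sampler settings below are just placeholders, not the exact call used here.)

library(rstan)
stan_data <- list(N = N, T = T, Tsubj = Tsubj,
                  scr = scr, lambda = lambda, stim = stim)
fit <- stan(file = "rw_scr.stan", data = stan_data,
            chains = 4, iter = 2000, warmup = 1000, seed = 1,
            control = list(adapt_delta = 0.95))
print(fit, pars = c("mu_p", "sigma"))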

traceplot.pdf (122.4 KB)

It might seem weird, but remember to add noise to your simulated data. I’ve seen this issue if you fit a model to data that’s too perfect. That’s just a wild guess though. It could definitely be something else.

At first glance, I cannot find any obvious mistake in the code. Currently, you have the full package (i.e., hierarchical model, non-centered parameterization). Have you tried simplifying the model by starting with a simple individual-level model and adding the other parts sequentially? This should substantially narrow down the problem.

Thanks a lot, the reminder to add noise turned out to be very helpful;
the model does converge now.

However, I still cannot recover the implanted learning rate (alpha in
the simulation code below, A in the Stan model), and there is hardly
any variance between individual learning rates (see attached graph).

This snippet shows how I generate the simulated data in R:

for (i in 1:nsubs) {
  vm    <- 0.5   # initial expected value for stimulus 1
  vp    <- 0.5   # initial expected value for stimulus 2
  for (t in 1:ntrials) {
    if (stim[t] == 1) {
      v[t, i] <- vm
      vm <- vm + alpha[i] * (lambda[t] - vm)   # Rescorla-Wagner update
    } else {
      v[t, i] <- vp
      vp <- vp + alpha[i] * (lambda[t] - vp)
    }
    # linear observation model with Gaussian noise
    mu      <- beta0[i] + beta1[i] * v[t, i]
    Y[t, i] <- rnorm(1, mean = mu, sd = sigma[i])
  }
}
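
(For readers who want to run the snippet: it assumes the objects below already exist. The values here are hypothetical placeholders, not necessarily the settings I actually used.)

# Hypothetical setup for the simulation loop above (placeholder values)
nsubs   <- 20
ntrials <- 60
stim    <- sample(1:2, ntrials, replace = TRUE)          # which stimulus on each trial
lambda  <- as.integer(stim == 1 & runif(ntrials) < 0.5)  # reinforcement (0/1)
alpha   <- runif(nsubs, 0.1, 0.6)                        # per-subject learning rates
beta0   <- rnorm(nsubs, 0.2, 0.1)                        # observation-model intercepts
beta1   <- rnorm(nsubs, 1.0, 0.3)                        # observation-model slopes
sigma   <- rep(0.1, nsubs)                               # observation noise sd
v <- matrix(NA_real_, ntrials, nsubs)                    # latent expected values
Y <- matrix(NA_real_, ntrials, nsubs)                    # simulated SCR data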

Thanks for your help!
bayessim.pdf (47.7 KB)


Hmm, wild guess number 2 would be that the variation you expect to see in alpha is getting absorbed by another parameter?

Try making pairplots of alpha, beta, sigma_er, and A for a few individuals together. Are there any big correlations? Maybe there’s an unidentifiability?

I assume that the big grid of plots in the attached PDF are the generated quantities EVs. Does your estimated noise parameter match the one you used to generate the data? If it’s larger, maybe your model has decided that the extra variation due to the per-person alphas is actually measurement noise.
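
Something like this would do both checks (a rough sketch; it assumes your stanfit object is called `fit` and that your simulation stored the per-subject noise sds in `sigma` as in the R snippet above):

library(rstan)

# pairplots of one subject's parameters: look for strong correlations
pairs(fit, pars = c("A[1]", "alpha[1]", "beta[1]", "sigma_er[1]"))

# does the estimated noise match the noise used to simulate the data?
post <- extract(fit)
round(cbind(simulated = sigma, estimated = colMeans(post$sigma_er)), 3)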

Thanks so much, reducing sigma in the simulated data
did the trick; I can now recover the parameters nicely.

One more question though: as I am planning model
comparisons, I also computed the log likelihood in Stan
and used it to calculate LOOIC. Using loo, I
get the warning "Warning: Some Pareto k diagnostic values are too high."
The Pareto k diagnostics reveal that more than 60% of the
values are > 1.
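
For reference, the loo call for a fit like this looks roughly as follows (a sketch; it assumes the stanfit object is called `fit` and that log_lik is saved in generated quantities as shown further down):

library(loo)
ll      <- extract_log_lik(fit, parameter_name = "log_lik", merge_chains = FALSE)
r_eff   <- relative_eff(exp(ll))
loo_fit <- loo(ll, r_eff = r_eff)
print(loo_fit)   # elpd_loo, looic, and the Pareto k diagnostic table
plot(loo_fit)    # khat value for each left-out unit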

Does this reflect overly influential observations in my
data or could it also point to a misspecified model? That
would be strange because I can recover the parameters,
the model converges and the Rhats look fine.

Thank you!

Also, this is how I compute the likelihood in my model:

model {
  // Hyperparameters
  mu_p  ~ normal(0, 1); 
  sigma ~ cauchy(0, 5);  
  
  // individual parameters
  A_pr     ~ normal(0, 1);
  alpha_pr ~ normal(0, 1);
  beta_pr  ~ normal(0, 1);
  sigma_er_pr ~ normal(0, 1);

  // subject loop and trial loop
  for (i in 1:N) {
    vector[2] ev; // expected value
    real PE;      // prediction error

    ev = initV;

    for (t in 1:(Tsubj[i])) {        
      scr[t, i] ~ normal(alpha[i] + beta[i] * ev[stim[t,i]], sigma_er[i]);

      // prediction error 
      PE = lambda[t, i] - ev[stim[t,i]];

      // value updating (learning) 
      ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_A;
  real mu_alpha;
  real mu_beta;
  real<lower=0> mu_sigma_er;
  
  // For log likelihood calculation
  real log_lik[N]; 

  mu_A = Phi_approx(mu_p[1]);
  mu_alpha = mu_p[2];
  mu_beta  = mu_p[3];
  mu_sigma_er = exp(mu_p[4]);

  { // local section, this saves time and space
    for (i in 1:N) {
      vector[2] ev; // expected value
      real PE;      // prediction error

      // Initialize values
      ev = initV;
      
      log_lik[i] = 0;
      
      for (t in 1:(Tsubj[i])) {
        // accumulate the log likelihood of the observed SCR
        log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);
        
        // prediction error 
        PE = lambda[t, i] - ev[stim[t, i]];

        // value updating (learning) 
        ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
      }
    }   
  }
}

Good to hear! LOO is outta my realm though, haha. When you say you’re using loo, you mean https://cran.r-project.org/web/packages/loo/ right?

@jonah and @avehtari know about this stuff though. If they don’t respond after a while (they might have missed this), just remake a new thread with the LOO question.

Good luck!

Although high khat-values sometimes indicate outliers or model misspecification, you can get high khat-values even when using the true model (with simulated data), if your model is flexible and observations are locally influential. If the model is true (or not true but close, i.e., well specified), the observations are not “overly” influential, but the posterior and the loo-posterior can still be too different for the importance sampling. This is really common, for example, with flexible GP and GMRF models having n observations and n latent values.

Aki

Although high khat-values sometimes indicate outliers or model misspecification, you can get high khat-values even when using the true model (with simulated data), if your model is flexible and observations are locally influential. If the model is true (or not true but close, i.e., well specified), the observations are not “overly” influential, but the posterior and the loo-posterior can still be too different for the importance sampling.

What are the implications of this for interpreting loo output? Does this mean that K-fold cross-validation is the more appropriate choice, or can the warning be ignored in certain instances (e.g., when k > 0.7 for ~2.5% of the observations)?

In general you shouldn't ignore warnings. In particular, any k > 1 means the error can be arbitrarily large even as the number of draws goes to infinity. khat is an estimate of k, and with a finite number of draws it's better not to trust khat > 0.7.
Instead of k-fold, you could also use PSIS-LOO+, which runs MCMC just for the cases with khat > 0.7. There are also special cases where, with only a small proportion of khats > 0.7, there is no need to recompute. For example, if you compare a linear model with a Gaussian observation model to a linear model with a Student's t observation model, and you get high khats only for the Gaussian model together with a much higher elpd for the Student's t model, then it's very likely that you should use the Student's t model, and you don't need a more accurate estimate of the elpd difference.
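
In code, that kind of comparison looks roughly like this (a sketch; the object names fit_gauss and fit_student are hypothetical):

library(loo)
ll_g  <- extract_log_lik(fit_gauss,   merge_chains = FALSE)
ll_t  <- extract_log_lik(fit_student, merge_chains = FALSE)
loo_g <- loo(ll_g, r_eff = relative_eff(exp(ll_g)))
loo_t <- loo(ll_t, r_eff = relative_eff(exp(ll_t)))
loo_compare(loo_g, loo_t)   # elpd_diff and its standard error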


Thank you Aki - I definitely agree that in general one shouldn't ignore the warnings; it just wasn't clear to me how to deal with the existence of those warnings. I see that PSIS-LOO+ is implemented in rstanarm - is there a way to emulate that implementation using just RStan? I can't tell whether the bulk of the work being done by PSIS-LOO+ is happening in R or via the Stan call…
Thanks again for your response,
M

For RStan you can look at the Stan K-fold-CV code in the appendix of "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC" (preprint: arXiv 1507.04544). The same code can be used to compute LOO or just some elpd_i's to be used in PSIS-LOO+. In addition you need some R code, but I don't know where you could find a complete example of PSIS-LOO+ for RStan, although that would be very useful to have.
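
A very rough sketch of what the R side of grouped K-fold CV could look like with RStan (everything here is hypothetical: `stan_data` is the data list, `kfold_model` is a compiled variant of the model that takes a holdout[N] flag and drops held-out subjects from the likelihood, and matrixStats is used for a stable log-mean-exp):

library(rstan)
library(loo)
library(matrixStats)

K    <- 10
fold <- kfold_split_random(K = K, N = stan_data$N)   # one fold id per subject

elpd_i <- rep(NA_real_, stan_data$N)
for (k in 1:K) {
  data_k         <- stan_data
  data_k$holdout <- as.integer(fold == k)
  fit_k <- sampling(kfold_model, data = data_k, chains = 4, iter = 2000)
  ll_k  <- extract_log_lik(fit_k, "log_lik")         # draws x subjects
  held  <- which(fold == k)
  # elpd for each held-out subject: log of the mean likelihood over the draws
  elpd_i[held] <- apply(ll_k[, held, drop = FALSE], 2,
                        function(x) logSumExp(x) - log(length(x)))
}
elpd_kfold <- sum(elpd_i)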


I hope that it is not in bad form to resurrect old threads like this. Also, I think it would be impossible to fully express my extreme gratitude to all the experts who take time to respond to these posts.

I am having a similar issue with a model also based on the hBayesDM go-no-go reinforcement learning models. In my case, I modify it to estimate population learning parameters for each of several conditions.

When using the loo package on the log-likelihood (summed across all observations within each of 313 participants), I'm also obtaining mostly high k (> 0.7), as well as a lot of "In log(z) : NaNs produced" warnings during the computation.

The references you provide about k-fold cross-validation are really clear and helpful. However, the models I would like to compare each take about 3-5 days to run (and I would like to compare 4 different models). So I guess my question is whether anyone would recommend any other approaches?

Another more conceptual question is whether the unit of cross-validation is correct for these models. I think what this CV procedure is testing is how well the model, as fit to N-1 observations, will predict the Nth case, but for that Nth case we can only use the population-level parameters, even though we assume a lot of heterogeneity in the individual-level parameters (which is one of the goals of fitting these models). It seems like the more appropriate CV question here is whether the estimated parameters for an individual allow good prediction of new data from that individual. I don't fully understand how I would go about implementing this kind of CV in this framework without ending up with N LOOIC estimates, rather than a model-level LOOIC estimate (without averaging all those LOOICs, anyway).

Thanks for your help,
~j

No worries about resurrecting old threads. Some of us still have alerts on them. The main reason not to is that the subject lines aren't to the point of what you're asking, so the relevant people might not spot them.

@jonah or @avehtari can probably answer about the loo package details and also make better general recommendations than me.

The second question you raise is a common one in hierarchical modeling. We can evaluate at two levels—we can predict new observations for the groups we have, or we can predict observations for new groups. How you define the folds will matter. You can hold out a whole group and predict it based on the others, or you can hold out an observation within a group and predict that.
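
In loo terms, the two choices of fold correspond roughly to the following (a sketch, assuming a vector `subject_id` with one entry per observation):

library(loo)
# hold out whole subjects: all of a subject's trials share the same fold
folds_by_subject <- kfold_split_grouped(K = 10, x = subject_id)
# hold out trials within subjects: each subject contributes to every fold
folds_by_trial   <- kfold_split_stratified(K = 10, x = subject_id)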

We probably have something to recommend, but we need to run a few more tests before we can be certain that it works as intended; we will likely have something before the end of summer. Email me and I'll contact you when we have something for you to test.

Since LOO removes one observation, and that observation is one of many observations from an individual, LOO is what you want. Leave-one-individual-out is useful if you want to know the performance for a new individual.


Thanks for the reply. I’ll email you, @avehtari. I also want to follow up to make sure I understand how the way I save the log_lik impacts my interpretation of the estimates from loo.

In both @janeg’s and my generated quantities section, there is a line that is roughly:

log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);

Where i is the group index (i.e., an individual participant in our case) and t is the observation index (with Tsubj[i] observations for the ith individual). This means, I think, that the samples-by-groups matrix of log likelihoods we're passing to loo::loo contains the log likelihood samples per group, summed across all observations in that group. This also leads me to believe that we're really computing the estimate for a leave-one-individual-out predictive fit by virtue of how we're saving the log_lik vector. If that's right, and I want a leave-one-observation-out predictive fit estimate, I should be saving the log posterior density for each observation instead of their within-group sums.
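
A quick way to check which of the two you're actually computing (a sketch; `fit` is the stanfit object):

ll <- loo::extract_log_lik(fit, "log_lik")
dim(ll)
# draws x N            -> one column per subject -> leave-one-subject-out
# draws x sum(Tsubj)   -> one column per trial   -> leave-one-trial-out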

Yes (this was easy to answer, thanks to your clear explanation and question).


Hello Stan,

First, thanks to the original poster and to everyone else for their advice. I am currently attempting to fit learning models to skin conductance data collected during fear conditioning and have found this thread incredibly helpful. I am new to Stan, so my apologies if the answers to my questions are obvious.

I have successfully run the code posted above in RStan, and my question concerns how to modify the code to implement a hybrid Rescorla-Wagner/Pearce-Hall model, wherein the learning rate (A) varies trial by trial as a function of (ETA).

When I run the modified code, it generates >1000 lines of output in the Console window before running the model (see the attached Console output; unfortunately, I cannot recover the first 135 lines). This did not happen when I ran the original code, and I was wondering what this output signifies and whether anyone is aware of a potential solution.

Additionally, I was wondering whether I should estimate both (A) and (ETA), or whether I should only estimate (ETA), as (A) depends on the (ETA) parameter?

Any advice or guidance would be appreciated. Thanks in advance.

data {
  int<lower=1> N;                       // number of subjects
  int<lower=1> T;                       // maximum number of trials
  int<lower=1, upper=T> Tsubj[N];       // number of trials per subject
  real<lower=0> scr[T, N];              // skin conductance responses
  int<lower=0, upper=1> lambda[T, N];   // reinforcement (0/1) on each trial
  int<lower=1, upper=2> stim[T, N];     // which of the two stimuli was shown
}

transformed data {
  vector<lower=0, upper=1>[2] initV;  // initial value for EV
  real<lower=0> A;                    // initial learning rate
  initV = rep_vector(0.5, 2);
  A = 0.5;
}

parameters {
  // Declare all parameters as vectors for vectorizing
  // Hyper(group)-parameters
  vector[4] mu_p;
  // increase from 3 --> 4 to add "sigma_er mean"
  vector<lower=0>[4] sigma;

  // Subject-level raw parameters (for Matt trick)
  vector[N] ETA_pr;    // learning rate
  vector[N] alpha_pr;
  vector[N] beta_pr;

  // sigma_er will be the sd of the normal distribution. It is like an
  // inverse temperature --> higher = actual y values are "consistent"
  // with the y-hat values
  vector[N] sigma_er_pr;
}

transformed parameters {
  // subject-level parameters
  vector<lower=0,upper=1>[N] ETA;
  vector[N] alpha;
  vector[N] beta;
  vector<lower=0>[N] sigma_er;

  for (i in 1:N) {
    ETA[i] = Phi_approx( mu_p[1] + sigma[1] * ETA_pr[i] );
  }
  // Using non-centered parameterization
  alpha = mu_p[2] + sigma[2] * alpha_pr;
  beta  = mu_p[3] + sigma[3] * beta_pr;
  sigma_er = exp(mu_p[4] + sigma[4] * sigma_er_pr);
}

model {
  // Hyperparameters
  mu_p  ~ normal(0, 1);
  sigma ~ cauchy(0, 5);

  // individual parameters
  ETA_pr      ~ normal(0, 1);
  alpha_pr    ~ normal(0, 1);
  beta_pr     ~ normal(0, 1);
  sigma_er_pr ~ normal(0, 1);

  // subject loop and trial loop
  for (i in 1:N) {
    vector[2] ev; // expected value
    real PE;      // prediction error
    real At;      // trial-wise learning rate
    ev = initV;
    At = A;

    for (t in 1:(Tsubj[i])) {
      scr[t, i] ~ normal(alpha[i] + beta[i] * ev[stim[t,i]], sigma_er[i]);

      // prediction error
      PE = lambda[t, i] - ev[stim[t,i]];

      // value updating (learning)
      ev[stim[t, i]] = ev[stim[t, i]] + At * PE;

      // learning-rate updating
      At = ETA[i];
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_ETA;
  real mu_alpha;
  real mu_beta;
  real<lower=0> mu_sigma_er;

  // For log likelihood calculation
  real log_lik[N];

  mu_ETA = Phi_approx(mu_p[1]);
  mu_alpha = mu_p[2];
  mu_beta  = mu_p[3];
  mu_sigma_er = exp(mu_p[4]);

  { // local section, this saves time and space
    for (i in 1:N) {
      vector[2] ev; // expected value
      real PE;      // prediction error
      real At;      // trial-wise learning rate

      // Initialize values
      ev = initV;
      At = A;

      log_lik[i] = 0;

      for (t in 1:(Tsubj[i])) {
        // accumulate the log likelihood of the observed SCR
        log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);

        // prediction error
        PE = lambda[t, i] - ev[stim[t, i]];

        // value updating (learning)
        ev[stim[t, i]] = ev[stim[t, i]] + At * PE;

        // learning-rate updating
        At = ETA[i];
      }
    }
  }
}

ConsoleOutput.txt (54.2 KB)

Huh, that looks like output from the C++ compiler. I'm not sure what it is. If the model runs and behaves as you expected, I wouldn't worry too much about it. It's probably just warnings if everything compiled successfully.

With regard to the second question:

I think the answer is, if you can estimate them both, do that.

It's entirely possible that with your data you can only identify one of them, or their ratio, or something like that.

The simplest way to test this is to generate fake data where you know the values and then try to fit them. If you're able to recover both things, good stuff. If not, you'll probably have to add prior information, add some more model structure, or assume one of them is fixed or something.
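
A quick recovery check along those lines (a sketch; `fit` is the fit to the simulated data and `true_ETA` holds the values used to generate it; both names are hypothetical):

post <- rstan::extract(fit)
plot(true_ETA, colMeans(post$ETA),
     xlab = "simulated ETA", ylab = "posterior mean ETA")
abline(0, 1)
# if A is also estimated per subject, repeat the same plot for A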
