Reinforcement learning model

loo

#1

Hi Stan

Based on the excellent hBayesDM package, I’m trying to set
up a Rescorla-Wagner learning model together with a linear
observation model, i.e., I’m trying to assess how skin conductance
data recorded during a Pavlovian fear-conditioning task is modulated
by punishment expectancy. There is one free parameter in my
model, the learning rate,which is in many cases estimated with MLE
and a nonlinear optimizer such as fmincon in Matlab. I’d like to
switch to hierarchical Bayes because the MLE estimates I
get appear to be too noisy.

I started with simulated data, but for some reason the 4 chains of
my stan model won’t mix. I attached the traceplots. I was
wondering if there is an obvious mistake in my model.

Thank you.

data {
  int<lower=1> N;
  int<lower=1> T;               
  int<lower=1, upper=T> Tsubj[N];                 
  real<lower=0> scr[T, N];     
  int<lower=0, upper=1> lambda[T, N];   
  int<lower=1, upper=2> stim[T, N];
}

transformed data {
  vector<lower=0, upper=1>[2] initV;  // initial value for EV
  initV = rep_vector(0.5, 2);
}

parameters {
// Declare all parameters as vectors for vectorizing
  // Hyper(group)-parameters  
  vector[4] mu_p;           
  // increase from 3 --> 4 to add "sigma_er mean"
  vector<lower=0>[4] sigma;
    
  // Subject-level raw parameters (for Matt trick)
  vector[N] A_pr;    // learning rate
  vector[N] alpha_pr;
  vector[N] beta_pr;

  // sigma_er will be the sd of the normal distribution. It is like an
  // inverse temperature --> higher = actual y values are "consistent"
  // with the y-hat values
  vector[N] sigma_er_pr;  
}

transformed parameters {
  // subject-level parameters
  vector<lower=0,upper=1>[N] A;
  vector[N] alpha;
  vector[N] beta;
  vector<lower=0>[N] sigma_er;
  
  for (i in 1:N) {
    A[i] = Phi_approx( mu_p[1]  + sigma[1]  * A_pr[i] );
  }
  // Using non-centered parameterization
  alpha = mu_p[2] + sigma[2] * alpha_pr;  
  beta  = mu_p[3] + sigma[3] * beta_pr;  
  sigma_er = exp(mu_p[4] + sigma[4] * sigma_er_pr);
}

model {
  // Hyperparameters
  mu_p  ~ normal(0, 1); 
  sigma ~ cauchy(0, 5);  
  
  // individual parameters
  A_pr     ~ normal(0, 1);
  alpha_pr ~ normal(0, 1);
  beta_pr  ~ normal(0, 1);
  sigma_er_pr ~ normal(0, 1);

  // subject loop and trial loop
  for (i in 1:N) {
    vector[2] ev; // expected value
    real PE;      // prediction error

    ev = initV;

    for (t in 1:(Tsubj[i])) {        
      scr[t, i] ~ normal(alpha[i] + beta[i] * ev[stim[t,i]], sigma_er[i]);

      // prediction error 
      PE = lambda[t, i] - ev[stim[t,i]];

      // value updating (learning) 
      ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_A;
  real mu_alpha;
  real mu_beta;
  real<lower=0> mu_sigma_er;
  
  // For log likelihood calculation
  real log_lik[N]; 

  mu_A = Phi_approx(mu_p[1]);
  mu_alpha = mu_p[2];
  mu_beta  = mu_p[3];
  mu_sigma_er = exp(mu_p[4]);

  { // local section, this saves time and space
    for (i in 1:N) {
      vector[2] ev; // expected value
      real PE;      // prediction error

      // Initialize values
      ev = initV;
      
      log_lik[i] = 0;
      
      for (t in 1:(Tsubj[i])) {
        // compute action probabilities
        log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);
        
        // prediction error 
        PE = lambda[t, i] - ev[stim[t, i]];

        // value updating (learning) 
        ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
      }
    }   
  }
}

traceplot.pdf (122.4 KB)


#2

It might seem weird, but remember to add noise to your simulated data. I’ve seen this issue if you fit a model to data that’s too perfect. That’s just a wild guess though. It could definitely be something else.


#3

At first glance, I cannot find any obvious mistake in the code. Currently, you have the full package (i.e., hierarchical model, non-centered parameterization). Have you tried simplifying the model by starting with a simple individual-level model and adding the other parts sequentially? This should substantially narrow down the problem.


#4

Thanks a lot, the reminder of adding noise turned out to be very helpful,
the model does converge now.

However, I still cannot recover the implanted learning rate (alpha), and
there is hardly any variance between indvidual learning rates (see
attached graph).

This snippet shows how I generate the simulated data in R:

for (i in 1:nsubs) {
  vm    <- 0.5
  vp    <- 0.5
  for (t in 1:ntrials) {
    if (stim[t] == 1) {
      v[t, i] <- vm
      vm <- vm + alpha[i] * (lambda[t] - vm)
    } else {
      v[t, i] <- vp
      vp <- vp + alpha[i] * (lambda[t] - vp)       
    } 
    mu <- beta0[i] + beta1[i] * v[t, i]
    Y[t, i] <- rnorm(1, mean=mu, sd=sigma[i])
  }
}

Thanks for your help!
bayessim.pdf (47.7 KB)


#5

Hmm, wild guess number 2 would be that the variation you expect to see in alpha is getting absorbed by another parameter?

Try making pairplots of alpha, beta, sigma_er, and A for a few individuals together. Are there any big correlations? Maybe there’s an unidentifiability?

I assume that the big grid of plots in the attached PDF are the generated quantities EVs. Does your estimated noise parameter match the one you used to generate the data? If it’s larger, maybe your model has decided that the extra variation due to the per-person alphas is actually measurement noise.


#6

Thanks so much, reducing sigma in the simulated data
did the trick; I can now recover the parameters nicely.

One more question though: as I am planning model
comparisons I also computed the log likelihood in stan
and used it for the calculation of LOOIC. Using loo, I
get the warning "Warning: Some Pareto k diagnostic values are too high."
Pareto k diagnostic values reveal that more than 60% of the
values are > 1.

Does this reflect overly influental observations in my
data or could it also point to a misspecified model? That
would be strange because I can recover the parameters,
the model converges and the Rhats look fine.

Thank you!

Also, this is how I compute the likelihood in my model:

model {
  // Hyperparameters
  mu_p  ~ normal(0, 1); 
  sigma ~ cauchy(0, 5);  
  
  // individual parameters
  A_pr     ~ normal(0, 1);
  alpha_pr ~ normal(0, 1);
  beta_pr  ~ normal(0, 1);
  sigma_er_pr ~ normal(0, 1);

  // subject loop and trial loop
  for (i in 1:N) {
    vector[2] ev; // expected value
    real PE;      // prediction error

    ev = initV;

    for (t in 1:(Tsubj[i])) {        
      scr[t, i] ~ normal(alpha[i] + beta[i] * ev[stim[t,i]], sigma_er[i]);

      // prediction error 
      PE = lambda[t, i] - ev[stim[t,i]];

      // value updating (learning) 
      ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_A;
  real mu_alpha;
  real mu_beta;
  real<lower=0> mu_sigma_er;
  
  // For log likelihood calculation
  real log_lik[N]; 

  mu_A = Phi_approx(mu_p[1]);
  mu_alpha = mu_p[2];
  mu_beta  = mu_p[3];
  mu_sigma_er = exp(mu_p[4]);

  { // local section, this saves time and space
    for (i in 1:N) {
      vector[2] ev; // expected value
      real PE;      // prediction error

      // Initialize values
      ev = initV;
      
      log_lik[i] = 0;
      
      for (t in 1:(Tsubj[i])) {
        // compute action probabilities
        log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);
        
        // prediction error 
        PE = lambda[t, i] - ev[stim[t, i]];

        // value updating (learning) 
        ev[stim[t, i]] = ev[stim[t, i]] + A[i] * PE; 
      }
    }   
  }
}

#7

Good to hear! LOO is outta my realm though, haha. When you say you’re using loo, you mean https://cran.r-project.org/web/packages/loo/ right?

@jonah and @avehtari know about this stuff though. If they don’t respond after a while (they might have missed this), just remake a new thread with the LOO question.

Good luck!


#8

Although sometimes high khat-values indicate outliers or model misspecification, you can get high khat-values even if using the true model (with simulated data), if your model is flexible and observations are locally influential. If the model is true (or not true but close, ie, well specified) the observations are not “overly” influential, but still the posterior and loo-posterior can be too different for the importance sampling. This is really common, for example, with flexible GP and GMRF models having n observations and n latent values.

Aki


#9

Although sometimes high khat-values indicate outliers or model misspecification, you can get high khat-values even if using the true model (with simulated data), if your model is flexible and observations are locally influential. If the model is true (or not true but close, ie, well specified) the observations are not “overly” influential, but still the posterior and loo-posterior can be too different for the importance sampling.

What are the implications of this for interpreting loo output? Does this mean the K-fold cross-validation is the more appropriate choice or can the warning be ignored in certain instances (i.e. when k >0.7 for ~2.5% of the observations)?


#10

In general you shouldn’t ignore warnings. Especially any k>1 means the error can be arbitrary large even with the number of draws going to infinity. khat is estimate of k, and thus fit with finite number of draws it’s better to not trust khat>0.7.
Instead of k-fold, you could also use PSIS-LOO+, which does MCMC just for the cases with khat>0.7. There are some special cases where in case of a small portion of khats>0.7 there is no need to recompute. For example if you compare linear model with Gaussian observation model to linear model with Student’s t observation model, and you get high khats only for the Gaussian model and you get much higher elpd for Student’s t model, then you it’s very likely that you should use Student’s t model, and you don’t need to get more accurate estimate of elpd difference.


#11

Thank you Aki - Definitely agree with

it just wasn’t clear to me how to deal with the existence of those warnings. I see that PSIS-LOO+ is implemented in rstanarm - is there a way to emulate that implementation using just Rstan? I can’t tell whether the bulk of the work being done by PSIS-LOO+ is happening in R or via the Stan call…
Thanks again for your response,
M


#12

For RStan you can look at the Stan K-fold-CV code in the appendix of https://link.springer.com/article/10.1007%2Fs11222-016-9696-4 (preprint in https://arxiv.org/abs/1507.04544). The same code can be used to compute LOO or just some elpd_i’s to be used in PSIS-LOO+. In addition you need some R code, but I don’t where you could find a complete example for PSIS-LOO+ for RStan although that would be very useful to have.


#13

I hope that it is not in bad form to resurrect old threads like this. Also, I think it would be impossible to fully express my extreme gratitude to all the experts who take time to respond to these posts.

I am having a similar issue with a model also based on the hBayesDM go-no-go reinforcement learning models. In my case, I modify it to estimate population learning parameters for each of several conditions.

When using the loo package on the log-likelihood (summed across all observations within each of 313 participants), I’m also obtaining mostly high k (>.7), as well as a lot of warnings during computation that In log(z) : NaNs produced.

The references you provide about k-fold cross-validation are really clear and helpful. However, the models I would like to compare each take about 3-5 days to run (and I would like to compare 4 different models). So I guess my question is whether anyone would recommend any other approaches?

Another more conceptual question is whether the unit of cross-validation is correct for these models. I think what this CV procedure is testing is how well the model, as fit to N-1 observations will predict the Nth case, but for that Nth case we can only use the population-level parameters, even though we assume a lot of heterogeneity in the individual-level parameters (which is one of the goals of fitting these models). It seems like the more appropriate CV question here is whether the estimated parameters for an individual allow good prediction of new data from that individual. I’m don’t fully understand how I would go about implementing this kind of CV in this framework without ending up with N LOOIC estimates, rather than a model-level LOOIC estimate (without averaging all those LOOICs, anyway).

Thanks for your help,
~j


#14

No worriesa bout resurrecting old threads. Some of us still have alerts on them. The main reason not to is that the subjects aren’t to the point of what you’re asking, so the relevant people might not spot them.

@jonah or @avehtari can probably answer about the loo package details and also make better general recommendations than me.

The second question you raise is a common one in hierarchical modeling. We can evaluate at two levels—we can predict new observations for the groups we have, or we can predict observations for new groups. How you define the folds will matter. You can hold out a whole group and predict it based on the others, or you can hold out an observation within a group and predict that.


#15

We probably have something to recommend, but we need to run a few more tests before we can be certain that it works as intended, but likely we have something before the end of summer. Email me and I’ll contact you when we have something for you to test,

Sine LOO removes one observation and if that observation is one observation from many observations per individual, then LOO is what you want. Leave-one-individual-out is useful if you want to know the performance for a new individual.


#16

Thanks for the reply. I’ll email you, @avehtari. I also want to follow up to make sure I understand how the way I save the log_lik impacts my interpretation of the estimates from loo.

In both @janeg’s and my generated quantities section, there is a line that is roughly:

log_lik[i] = log_lik[i] + normal_lpdf(scr[t, i] | alpha[i] + beta[i] * ev[stim[t, i]], sigma_er[i]);

Where i is the group index (i.e., an individual participant in our case) and t is the observation index (with Tsubj[i] observations for the ith individual). This means, I think, that the matrix of samples x log likelihoods we’re passing to loo::loo are the log likelihood samples per-group, summed across all observations in that group. This also leads me to believe that we’re really computing the estimate for a leave-one-individual-out predictive fit by virtue of how we’re saving the log_lik vector. If that’s right, and I want a leave-one-observation-out predictive fit estimate, I should be saving the log posterior density for each observation instead of their within-group sums.


#17

Yes (this was easy to answer, thanks to your clear explanation and question).