I wrote a blog post about Q-learning, a type of reinforcement learning (RL), with simulated human data and both non-hierarchical and hierarchical Bayesian models in Stan, here.
I simulated rewards from a restless 2-arm bandit, basically 2 slot machines whose rewards change across trials. Then I simulated one subject that uses Q-learning to choose an arm on each trial, fit the data, and recovered the parameters.
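For anyone who wants a feel for that simulation step without opening the post, here is a rough Python sketch (not the code from the blog post). Everything specific in it is my assumption for illustration: the random-walk drift of the reward probabilities, the softmax choice rule, the names `simulate_restless_bandit` and `simulate_q_learner`, and the values of `alpha` (learning rate) and `beta` (inverse temperature).

```python
import math
import random


def simulate_restless_bandit(n_trials, drift_sd=0.05, seed=0):
    """Reward probabilities for 2 arms that drift as bounded random walks (assumed form)."""
    rng = random.Random(seed)
    probs = [[0.5, 0.5]]
    for _ in range(n_trials - 1):
        prev = probs[-1]
        # Each arm takes a Gaussian step, clipped to stay inside [0.05, 0.95].
        step = [min(0.95, max(0.05, p + rng.gauss(0.0, drift_sd))) for p in prev]
        probs.append(step)
    return probs


def simulate_q_learner(probs, alpha=0.3, beta=5.0, seed=1):
    """One simulated subject: softmax choice plus a delta-rule Q-value update."""
    rng = random.Random(seed)
    q = [0.5, 0.5]  # initial Q-values, an arbitrary starting point
    choices, rewards = [], []
    for p in probs:
        # With two arms, the softmax reduces to a logistic of the value difference.
        p_arm1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        choice = 1 if rng.random() < p_arm1 else 0
        reward = 1 if rng.random() < p[choice] else 0
        # Q-learning update: move the chosen arm's value toward the observed reward.
        q[choice] += alpha * (reward - q[choice])
        choices.append(choice)
        rewards.append(reward)
    return choices, rewards


probs = simulate_restless_bandit(200)
choices, rewards = simulate_q_learner(probs)
```

The `choices` and `rewards` lists are the kind of data one would then pass to a Stan model to try to recover `alpha` and `beta`.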
Finally, I simulated several subjects with a hierarchical structure, then again fit a Bayesian model and recovered the parameters.
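As a rough illustration of what "hierarchical structure" can mean here, the Python sketch below draws each subject's parameters from group-level distributions. This is only one common way to set it up and not necessarily what the blog post does: the logit-normal learning rate, the log-normal inverse temperature, the function name `sample_subject_params`, and all default values are assumptions of mine.

```python
import math
import random


def sample_subject_params(n_subjects, mu_alpha=0.0, sd_alpha=0.5,
                          mu_beta=1.0, sd_beta=0.3, seed=2):
    """Draw per-subject (alpha, beta) from assumed group-level distributions."""
    rng = random.Random(seed)
    params = []
    for _ in range(n_subjects):
        # Learning rate: Gaussian on the logit scale, mapped into (0, 1).
        alpha = 1.0 / (1.0 + math.exp(-rng.gauss(mu_alpha, sd_alpha)))
        # Inverse temperature: Gaussian on the log scale, so it stays positive.
        beta = math.exp(rng.gauss(mu_beta, sd_beta))
        params.append((alpha, beta))
    return params


subject_params = sample_subject_params(10)
```

Each `(alpha, beta)` pair would then drive one simulated subject, and the hierarchical model tries to recover both the subject-level values and the group-level means and standard deviations.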
I hope it’s useful! I did it mainly to learn more about RL, and I’m not an expert so feel free to tell me if I messed it up somewhere.
Bruno