Does Stan adjust MCMC sampling internally on a parameter-by-parameter basis?

Question:
Does Stan really adjust the MCMC sampling from the posterior distribution internally on a parameter-by-parameter basis?
That is, given a parameter declared as vector[2] theta, does Stan make separate sampling adjustments to the geometry of the posterior distribution for theta[1] and theta[2]? Or does Stan make a single common adjustment for theta as a whole (i.e., for the parameter vector)?

Background:

I am currently trying to fit a policy-based reinforcement learning model to human (agent) behavior data from a gambling task, in which agents are presented with a different winning probability for the gamble on each trial and asked to place a bet as a fraction of the chips they currently hold.
The data are obtained as trial-length vectors (arrays) of the presented winning probability p_t ∈ [0, 1], the current amount of chips c_t ∈ [0, ∞), the betting amount y_t ∈ [0, 1] (as a fraction of the current chips c_t), and the series w_t and l_t, for T = 60 or 100 trials per participant (here, the agent assumed by the model).

When I ran parameter recovery of the five model parameters individually on artificial data generated from the model (i.e., estimating the model one agent at a time), I found relatively good recovery given the complexity of the model. Hence, I believe there is no problem with the model or the gambling task set-up (or with the data obtained from it). Therefore, I would like to refrain from disclosing the unpublished model here.

However, depending on the series of presented winning probabilities p_t and on the parameter combination, the posterior distribution can be multimodal, and there seem to be cases in which Stan's exploration of the parameter space does not work well. I attribute this to both the randomness of the gambling task and the presence of unidentifiable combinations of model parameters. That is, the geometry of the posterior distribution may differ from agent to agent depending on the data set and the model parameters.

The problem I’m really facing is that MCMC sampling fails with warnings that max_treedepth was exceeded when I fit the model to multiple agents’ data in one batch. (If the number of agents estimated in a batch is about 3, recovery works without problems.)
Looking at the trace plots, I found that the chains fail to mix for most agents’ parameters (even for agents whose parameters were estimated well in the separate fits). With 4000 iterations after warmup, the effective sample sizes are at most about 10, which I believe suggests that max_treedepth was hit before sufficient exploration because of an extremely small step size.
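(For reference, this is roughly how I check those diagnostics with cmdstanr; a minimal sketch, where fit stands for the CmdStanMCMC object returned by $sample().)

```r
library(cmdstanr)

# `fit` stands for the CmdStanMCMC object returned by mod$sample(...)

# per-chain counts of divergences, max_treedepth hits, and E-BFMI
fit$diagnostic_summary()

# R-hat and bulk/tail effective sample sizes for every parameter
fit$summary()

# raw sampler diagnostics, e.g. how often treedepth__ saturates at the cap
diag_df <- fit$sampler_diagnostics(format = "df")
table(diag_df$treedepth__)
```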

Based on the results described above, I am guessing that of the two possibilities mentioned in the question at the top, Stan takes the latter, common adjustment. Is this understanding correct?
Also, is it difficult to implement batch estimation and hierarchical modeling in Stan when the geometry of the posterior distribution varies substantially across datasets and parameters, as in the problem setting I am dealing with?

Perhaps it would be possible to explicitly declare a separate set of model parameters for each agent in the Stan model code, but this is not realistic as the number of agents increases. Or would it make a difference if the parameters were declared as arrays instead of vectors?

I would be very grateful if someone could tell me what is going on.

Stan’s algorithms cannot tell the difference between two parameters declared as reals and one that is a vector of length 2. They work over R^d, where d is the total number of scalar values in your parameters block.
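For example, these two toy programs (a sketch with cmdstanr, not your model) give the sampler exactly the same two-dimensional space to work over:

```r
library(cmdstanr)

# One parameter declared as a length-2 vector ...
m_vec <- cmdstan_model(write_stan_file("
  parameters { vector[2] theta; }
  model { theta ~ normal(0, 1); }
"))

# ... versus two scalar parameters: the sampler sees R^2 either way.
m_sca <- cmdstan_model(write_stan_file("
  parameters { real theta1; real theta2; }
  model { theta1 ~ normal(0, 1); theta2 ~ normal(0, 1); }
"))
```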

The adaptation of the mass matrix has at least d free parameters (unless you pick the unit option), so it can adapt for each parameter, but the stepsize is just a single number. It’s not hard to find examples of models where a global mass matrix or a global step size will fail in some regions.
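You can see both pieces in a fitted model's adaptation output; a sketch with cmdstanr, assuming a fitted CmdStanMCMC object fit:

```r
# Adapted inverse metric: one free entry per scalar parameter per chain
# (the diagonal for "diag_e", a full d x d matrix for "dense_e").
str(fit$inv_metric())

# Adapted step size: a single number per chain, constant after warmup.
diag_df <- fit$sampler_diagnostics(format = "df")
tapply(diag_df$stepsize__, diag_df$.chain, unique)
```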

Thank you for your reply and clarification.

I see, so it doesn’t matter whether the parameters are declared as real-valued vectors or individually as real-valued scalars.

The model whose parameters I am currently trying to recover sometimes shows a nice Gaussian-like posterior shape, and sometimes there seem to be correlations between the posteriors, depending on the combination of data and model parameters (in addition, there are very few divergent transitions in most cases).
For this reason, I specify metric = “dense_e” for the mass matrix when sampling with cmdstanr, in reference to this.
Indeed, this setting appears to make sense, as it reduces the number of divergent transitions compared to the default “diag_e”.
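Concretely, the call looks something like this (a sketch; mod and stan_data stand in for my actual compiled model and data list):

```r
library(cmdstanr)

# Sketch; `mod` and `stan_data` stand in for my compiled model and data list.
fit_dense <- mod$sample(
  data = stan_data,
  chains = 4,
  iter_warmup = 1000,
  iter_sampling = 1000,
  metric = "dense_e"  # full covariance mass matrix instead of the default "diag_e"
)
```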

If the mass matrix is adapted for each model parameter separately during the warmup period, then why does the batch estimation have extremely low ESS for all parameters and fail to recover them, even for the data and parameter combinations of agents that were recovered successfully in the separate estimation?
I expected to find a mix of parameters that could be successfully estimated and those that could not.

If it is not due to the adaptation of the mass matrix, I suspect it is due to how the acceptance or rejection of MCMC samples is handled when max_treedepth is exceeded.
More specifically, is it possible that hitting max_treedepth for some hard-to-estimate parameters and datasets (multimodality, complex posterior shape, poor local identifiability) affects the acceptance or rejection of MCMC samples for the other parameters?
I had assumed that this does not happen because I assume the individual parameters are independent; is that not the case?

We describe how it works in the Stan Reference Manual. The short story is that during Phase II of the warmup period, we estimate the mass matrix as a regularized sample covariance (regularized to diagonal, then to unit). This is done for every parameter at once, since it needs the covariances.
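Roughly, the dense update is a shrinkage of the window's sample covariance toward a small multiple of the identity; a simplified sketch (not the exact Stan source, and the constants may differ across versions):

```r
# Simplified sketch of the dense mass-matrix regularization (not the exact
# Stan source; the shrinkage constants here may differ across versions).
regularized_inv_metric <- function(warmup_draws) {
  n <- nrow(warmup_draws)  # draws in the current adaptation window
  d <- ncol(warmup_draws)  # number of unconstrained scalar parameters
  S <- cov(warmup_draws)   # sample covariance across all parameters jointly
  # shrink toward a small multiple of the identity for numerical stability
  (n / (n + 5)) * S + 1e-3 * (5 / (n + 5)) * diag(d)
}
```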

Often you find parameters that are slower to mix than others in a model (e.g., hierarchical population parameters), but either way, the sampling only works if all the parameters are mixing at some rate.

Again, I would urge you to read up on how the no-U-turn sampler works. It builds a Hamiltonian trajectory randomly forward and backward in time, doubling its size each time until it hits a U-turn (where an infinitesimal extension in time brings the ends closer together). Each doubling is one tree depth and we set a max so it doesn’t go into an infinite loop. With very bad geometry, you sometimes need Hamiltonian trajectories of 2^10 states. Sometimes, though very rarely, you can need more. Usually if you hit max tree-depth limits, it means you’re using a sub-optimal parameterization of the model and the problem can often be fixed.
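If you are confident that deeper trajectories are genuinely needed, the cap itself is just a sampler argument; with cmdstanr, for example (a sketch, with mod and stan_data as placeholders):

```r
# Sketch: raising the treedepth cap in cmdstanr. Each extra level doubles the
# maximum trajectory length, so try reparameterizing first.
fit <- mod$sample(
  data = stan_data,       # `mod` and `stan_data` are placeholders
  max_treedepth = 12      # default is 10, i.e. trajectories of up to 2^10 states
)
```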

Stan does not make any assumption about parameter independence.