Suggestions to justify use of stan/bayesian MLM (to journal editors/reviewers)


I recently started experimenting with stan, leading to a submitted manuscript (PsyArXiv Preprints | Bayesian multilevel modelling of simultaneity judgements discriminates between observer models and reveals how participants’ strategy influences the window of subjective synchrony). It has just collected its first set of reviews (and a rejection, boo hoo). The paper uses stan to compare multilevel models of a psychology/psychophysics task in which people make judgements about whether two events happen at the same time.

I’m thinking about the various changes I will make before resubmitting elsewhere, and was wondering if anyone had thoughts about improving my rationale for the general stan / bayesian MLM approach, for example good citations to make it stronger or more compelling. I confess that I was quite lazy in my engagement with this literature. Of course comments on all other aspects of the paper/modelling would also be very welcome if anyone wants to take a look, but my specific question is about responding to the following comments from the editor (I will copy my text from the paper’s introduction that they are commenting on underneath):

“Another reaction, shared to a fair extent with the reviewers, was that the model implementation and analysis requires so many auxiliary assumptions, choices, and approximations (even including reliance on recommendations of STAN developers)…It is true that some motivations for this complex approach were offered, but these were a bit slippery. For example, the sentence starting on line 189 is very likely accurate, but “can be” is not probative here. The sentence is logically a true statement about Technique A if A works better than Technique B in X% of cases and works worse in (100-X)% of cases. But an argument in favour of A also needs to establish something about the value of X (e.g., it is greater than 50%) and probably even about the sizes of the winning margin for each technique when it wins. The sentence starting on line 208 sounds good, but it is really only valid if one already accepts the Bayesian premise that parameters have distributions of values. If parameters are instead fixed constants, as frequentists maintain, then MLE is demonstrably optimal. The next sentence that starts on line 210 makes a further impressive claim. But this claim must surely be specific to certain classes of models, and what is the evidence that the models being considered here fall into one of these classes? In the end, it was difficult to see a strong justification for this complex approach in comparison with the simpler approach of fitting each model by maximum likelihood to each participant’s data and comparing the fits. Fitting the data of each participant separately seems especially attractive when one considers that different models may be appropriate for different participants.”

169 1.2 Potential benefits of multilevel models
170 If lack of access to proper model-fitting software (and/or understanding of plausible models) has
171 been problematic for researchers using the SJ task, this problem is only exacerbated when we
172 consider multilevel approaches to data analysis (Goldstein & McDonald, 1988). As we have noted, a
173 two-step process has often been applied by experimental psychologists to the analysis of
174 psychophysical data. First, a function is fit to data for each participant and condition separately, then
175 group-level estimates of derived parameters are assessed using inferential approaches, such as the
176 t-test. However, it is often no longer necessary to separate these steps. Instead, all participants and
177 conditions can be fitted at the same time within a model that acknowledges the clustering of
178 individual data points (here, responses within participants) and explicitly models random variation
179 across clusters (here, differences in participant-level parameters across the group). In recent years,
180 such multilevel models have seen widespread advocacy and adoption across diverse fields including
181 neuroscience (Aarts et al., 2014) and psychology (Barr et al., 2013). This includes the active
182 promotion of their use to analyse data from psychophysical tasks (e.g. Moscatelli et al., 2012).
183 Indeed, for standard (sigmoidal) psychometric functions, packages such as the Palamedes toolbox
184 (Prins & Kingdom, 2018) now offer multilevel approaches “off the shelf”. However, we are not aware
185 of any such option for those interested in modelling SJs.
186 This is a shame, because multilevel models have advantages over a two-stage analysis. Perhaps most
187 importantly, by fitting all participants at once, multilevel models can generate “shrinkage”, whereby
188 well-estimated participants help constrain parameter estimates for less well-estimated participants
189 (Lambert, 2018). The result can be more powerful, robust and reliable estimation that generally
190 performs better in out-of-sample prediction (Aarts et al., 2014; Lambert, 2018; Moscatelli et al.,
191 2012). Shrinkage also seems to have considerable practical value in a field where it is common to
192 reject participants on the basis that their data are inadequate to generate reliable parameter
193 estimates (and in which pre-registration of exclusion criteria is not yet the norm). If there are ways
194 to reduce the number of participants who have to be excluded, we should probably adopt them (but
195 see the discussion for caveats).
196 Multilevel models come in both frequentist and Bayesian flavours. Here we adopt the latter
197 approach, for several reasons. One is practical. We utilise the open-source Bayesian modelling
198 programming language “Stan” and associated R packages for our analyses (Stan Development Team
199 2020; 2022) which are free, relatively fast to execute, and offer a good balance between what they
200 expose to the programmer (i.e. flexibility to implement a wide range of bespoke models) and what
201 they hide (the implementation of state-of-the-art Hamiltonian Monte-Carlo no U-turn sampling, to
202 estimate the posterior distribution of model parameters). Although by no means trivial to learn, the
203 Stan/R combination is easier than coding properly functioning frequentist (i.e. maximum likelihood
204 estimation; MLE) searches from scratch, particularly for complex models with very large numbers of
205 parameters. Other reasons for favouring the Bayesian approach are more conceptual. Specifically,
206 Bayesian models encourage the use of sensible priors, or rather hyper-priors in the case of multilevel
207 models (which when used judiciously, should further enhance the reliability of recovered
208 parameters). They also make use of the full distribution of plausible parameter values from the
209 posterior when assessing the goodness of a model’s fit, rather than relying exclusively on the mode
210 of the posterior, as per MLE. Compared to popular metrics like the Akaike information criterion
211 (AIC), Bayesian metrics (e.g. estimation of leave-one-out cross validation via Pareto smoothed
212 importance sampling; Vehtari et al., 2017) are likely to provide a better estimate of a model’s
213 out-of-sample predictive accuracy, and thus a fairer means of comparing models with different
214 architectures (Lambert, 2018).
215 Here, we will describe, in some detail, an analysis using Bayesian multilevel models for SJ data. We
216 also share our commented code which implements that analysis as a potential template for other
217 researchers interested in developing bespoke multilevel analyses of their own data. However, before
218 we can move to a multilevel approach, we must decide which participant-level models to build upon,
219 a question to which we turn next.

As usual there’s a lot going on with an editorial reply, and though this isn’t my field at all I’d guess that there’s a good chance that politics and gatekeeping are playing a disproportionate role in the final assessment and decision, i.e. if you used an objectively inferior model that is common for that kind of research nobody would question it, but if you try to use a more sophisticated one you have to justify it, even though a weak justification is still better than no justification. (The exception is if the paper is actually about the method, in which case a weak justification wouldn’t be acceptable, but the inferior model would not be a paper at all.)

Part of the problem is what you are up against, and from the statement above it is probably researchers who are experts in their fields but who lack the expertise to assess statistical methods yet feel entitled to do so anyway. There is no such thing as “The Bayesian Premise”; the only technical difference between an MLE and a MAP estimate is that the former must assume flat priors (see for instance a recent discussion here). Parameters do have distributions, you just have to compute them. That is true of frequentist approaches as well: it is how confidence intervals are established, and nobody rejects the use of CIs on the basis that it requires accepting the Bayesian premise.
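As a rough illustration of the MLE/MAP point, here is a minimal Python sketch. It assumes a simple binomial proportion with a conjugate Beta prior, which is an illustrative choice and not one of the models from the paper; the function names and the numbers are made up for the example.

```python
def mle_proportion(k, n):
    """Maximum likelihood estimate of a binomial proportion."""
    return k / n


def map_proportion(k, n, a=1.0, b=1.0):
    """Posterior mode under a Beta(a, b) prior (hypothetical helper).

    The posterior is Beta(k + a, n - k + b); its mode is the expression
    below, valid when k + a > 1 and n - k + b > 1.
    """
    return (k + a - 1) / (n + a + b - 2)


k, n = 7, 20  # made-up data: 7 "synchronous" responses out of 20 trials

mle = mle_proportion(k, n)
map_flat = map_proportion(k, n)                    # Beta(1, 1) is a flat prior
map_informative = map_proportion(k, n, a=5, b=5)   # prior centred on 0.5

print(mle, map_flat, map_informative)
```

With the flat prior the two estimates coincide; with the informative prior the MAP estimate is pulled from the MLE towards the prior centre of 0.5.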

On the other hand, it is true that you could put some figures on the performance of multilevel models vs something else, and justify other aspects in a more objective way, but again the truth is you shouldn’t have to justify the choice of Bayesian vs frequentist inference any more than you have to justify the model itself (up to the priors, but if anyone insists on the discussion that “priors are subjective” we’re kind of back in the 80s).
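One way to put rough numbers on that comparison is a small simulation. The sketch below is only an illustration under strong assumptions: a normal-normal toy model rather than the SJ observer models, a known trial-level noise SD, and an empirical-Bayes shrinkage estimator standing in for a full Bayesian multilevel fit in Stan. All names and parameter values are made up for the example.

```python
import random
import statistics


def simulate_once(rng, n_participants=30, mu=0.0, tau=1.0, sigma=4.0):
    """Simulate one experiment; return squared error sums (unpooled, pooled)."""
    true_vals, means, sampling_vars = [], [], []
    for _ in range(n_participants):
        theta = rng.gauss(mu, tau)            # participant's true parameter
        n_trials = rng.randint(5, 40)         # unequal amounts of data
        trials = [rng.gauss(theta, sigma) for _ in range(n_trials)]
        true_vals.append(theta)
        means.append(statistics.fmean(trials))      # per-participant estimate
        sampling_vars.append(sigma ** 2 / n_trials)  # its sampling variance

    # Method-of-moments estimate of the between-participant variance,
    # floored at zero (assumes sigma is known, which is a simplification).
    grand_mean = statistics.fmean(means)
    tau2_hat = max(statistics.variance(means) - statistics.fmean(sampling_vars), 0.0)

    # Partial pooling: noisier participants are shrunk harder to the group mean.
    pooled = []
    for m, v in zip(means, sampling_vars):
        w = tau2_hat / (tau2_hat + v)
        pooled.append(w * m + (1 - w) * grand_mean)

    se_unpooled = sum((m - t) ** 2 for m, t in zip(means, true_vals))
    se_pooled = sum((p - t) ** 2 for p, t in zip(pooled, true_vals))
    return se_unpooled, se_pooled


def compare(n_sims=200, n_participants=30, seed=1):
    """Return RMSE of each approach and how often partial pooling wins."""
    rng = random.Random(seed)
    total_unpooled = total_pooled = 0.0
    wins = 0
    for _ in range(n_sims):
        se_u, se_p = simulate_once(rng, n_participants=n_participants)
        total_unpooled += se_u
        total_pooled += se_p
        wins += se_p < se_u
    n_estimates = n_sims * n_participants
    rmse_unpooled = (total_unpooled / n_estimates) ** 0.5
    rmse_pooled = (total_pooled / n_estimates) ** 0.5
    return rmse_unpooled, rmse_pooled, wins / n_sims


rmse_unpooled, rmse_pooled, win_rate = compare()
print(f"RMSE, separate fits:     {rmse_unpooled:.3f}")
print(f"RMSE, partial pooling:   {rmse_pooled:.3f}")
print(f"Share of datasets where pooling wins: {win_rate:.2f}")
```

The last number is the kind of X the editor asks about, and the RMSE gap is the winning margin. Running the same exercise with data simulated from the actual SJ models would be the more convincing version.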

All of that said, my personal opinion is that you shouldn’t promote your method as “Bayesian” or the use of Stan in a way that requires explicit justification, but rather describe well and justify the model itself regardless of the inference method, because in principle you could do it within either a Bayesian or frequentist formulation (again, up to the priors) and using any package. By downplaying the Bayesian aspect of your analysis and deemphasizing the specific software I think you may fly below the radar of nonexpert nitpicking. But by still stating that the analysis is indeed Bayesian and that you are using Stan, anyone who has some expertise in statistics will know that you are using a well-justified approach with a state-of-the-art inference implementation, so you’d get the best of both worlds. I’d leave the Bayesian-Frequentist wars for other settings where it won’t happen behind closed doors with an arbitrary power asymmetry and affect publication decisions.

But that’s just like, my opinion, you or others may want to take this on in every possible battlefield.


Thanks Caesoma. I think you are probably right - in this paper I got a little stranded between different messages (promoting a different approach to fitting this kind of data vs. testing substantive hypotheses) and I think the safer course is to focus more on the latter.
