Hi,

I recently started experimenting with Stan, which led to a submitted manuscript (PsyArXiv preprint: “Bayesian multilevel modelling of simultaneity judgements discriminates between observer models and reveals how participants’ strategy influences the window of subjective synchrony”). It has just collected its first set of reviews (and a rejection, boo hoo). The paper uses Stan to compare multilevel models of a psychology/psychophysics task in which people judge whether two events happen at the same time.

I’m thinking about the various changes I will make before resubmitting elsewhere, and was wondering if anyone had thoughts on improving my rationale for the general Stan / Bayesian MLM approach, for example good citations to make it stronger or more compelling. I confess that I was quite lazy in my engagement with this literature. Comments on any other aspect of the paper or modelling would also be very welcome if anyone wants to take a look, but my specific question is about responding to the following comments from the editor (underneath them I have copied the text from the paper’s introduction that they are commenting on):

“Another reaction, shared to a fair extent with the reviewers, was that the model implementation and analysis requires so many auxiliary assumptions, choices, and approximations (even including reliance on recommendations of STAN developers)…It is true that some motivations for this complex approach were offered, but these were a bit slippery. For example, the sentence starting on line 189 is very likely accurate, but “can be” is not probative here. The sentence is logically a true statement about Technique A if A works better than Technique B in X% of cases and works worse in (100-X)% of cases. But an argument in favour of A also needs to establish something about the value of X (e.g., it is greater than 50%) and probably even about the sizes of the winning margin for each technique when it wins. The sentence starting on line 208 sounds good, but it is really only valid if one already accepts the Bayesian premise that parameters have distributions of values. If parameters are instead fixed constants, as frequentists maintain, then MLE is demonstrably optimal. The next sentence that starts on line 210 makes a further impressive claim. But this claim must surely be specific to certain classes of models, and what is the evidence that the models being considered here fall into one of these classes? In the end, it was difficult to see a strong justification for this complex approach in comparison with the simpler approach of fitting each model by maximum likelihood to each participant’s data and comparing the fits. Fitting the data of each participant separately seems especially attractive when one considers that different models may be appropriate for different participants.”

169 1.2 Potential benefits of multilevel models

170 If lack of access to proper model-fitting software (and/or understanding of plausible models) has

171 been problematic for researchers using the SJ task, this problem is only exacerbated when we

172 consider multilevel approaches to data analysis (Goldstein & McDonald, 1988). As we have noted, a

173 two-step process has often been applied by experimental psychologists to the analysis of

174 psychophysical data. First, a function is fit to data for each participant and condition separately, then

175 group-level estimates of derived parameters are assessed using inferential approaches, such as the

176 t-test. However, it is often no longer necessary to separate these steps. Instead, all participants and

177 conditions can be fitted at the same time within a model that acknowledges the clustering of

178 individual data points (here, responses within participants) and explicitly models random variation

179 across clusters (here, differences in participant-level parameters across the group). In recent years,

180 such multilevel models have seen widespread advocacy and adoption across diverse fields including

181 neuroscience (Aarts et al., 2014) and psychology (Barr et al., 2013). This includes the active

182 promotion of their use to analyse data from psychophysical tasks (e.g. Moscatelli et al., 2012).

183 Indeed, for standard (sigmoidal) psychometric functions, packages such as the Palamedes toolbox

184 (Prins & Kingdom, 2018) now offer multilevel approaches “off the shelf”. However, we are not aware

185 of any such option for those interested in modelling SJs.

186 This is a shame, because multilevel models have advantages over a two-stage analysis. Perhaps most

187 importantly, by fitting all participants at once, multilevel models can generate “shrinkage”, whereby

188 well-estimated participants help constrain parameter estimates for less well-estimated participants

189 (Lambert, 2018). The result can be more powerful, robust and reliable estimation that generally

190 performs better in out-of-sample prediction (Aarts et al., 2014; Lambert, 2018; Moscatelli et al.,

191 2012). Shrinkage also seems to have considerable practical value in a field where it is common to

192 reject participants on the basis that their data are inadequate to generate reliable parameter

193 estimates (and in which pre-registration of exclusion criteria is not yet the norm). If there are ways

194 to reduce the number of participants who have to be excluded, we should probably adopt them (but

195 see the discussion for caveats).

196 Multilevel models come in both frequentist and Bayesian flavours. Here we adopt the latter

197 approach, for several reasons. One is practical. We utilise the open-source Bayesian modelling

198 programming language “Stan” and associated R packages for our analyses (Stan Development Team

199 2020; 2022) which are free, relatively fast to execute, and offer a good balance between what they

200 expose to the programmer (i.e. flexibility to implement a wide range of bespoke models) and what

201 they hide (the implementation of state-of-the-art Hamiltonian Monte-Carlo no U-turn sampling, to

202 estimate the posterior distribution of model parameters). Although by no means trivial to learn, the

203 Stan/R combination is easier than coding properly functioning frequentist (i.e. maximum likelihood

204 estimation; MLE) searches from scratch, particularly for complex models with very large numbers of

205 parameters. Other reasons for favouring the Bayesian approach are more conceptual. Specifically,

206 Bayesian models encourage the use of sensible priors, or rather hyper-priors in the case of multilevel

207 models (which when used judiciously, should further enhance the reliability of recovered

208 parameters). They also make use of the full distribution of plausible parameter values from the

209 posterior when assessing the goodness of a model’s fit, rather than relying exclusively on the mode

210 of the posterior, as per MLE. Compared to popular metrics like the Akaike information criterion

211 (AIC), Bayesian metrics (e.g. estimation of leave-one-out cross validation via Pareto smoothed

212 importance sampling; Vehtari et al., 2017) are likely to provide a better estimate of a model’s out-of-sample

213 predictive accuracy, and thus a fairer means of comparing models with different

214 architectures (Lambert, 2018).

215 Here, we will describe, in some detail, an analysis using Bayesian multilevel models for SJ data. We

216 also share our commented code which implements that analysis as a potential template for other

217 researchers interested in developing bespoke multilevel analyses of their own data. However, before

218 we can move to a multilevel approach, we must decide which participant-level models to build upon,

219 a question to which we turn next.
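
To make the shrinkage claim on lines 187-190 a bit more concrete, here is a minimal simulation sketch in Python. It is not the Stan model from the paper: it assumes a toy normal-normal setup with a known trial-level noise SD, and it uses a simple empirical-Bayes (method-of-moments) shrinkage rule as a stand-in for a full multilevel fit. All names and parameter values (`simulate`, `compare`, `group_sd`, `noise_sd`, the trial counts) are made up for illustration. It compares how well separate per-participant estimates and shrunken estimates recover the simulated true values.

```python
import random
import statistics


def simulate(n_participants=200, group_mean=0.0, group_sd=0.5,
             noise_sd=2.0, seed=1):
    """Simulate participants whose true parameter is drawn from a group
    distribution, each observed over a different number of noisy trials.
    All values here are illustrative assumptions, not taken from the paper."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_participants):
        true_value = rng.gauss(group_mean, group_sd)
        n_trials = rng.randint(4, 40)  # some participants are poorly estimated
        obs = [rng.gauss(true_value, noise_sd) for _ in range(n_trials)]
        data.append((true_value, obs))
    return data


def compare(data, noise_sd=2.0):
    """Return (MSE of separate estimates, MSE of shrunken estimates)."""
    # Step 1 of the classic two-step analysis: one estimate per participant.
    means = [statistics.fmean(obs) for _, obs in data]
    # Squared standard error of each participant's mean (noise SD assumed known).
    se2 = [noise_sd ** 2 / len(obs) for _, obs in data]

    # Empirical-Bayes estimates of the group mean and between-participant
    # variance (method of moments, floored so it stays positive).
    grand_mean = statistics.fmean(means)
    tau2 = max(statistics.variance(means) - statistics.fmean(se2), 1e-6)

    # Partial pooling: noisier participants are pulled harder to the group mean.
    shrunk = [grand_mean + (tau2 / (tau2 + s2)) * (m - grand_mean)
              for m, s2 in zip(means, se2)]

    truth = [t for t, _ in data]
    mse_separate = statistics.fmean((m - t) ** 2 for m, t in zip(means, truth))
    mse_shrunk = statistics.fmean((s - t) ** 2 for s, t in zip(shrunk, truth))
    return mse_separate, mse_shrunk


if __name__ == "__main__":
    mse_separate, mse_shrunk = compare(simulate())
    print(f"separate fits MSE: {mse_separate:.3f}")
    print(f"shrunken fits MSE: {mse_shrunk:.3f}")
```

A simulation like this only shows that shrinkage helps when the group-level assumption roughly holds, which is true by construction here. Re-running it with the actual SJ observer models and realistic trial counts would be one way to speak to the editor's question about how often, and by how much, the multilevel approach wins.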