How do I test if a manipulation reduces variance with stan?

Here’s a thought experiment, not far from what I am actually planning to test. Say that I can present sound in 360 degrees surrounding the participant. I then ask the participant a question about the presented sound.

In a classic setup, all sounds would be played equally loud, but I have found that participants are on average better at some positions than others. If I staircase this, it seems that on average sounds should be attenuated most close to the ears, slightly in front, and not at all behind the head. Something like this:

The staircase results already seem convincing to me, but I further wish to test this against a classic setup in which all sounds are equally loud, like this:

So each participant would complete two conditions, classic and balanced, and I believe that performance across positions should be more comparable in the balanced setup. But is that practically measurable?

My question is how I formally test whether performance is more “even” in my balanced setup. One option would be to split the data by condition and, within each condition, compare a model that does not know position against a model that does, using loo. Something like:

# Fit models on one condition's data ("d_classic" is a placeholder name)
library(rstanarm)
m0  <- stan_lmer(accuracy ~ -1 + (1 | participant), data = d_classic)
m1a <- stan_lmer(accuracy ~ -1 + factor(position) + (1 | participant), data = d_classic)

# Compare models
loo_m0 <- loo(m0)
loo_m1a <- loo(m1a)
loo_compare(loo_m0, loo_m1a)

And I would then guess that the “m1a” model would beat “m0” by a larger margin in the classic condition than in the balanced one. Say m1a is 5 standard deviations better than m0 in the classic setup but only 3 standard deviations better in the balanced setup: that would imply that the variance across positions was lower in the balanced setup.

This test does not seem very stringent to me (and probably not to a reviewer either), so I wonder if there is a more formal way to test my question?

All feedback would be highly appreciated.

What does “staircase” mean as a verb here? I got confused about what you were actually presenting to subjects, what they were reporting, and what those dots in circles meant. What is accuracy? Why is volume on the y axis and position on the x axis? I’m also not sure what’s being balanced in the balanced setup (distance from listener?).

Usually we just directly model the thing we care about and then do posterior inference for that, rather than trying to build two models and compare. That is, build the more complicated model and see whether position makes a difference. Usually we don’t care about comparison with 0, but you could evaluate $\Pr[\mathrm{position}[i] > 0]$. Or just measure accuracy in both situations, build a hierarchical model to smooth, and compare.
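A minimal sketch of that direct approach with rstanarm (the data frame `df`, the `condition` column, and the condition labels are assumptions, not from the thread): fit one model for both conditions with a condition-by-position interaction, then compare the posterior spread of the position effects between conditions.

```r
library(rstanarm)

# One model for both conditions; the interaction lets position effects
# differ by condition (names "df", "condition", "classic", "balanced"
# are placeholders for your own data)
m <- stan_glmer(
  accuracy ~ 0 + condition:factor(position) + (1 | participant),
  data = df,
  family = binomial()  # if accuracy is trial-level 0/1; use gaussian() for proportions
)

# Posterior draws of the per-position effects in each condition
draws    <- as.matrix(m)
classic  <- draws[, grep("conditionclassic",  colnames(draws))]
balanced <- draws[, grep("conditionbalanced", colnames(draws))]

# For each posterior draw, the SD across positions measures how "uneven"
# performance is; comparing them gives a direct posterior answer
sd_classic  <- apply(classic,  1, sd)
sd_balanced <- apply(balanced, 1, sd)
mean(sd_balanced < sd_classic)  # Pr(balanced is more even than classic)
```

That last quantity is a direct posterior probability for the question being asked, with no need for a two-model loo comparison.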