Testing consistency of observers ordinal ratings across multiple experiments

Hi. I have an experiment where people rate a bunch of images. Some of these images have been run through filters (think Instagram) and I have found that some observers are particularly bothered by some types of filters. To replicate this effect, I created a synthetic dataset and ran the following model:

brm(rating ~ filter + (filter|id), data = results,
             family = cumulative(threshold = "flexible"),
             chains = 2, cores = 2, iter=3000)

In this example, the first 8 observers are particularly bothered by the “Bad filter” whereas the following 8 observers are equally bothered by the bad and medium filter. In reality, there will be more than just these two patterns in my data.

This is already an interesting result to me, but I now have the question of whether observers are robust in their ratings. So if I invited the same observers to a new session later are the pattern in their ratings then similar to the current or not? For example, will ID 1 again be particularly bothered by the bad filter?

I am not quite sure of how to proceed if I collected such data. Should I, for instance, model the two experiments separately and then subtract the posteriors for each individual? Should I somehow create one model that includes both experiments? I have also considered if I should use some kind of stratified cross-validation as introduced here by Aki Cross-validation for hierarchical models

It would be great if I somehow end up with a procedure that could both show if the ratings are consistent per observer or if they fluctuate.

I have written the code to simulate an experiment 1 and 2 here. It also includes the model I have used and the code to generate the plot attached.

Thanks for reading so far. I would really appreciate input on how to proceed from here.

I’m at work and unfortunately the website is blocked for your experiment and data, but I have a few ideas.
If the second experiment is basically a replication of the first, with the same observers, you might just consider adding experiment into the hierarchical model as another level. For example, you could do something like (filter|experiment/id), such that you have id’s nested within the different experiments. You need a reasonable prior on the sd for experiment, especially with only 2. This might work particularly well if you ran several more experiments as well.
You would obtain varying intercepts and slopes for each experiment and each id within experiment.
You might take a look at this thread. It’s a little lengthy and not exactly the same topic, but I ran a model for data with the design of matched pairs within experiment within object type. In that thread Aki links to examples of using CV to evaluate hierarchical models and has code to do integrated LOO, for example here, if you want to do leave-group-out CV.

1 Like

Actually if you have the same observers, you may want (filter|id/experiment) as you have multiple experiments per observer.

Thank you for your answer. It helped me, but I still have some questions.

First of all, I have never seen model definitions of the following kind (filter|id/experiment)
Do you know if there are any vignettes or guides that explain exactly what the / does? I have searched for a while with no luck.

Second, I went ahead and defined the model as you proposed:

brm(rating ~ filter + (filter|id/experiment), data = data,
                   family = cumulative(threshold = "flexible"),
                   chains = 2, cores = 2, iter=3000)

This seems to give a model that does what I want. At least, it seems to understand that each id is being run through two experiments. However, I am not quite sure where to go from here?

First of all, I can plot the model and it sure seems like there are no differences, but “seeming” is probably not the ideal way to evaluate a model:

# Look at categorical model
consub <- make_conditions(data, vars = c("id", "experiment"))
  conditions = consub,
  re_formula = NULL, categorical = F

My second idea would be to extract the random effects estimated from my model like so:

, , Intercept

     Estimate Est.Error   Q2.5 Q97.5
1_1    -0.024     0.060 -0.183 0.069
1_2     0.014     0.058 -0.093 0.153
10_1    0.020     0.061 -0.078 0.178
10_2   -0.017     0.058 -0.162 0.083
11_1   -0.006     0.057 -0.138 0.107
11_2   -0.010     0.056 -0.145 0.101

, , filter1:Bad

     Estimate Est.Error   Q2.5 Q97.5
1_1    -0.008     0.141 -0.319 0.293
1_2    -0.031     0.146 -0.359 0.262
10_1    0.037     0.143 -0.229 0.370
10_2   -0.002     0.138 -0.297 0.294
11_1    0.055     0.155 -0.217 0.433
11_2   -0.018     0.139 -0.320 0.263

, , filter2:Medium

     Estimate Est.Error   Q2.5 Q97.5
1_1     0.012     0.088 -0.157 0.225
1_2     0.036     0.095 -0.127 0.268
10_1   -0.036     0.099 -0.275 0.126
10_2    0.010     0.092 -0.180 0.224
11_1   -0.035     0.096 -0.288 0.129
11_2   -0.009     0.088 -0.217 0.166

I could then show that all the Q2.5 are negative whereas all Q97.5 which shows that the model does not change its estimates between the two experiments. However, this approach is likely too strict as it would not make sense to conclude that the observers are not consistent if 1 single observer changes behaviour in the second experiment?

I will think more about this, but I will appreciate any input that can help me move forward.

See the brms multilevel model vignette page 4. This is typical notation for nested group-level effects in brms and lme4 (I believe).

I like the plots. It shows what you want to know - if there was high variability between experiments within persons, then you would see substantial differences. What you want to report and how you want to display the results maybe depends on your interest, too. You have modeled the two experiments and the variation both between persons and between experiments within persons, so if really what you are interested in is ‘population-level’ effect of filter then you have an estimate of that. You also have the sd’s for the group-level effects, so you can report the variability at the different levels.