I am comparing 3 (quite different) models of cognitive mechanisms underlying the performance of clinical patients (schizophrenia) in an experimental task.
I have a few options for how to perform the model comparison. Here are the pros and cons as I understand them, but I’d appreciate some feedback.

1. Ideally I would fit a multilevel mixture model in which theta is conditioned on participant. In practice this is impossible to fit properly.

2. I can run loo and stacking weights on the 3 multilevel models. Pros: pooling. Cons: many in the field would object to participants being assumed to be similar.

3. I can fit the models to each individual separately and compute loo and stacking weights at the individual level. Pros: individual weights, so different models can be better for different individuals; no pooling. Cons: no pooling.

4. I could extract pointwise loo scores, group them by participant, and do a post hoc individual-level model comparison. Pros: pooling and individual weights. Cons: pooling; a bit convoluted.

At the moment I have implemented 2 and 3, with partially complementary results.
I think you’ve summarised the situation quite well. If I understand 4 correctly, it feels wrong (maybe @avehtari could provide some more thoughts).
At what level is the objection to assuming similar participants a concern? I see two possibilities:
- The pooling would assign similar model coefficients to different patients, which is problematic (IMHO easy to argue against, as you can put a wide prior on the between-patient variability).
- The idea of one model explaining all patients is problematic. In this case, you could do posterior predictive checks to see whether this is actually the case. The fact that the model fits well obviously does not imply that the patients are similar, just that your model is useful.
On a more philosophical note, I like the idea of checking whether the qualitative features of the model fit the data. In this sense loo, stacking or whatever you use may actually be misleading; there is a nice discussion of this by Danielle Navarro: https://psyarxiv.com/39q8y/.
Finally, are you sure you can’t fit the model in option 1? Maybe that’s something people here can help you with…
I have read your question a few times, and I don’t understand what you are trying to do. Can you explain a bit more? Meanwhile, see also the tutorial on CV for hierarchical models: https://avehtari.github.io/modelselection/rats_kcv.html
Aki
Thanks for the answer! The question is:
Can the pointwise (per data point) loo estimates be interpreted? E.g. if I extract the loo estimates from model 1 for all the data points concerning participant 1, and the same for model 2, can I then compare the models for participant 1 alone using those estimates? McElreath, in the new version of his Statistical Rethinking book, explores which data points are easier to explain in one model compared to another, but does so only qualitatively.
I know this is not an ideal procedure, but it is still interesting to know whether and how the pointwise loo estimates can be interpreted and used to compare models for single data points or clusters of data points (e.g. one participant, or one stimulus).
Thanks, this clarified the question.
Yes, you can do this comparison. The comparison is conditional on the model and on the other data used to update the posterior. It therefore indicates which model would be good for each individual, but if you then create another model that uses model 1 for some participants and model 2 for others, the predictive performance for participant 1 can change. A further complication arises if the differences between the predictive performance estimates are small and there are several choices to be made. In that case it would be better to integrate over the models instead of selecting one.
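As a rough illustration of the per-participant comparison described above, here is a minimal sketch in Python with NumPy. It assumes you have already extracted the pointwise elpd values for both models (in the R loo package these are in the `pointwise[, "elpd_loo"]` column of the loo object) together with a participant label for each data point. The function name `compare_by_participant` and the simulated numbers are hypothetical. The standard error uses the usual sqrt(n * var) approximation for a sum of pointwise differences, which can be unreliable when a participant has few observations.

```python
import numpy as np

def compare_by_participant(elpd_a, elpd_b, participant):
    """Sum pointwise elpd differences (model A minus model B) within each participant.

    Returns {participant: (elpd_diff, approx_se, n_points)}.
    """
    diff = np.asarray(elpd_a, dtype=float) - np.asarray(elpd_b, dtype=float)
    participant = np.asarray(participant)
    out = {}
    for pid in np.unique(participant):
        d = diff[participant == pid]
        n = d.size
        # Approximate SE of the summed difference; undefined for a single point.
        se = np.sqrt(n * d.var(ddof=1)) if n > 1 else np.nan
        out[pid.item()] = (d.sum(), se, n)
    return out

# Hypothetical pointwise elpd values: 3 participants, 40 data points each.
rng = np.random.default_rng(1)
participant = np.repeat(np.arange(3), 40)
elpd_m1 = rng.normal(-1.0, 0.3, size=participant.size)
elpd_m2 = elpd_m1 + rng.normal(0.0, 0.2, size=participant.size)
elpd_m2[participant == 2] += 0.5  # model 2 predicts participant 2 better

result = compare_by_participant(elpd_m1, elpd_m2, participant)
for pid, (d, se, n) in result.items():
    print(f"participant {pid}: elpd_diff (m1 - m2) = {d:.1f}, se ~ {se:.1f}, n = {n}")
```

These differences remain conditional on the fitted multilevel models, so they are indicative rather than a substitute for refitting a model that assigns different mechanisms to different participants.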
Thanks! This is very helpful and makes perfect sense.
Hi, I have a related situation, so I thought maybe it makes most sense to tag on to this (sorry for the naive questions).
- Can I report the stacking weights as my model comparison in a paper? I.e. is the advantage of stacking over just summing the loo scores for each model that it takes into account that one model might fit better for some people and another model for others? My problem is that model 1 fits better for most people, but for some people model 2 fits hugely better, so when I just sum the loo scores it looks like model 2 is better.
- If my two models have the same number of parameters with exactly the same priors, do I still need to compute loo, or can I use the log likelihood? (I was thinking of this because, e.g., for AIC the difference from the log likelihood is just a correction for the number of parameters.) Or is this wrong for a hierarchical model, since even though the models are set up to be the same, they might have different effective numbers of parameters because of the hierarchy?
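To make the contrast in the first question concrete, here is a minimal sketch of what stacking weights optimize, next to the summed scores. Stacking chooses simplex weights w that maximize the sum over data points of log(sum_k w_k * exp(lpd_ik)), so it targets the predictive density of the combined model rather than counting how many participants each model wins. The function `stacking_weights_em` is hypothetical and uses a simple EM-style fixed-point update; it is not the implementation in the loo package, which provides `loo_model_weights` for this. The simulated pointwise values are made up for illustration only.

```python
import numpy as np

def stacking_weights_em(lpd, n_iter=5000, tol=1e-10):
    """Weights on the simplex maximizing sum_i log(sum_k w_k * exp(lpd[i, k])).

    lpd: array of shape (n_points, n_models) with pointwise log predictive densities.
    """
    lpd = np.asarray(lpd, dtype=float)
    # Rescale each row for numerical stability; this does not change the optimum.
    p = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    w = np.full(lpd.shape[1], 1.0 / lpd.shape[1])
    for _ in range(n_iter):
        resp = p * w
        resp /= resp.sum(axis=1, keepdims=True)  # per-point model responsibilities
        w_new = resp.mean(axis=0)
        converged = np.abs(w_new - w).max() < tol
        w = w_new
        if converged:
            break
    return w

# Hypothetical data: 10 participants with 50 points each. Model 1 is slightly
# better for 8 participants, model 2 is much better for the remaining 2.
rng = np.random.default_rng(0)
participant = np.repeat(np.arange(10), 50)
lpd1 = rng.normal(-1.0, 0.2, size=participant.size)
lpd2 = lpd1 - 0.05
lpd2[participant >= 8] = lpd1[participant >= 8] + 2.0
lpd = np.column_stack([lpd1, lpd2])

w = stacking_weights_em(lpd)
print("summed pointwise scores per model:", lpd.sum(axis=0))
print("stacking weights:", w)
```

Because both summaries are computed from the same pointwise values, a few participants with very large differences can dominate either one; looking at the pointwise differences grouped by participant shows where the overall result comes from.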