Running Bayes factors on measurement error models is producing weird results

One possibility would be to focus on the largest model, because (I think) the other models are special cases of it. The posterior distribution of the largest model would then provide information about the plausibility of those special cases.

The loo metrics have a similar issue here: the comparison depends on whether the random effects are conditioned on (treated as parameters) or integrated out, and it is possible to reach different conclusions depending on which choice you make.
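To illustrate the two choices, here is a hedged sketch using brms (assuming `fit1` and `fit2` are already-fitted brms models and `"id"` is a hypothetical grouping variable for the random effects — substitute your own):

```r
library(brms)

# Conditional comparison: the default loo() pointwise likelihoods
# condition on the estimated random effects for each observation.
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo_compare(loo1, loo2)

# Marginal-style comparison: leave out whole groups, so the random
# effects for a held-out group must be predicted rather than
# conditioned on. This can rank the models differently.
kf1 <- kfold(fit1, folds = "grouped", group = "id")
kf2 <- kfold(fit2, folds = "grouped", group = "id")
loo_compare(kf1, kf2)
```

The conditional version asks how well the model predicts new observations from the *same* groups; the grouped k-fold version asks about *new* groups, which is often the question that matters when the random effects are nuisance structure.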

I think this is one of those situations that often arise in applications, where the time/effort required to get an accurate computation may outweigh the need for the computation. If you went with bayes_factor(), you might at least run it a few times to see how much the value varies. Though I realize that might take a while.
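Repeating the estimate is straightforward, since bridge sampling is stochastic: each call to bayes_factor() re-runs the bridge sampler on the same posterior draws. A minimal sketch (again assuming fitted brms models `fit1` and `fit2`; the `$bf` element is how the bridgesampling result stores the estimate):

```r
library(brms)

# Re-estimate the Bayes factor several times to gauge the
# Monte Carlo variability of the bridge-sampling estimate.
bfs <- replicate(5, bayes_factor(fit1, fit2)$bf)

bfs         # the individual estimates
range(bfs)  # spread across runs
```

If the estimates vary by orders of magnitude across runs, that itself is a sign the marginal likelihoods are not being estimated accurately enough to trust — refitting with more posterior draws (bridge sampling typically needs far more than parameter estimation does) would be the next step.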