Confused with different ways to show contrasts on outcome scale

Hi everyone,
I have another question about model interpretation. I have my final models built and thought I understood how to interpret them, but after playing around a bit I am confused again.

The model I have is this using a beta likelihood with a logit link:

EXAM25 ~ 1 + Algorithm + LOC + (1|Project) + (1|Language)

Algorithm has 2 levels and LOC is continuous.

mcmc_areas gives this on the logit scale, which shows a clear difference between the two algorithms:

marginal_effects shows this picture, which looks a lot less certain about the difference, although Linespots seems to have lower EXAM25 than Bugspots:

Now there are two ways to calculate the contrasts that I have seen: one based on the posterior_samples function and one on posterior_predict.
The posterior_samples based one looks like this (for the mean LOC):

post = posterior_samples(model)
contrast = inv_logit_scaled(post$b_Intercept) -
           inv_logit_scaled(post$b_Intercept + post$b_AlgorithmBugspots)

and looks like this:


Again, a clear difference between both algorithms on the outcome scale.

The posterior_predict one looks like this (I have a full factorial design, so both subsets look the same apart from the results and Algorithm columns):

l = posterior_predict(model, newdata = subset(data, Algorithm == "Linespots"))
b = posterior_predict(model, newdata = subset(data, Algorithm == "Bugspots"))
contrast = l - b

This contrast however looks very different from the ones before:

Now I am wondering what is going on here. Is this because the posterior_samples contrast only looks at the mean LOC and sets the project and language effects to 0, while the posterior_predict contrast aggregates across all LOC, Project and Language values? Or am I doing something else wrong?

I assume there is no single "right" way to do this and it depends on what exactly I want to show, as always. However, I am not sure what I should conclude from this now.
Would it be fair to say that Linespots has lower EXAM25 at the mean LOC (and I guess I could test across some range of LOC), with differences between projects and languages skewing the results?
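For what it's worth, one way to get the "mean LOC, population level" contrast on the outcome scale without doing the inverse-link arithmetic by hand is fitted() with the group-level terms dropped. This is just a sketch assuming the `model` and `data` objects from above:

```r
# Sketch: expected EXAM25 for each algorithm at the mean LOC,
# ignoring the Project / Language group-level effects (re_formula = NA).
nd = data.frame(Algorithm = c("Linespots", "Bugspots"),
                LOC = mean(data$LOC))

# summary = FALSE returns one row per posterior draw, one column per nd row
mu = fitted(model, newdata = nd, re_formula = NA, summary = FALSE)

contrast = mu[, 1] - mu[, 2]   # Linespots minus Bugspots, per draw
quantile(contrast, c(0.025, 0.5, 0.975))
```

This should match your posterior_samples version (both are on the probability scale at mean LOC with the varying intercepts set to 0), while posterior_predict additionally includes the beta observation noise.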

I'm not 100% sure that your code does what you think it does in

l = posterior_predict(model, newdata = subset(data, Algorithm == "Linespots"))
b = posterior_predict(model, newdata = subset(data, Algorithm == "Bugspots"))
contrast = l - b

Are l and b really what you want? Those are matrices with one row per posterior draw and one column per row of the newdata, so l - b subtracts the corresponding columns of the two filtered datasets. If the ordering of the data is not very neat, you might be subtracting predictions for very different LOC, Project and Language values, which means you basically just get noise.
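A sketch of how to avoid the alignment problem (assuming your `model` and `data` from above): predict on two copies of the same rows that differ only in Algorithm, so column i of both matrices refers to the same LOC / Project / Language combination:

```r
# Sketch, assuming `model` and `data` from the post.
# Two copies of the full data, identical except for Algorithm,
# so the i-th column of l and b describes the same covariate row.
nd_l = transform(data, Algorithm = "Linespots")
nd_b = transform(data, Algorithm = "Bugspots")

l = posterior_predict(model, newdata = nd_l)
b = posterior_predict(model, newdata = nd_b)

contrast = l - b           # draws x observations, now column-aligned
hist(rowMeans(contrast))   # average contrast per posterior draw
```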

Alternatively, maybe the residual uncertainty of the beta likelihood is very high, so although you are quite certain there is a difference in means, the observation noise is big enough that you get a high posterior probability for differences in either direction.

Hope that makes sense.