Selecting the best BRMS model

Hi all,

I’m looking at the influence of several variables on a response variable, and to do this I’ve built several models. I then used the loo function to identify the best model, but there was no clear difference between my models.
So I don’t know how to choose which variables have the greatest influence on my dependent variable.
I’ve attached the code for my 4 compared models as well as the R output.

model1 <- brm(nestedness.2 ~ modularity_mean + within_mod +
                (1 | gr(apes, cov = A)) + (1 | Subpopulation),
              data = DATAA, family = gaussian(), data2 = list(A = A),
              prior = c(prior(inv_gamma(0.002, 0.002), "sd"),
                        prior(inv_gamma(0.002, 0.002), "sigma")),
              sample_prior = TRUE, chains = 4, cores = 2, iter = 125000,
              warmup = 62500, thin = 25, save_pars = save_pars(all = TRUE))

model2 <- brm(nestedness.2 ~ density_mean + within_den + modularity_mean + within_mod +
                sex_ratio_mean + within_SR + (1 | gr(apes, cov = A)) + (1 | Subpopulation),
              data = DATAA, family = gaussian(), data2 = list(A = A),
              prior = c(prior(inv_gamma(0.002, 0.002), "sd"),
                        prior(inv_gamma(0.002, 0.002), "sigma")),
              sample_prior = TRUE, chains = 4, cores = 2, iter = 125000,
              warmup = 62500, thin = 25)

model3 <- brm(nestedness.2 ~ modularity_mean + within_mod + size_mean + within_size +
                (1 | gr(apes, cov = A)) + (1 | Subpopulation),
              data = DATAA, family = gaussian(), data2 = list(A = A),
              prior = c(prior(inv_gamma(0.002, 0.002), "sd"),
                        prior(inv_gamma(0.002, 0.002), "sigma")),
              sample_prior = TRUE, chains = 4, cores = 2, iter = 125000,
              warmup = 62500, thin = 25, save_pars = save_pars(all = TRUE))

model4 <- brm(nestedness.2 ~ density_mean + within_den + modularity_mean + within_mod +
                sex_ratio_mean + within_SR + size_mean + within_size +
                (1 | gr(apes, cov = A)) + (1 | Subpopulation),
              data = DATAA, family = gaussian(), data2 = list(A = A),
              prior = c(prior(inv_gamma(0.002, 0.002), "sd"),
                        prior(inv_gamma(0.002, 0.002), "sigma")),
              sample_prior = TRUE, chains = 4, cores = 2, iter = 125000,
              warmup = 62500, thin = 25, save_pars = save_pars(all = TRUE))

# PSIS-LOO for each model (moment_match = TRUE requires the model to have
# been fitted with save_pars(all = TRUE))
loo4.4 <- loo(model4, moment_match = TRUE)
loo3.3 <- loo(model3, moment_match = TRUE)
loo2.2 <- loo(model2)
loo1.1 <- loo(model1)

loo_compare(loo1.1,loo2.2,loo3.3,loo4.4)

#        elpd_diff se_diff elpd_loo p_loo looic
# model1       0.0     0.0     11.2   7.2 -22.4
# model3      -0.2     1.4     11.0   8.4 -22.0
# model2      -0.9     3.1     10.3  11.3 -20.6
# model4      -1.5     3.5      9.7  12.7 -19.4

In addition, I ran the same analysis with the MCMCglmm package and compared those models with each other using the DIC criterion. The rankings came out reversed: the best model according to LOO in brms (model1) is the worst according to DIC in MCMCglmm, while the worst according to LOO (model4) is the best according to DIC. I don’t really understand how this is possible. Could it be due to the difference in the selection criteria?

Thank you in advance for your feedback

LOO is telling you that the data do not contain enough information to say which model is best in terms of its expected leave-one-out predictive error.

Yes. If the various criteria consistently gave the same answers, then we wouldn’t have so many of them. But also, there’s no real conflict here, since LOO is telling you that it doesn’t know which model is best. DIC probably doesn’t know either.


Thank you very much for your reply.

LOO is telling you that the data do not contain enough information to say which model is best in terms of its expected leave-one-out predictive error.

I didn’t quite understand how you can see that from the output. Could you explain?

Yes. If the various criteria consistently gave the same answers, then we wouldn’t have so many of them. But also, there’s no real conflict here, since LOO is telling you that it doesn’t know which model is best. DIC probably doesn’t know either.

So I guess there’s no clear answer here as to which model best explains my data?

I just have one last question. I recently noticed that the distribution of my dependent variable doesn’t follow a normal distribution. So I ran a few post-hoc diagnostic checks to see whether my models fit the data; the plots looked fairly reasonable, apart from the fitted-vs-residuals plot, which shows a violation of the homoscedasticity assumption (attached).

I transformed my data with the sqrt function (a logarithmic transformation is impossible because of the presence of zeros), but it made absolutely no difference.
I’ve also attached a plot of the distribution of my response variable.

Do you have a suggestion for which family to specify in the brm call so that the model fits the data better?

Thank you in advance,

From the lack of large differences in elpd between the candidate models: every elpd_diff is small relative to its se_diff, so LOO cannot distinguish the models.
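
To make that concrete, a quick heuristic (not a formal test) is to compare each elpd_diff to its se_diff straight from the loo_compare output; differences within a couple of standard errors of zero are indistinguishable from noise:

cmp <- loo_compare(loo1.1, loo2.2, loo3.3, loo4.4)
# ratios well below ~2 in absolute value: the models are indistinguishable
# (the first row is 0/0 = NaN by construction, since it is the reference model)
cmp[, "elpd_diff"] / cmp[, "se_diff"]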

I can’t just pick the best family by eyeballing the residuals. If the responses are all non-negative, you could experiment with a zero-hurdle gamma distribution. From your plot of the data it looks like the data are all less than one. If that’s structurally guaranteed by the setup, then you could experiment with a zero-inflated beta. Finally, you could experiment with distributional regression, for example by using predictors for sigma in a gaussian regression. However, if the data are guaranteed to be nonnegative, this wouldn’t be my first approach.
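
Untested sketches of what those options could look like in brms, reusing your first model’s formula (priors and sampler settings omitted for brevity; the fit names are just placeholders):

# Hurdle gamma: for non-negative responses with exact zeros
fit_hg <- brm(nestedness.2 ~ modularity_mean + within_mod +
                (1 | gr(apes, cov = A)) + (1 | Subpopulation),
              data = DATAA, data2 = list(A = A), family = hurdle_gamma())

# Zero-inflated beta: if the response is structurally confined to [0, 1)
fit_zib <- brm(nestedness.2 ~ modularity_mean + within_mod +
                 (1 | gr(apes, cov = A)) + (1 | Subpopulation),
               data = DATAA, data2 = list(A = A), family = zero_inflated_beta())

# Distributional regression: let sigma depend on a predictor as well
fit_ds <- brm(bf(nestedness.2 ~ modularity_mean + within_mod +
                   (1 | gr(apes, cov = A)) + (1 | Subpopulation),
                 sigma ~ modularity_mean),
              data = DATAA, data2 = list(A = A), family = gaussian())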

Note that loo is tricky for zero-hurdle continuous responses, because you have to choose how to weight the discrete and continuous parts. Not sure what the brms default for the hurdle_gamma family is; I would guess that it just uses the normalized mass (for zeros) and the normalized density (for nonzeros), which may or may not be a sensible way to do comparisons for your model.
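
If you do go the hurdle route, one way to see how the two parts are being weighted is to inspect the pointwise log-likelihood that brms reports for the zeros versus the nonzeros (a sketch, assuming a fitted fit_hg as above):

ll <- log_lik(fit_hg)                 # matrix: posterior draws x observations
zeros <- which(DATAA$nestedness.2 == 0)
colMeans(ll)[zeros]                   # average log score of the zero observations
colMeans(ll)[-zeros]                  # average log score of the continuous part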


Thank you very much for your answer!

I can’t just pick the best family by eyeballing the residuals. If the responses are all non-negative, you could experiment with a zero-hurdle gamma distribution. From your plot of the data it looks like the data are all less than one. If that’s structurally guaranteed by the setup, then you could experiment with a zero-inflated beta. Finally, you could experiment with distributional regression, for example by using predictors for sigma in a gaussian regression. However, if the data are guaranteed to be nonnegative, this wouldn’t be my first approach.

I haven’t given you much information about the variable itself, sorry.
The values of this response variable lie between 0 and 1 (shown below: 51 populations from 21 different primate species, with several populations for some species). Only 3 of the values are exactly 0.

Data$Nestedness

[1] 0.34951856 0.25666667 0.28632479 0.46828704 0.53333333 0.56378601 0.79343434 0.05555556 0.36190476
[10] 0.35119048 0.19753086 0.28571429 0.73015873 0.50610777 0.56628093 0.56438345 0.51700680 0.49106481
[19] 0.79828188 0.54916450 0.16896825 0.23807462 0.70555556 0.55962963 0.74711911 0.61085175 0.53331349
[28] 0.73518518 0.35218651 0.67286109 0.53333333 0.25915751 0.08881674 0.68780296 0.37000000 0.80000000
[37] 0.35867347 0.75402237 0.00000000 0.47272727 0.00000000 0.55555556 0.00000000 0.53333333 0.71428571
[46] 0.70740741 0.64774733 0.30000000 0.43939394 0.41851852 0.26923077

As you advised, I tested a zero-inflated beta distribution and, as you anticipated, when I compare my models with each other again I obtain an elpd_diff of 0 between the models.
Since this type of distribution is mainly useful when there are many zeros in the data, which is not really my case, how can I check that this kind of model is well suited to my data?
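Would posterior predictive checks be the right tool for this? Something along these lines is what I had in mind (using fit_zib as a placeholder name for the zero-inflated beta fit):

pp_check(fit_zib, ndraws = 100)                  # overlay simulated vs observed densities
pp_check(fit_zib, type = "stat", stat = "mean")  # or compare a summary statistic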

I apologise for all my questions so far, but I have to admit that I’m fairly new to this type of analysis.