Model Selection in brms

I want to test variable importance with my model.

I have two predictors and I want to see which one is more influential.

My complete model is:

brm(num | trials(denom) ~ status + male_rivals + female_promiscuity + date + status:male_rivals + (1 | mm(ID1, ID2)))

Model Variables:
status is a categorical variable indicating male type (territorial or sneaker)
male_rivals is the number of rivals
female_promiscuity is how promiscuous the female is
date is a control variable
status:male_rivals is an interaction term

I took a bottom up approach:

Null model:
brm(num | trials(denom) ~ 1 + (1 | mm(ID1, ID2)))

Just status and control variable model:
brm(num | trials(denom) ~ status + date + (1 | mm(ID1, ID2)))

male_rivals and interaction model:

Full model:
brm(num | trials(denom) ~ status + male_rivals + female_promiscuity + date + status:male_rivals + (1 | mm(ID1, ID2)))

When I did the leave-one-out comparison, the best model was the one with male_rivals and the interaction term, which was significantly different from the null model and from the status + date model.

Adding female_promiscuity did not significantly improve the model.

With these results could I make the argument that male_rivals is more important than female_promiscuity as a predictor?

Is there another way besides WAIC to determine variable importance, perhaps some kind of variable-importance algorithm?


I’m not sure your post is specific enough to answer your Question 1.

  • Do you have a specific notion of “important” that you want to try to get inference on?
  • Do you want to distinguish between the “importance” of a main effect and the “importance” of an interaction, or do you want the “importance” of a variable to include its influence both via main effects and interactions?
  • Does the distribution of covariate values over which you are interested in “importance” match the distribution of covariate values that you have in your training data?

For question 2, the answer is definitely yes, but what to choose isn’t straightforward without a clearer notion of what “importance” means. Options range from cross validation to the magnitudes of the standardized slopes of the regression coefficients, and probably a bunch more.
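One of the options mentioned here, comparing the magnitudes of standardized slopes, can be sketched roughly as below. This is a hedged illustration, not the poster's actual code: the data frame `Data`, the variable names, and the `fit_z` object are assumed from the thread, and predictors are standardized before fitting so that the coefficient magnitudes are on a comparable scale.

```r
library(brms)

# Standardize the continuous predictors (mean 0, sd 1) so that the
# posterior slopes are directly comparable in magnitude
Data$male_rivals_z        <- as.numeric(scale(Data$male_rivals))
Data$female_promiscuity_z <- as.numeric(scale(Data$female_promiscuity))

fit_z <- brm(num | trials(denom) ~ status + male_rivals_z +
               female_promiscuity_z + date + status:male_rivals_z +
               (1 | mm(ID1, ID2)),
             data = Data, family = beta_binomial())

# Posterior summaries of the standardized slopes
posterior_summary(fit_z, variable = c("b_male_rivals_z",
                                      "b_female_promiscuity_z"))
```

Note that standardized slopes only capture main effects; with the interaction present, the "importance" of male_rivals also depends on status, which is one reason a clearer definition of importance is needed first.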

Question 1:
I want to get inference on which variable's effect size is most important, so I wanted to see if there is a variable-importance algorithm I could implement.

I want to first look at just the main effects, and then look at the most important predictor overall (including both main effects and interactions).

Question 2:
What would you suggest to do?

Q1: if you show the loo output I can say more

Q2: You could try projection predictive model selection with projpred package, which works with brms.
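A minimal sketch of how projection predictive selection with projpred typically looks, assuming `fit` is the full brmsfit from above (note: projpred supports many but not all brms model structures, so compatibility with the multimembership term should be checked):

```r
library(projpred)

refm <- get_refmodel(fit)   # build the reference model from the brms fit
vs   <- cv_varsel(refm)     # cross-validated forward search over terms

plot(vs, stats = "elpd")    # predictive performance vs. submodel size
suggest_size(vs)            # suggested number of terms to keep
summary(vs)                 # solution path: order in which terms enter
```

The order in which terms enter the solution path gives a predictive-performance-based notion of variable importance.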


status = male type (Resident or Bachelor)
male_degree = number of male rivals met on the previous day
Date = day of breeding season
non_focal = non-exclusivity; measures the female's promiscuity with other males

Intercept Model
model 12: bf(num | trials(denom) ~ status + male_degree + non_focal + poly(Date, 2) + status:male_degree + status:non_focal + (1 | mm(male_id, female_id)), phi ~ male_degree)

@avehtari Does this mean that including non_focal doesn't significantly improve the model, so male_degree is a better predictor of association strength compared to non_focal?

Would you follow this up with a k-fold?

Does the projpred package have something similar to a variable-importance algorithm?

Looking forward to your suggestions @avehtari

All the elpd differences are small (either with magnitude less than 4 or small compared to se_diff). Thus there is not much to say about differences. If you also show the loo output for model2 (which is now shown as the best) and Intercept (shown as the worst) I can further comment on the reliability of these comparisons.

Polynomials are in most cases a bad choice. The regularized thin plate spline, which is the default spline in brms, is much better, and you can use it with s(Date)
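A sketch of the suggested change, replacing the quadratic polynomial with the default thin plate spline (formula otherwise as posted in the thread; `Data` and the family are assumptions):

```r
fit_s <- brm(
  bf(num | trials(denom) ~ status + male_degree + non_focal + s(Date) +
       status:male_degree + status:non_focal +
       (1 | mm(male_id, female_id)),
     phi ~ male_degree),
  data = Data, family = beta_binomial(link = "logit")
)
```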


this looks like a term that can make all the models overfit, but if you show the loo output for the models I mentioned, I can say more.

What are typical values for num and trials(denom)?

As you have so few models, you probably don’t need projpred

Model Comparison
I ran my models again due to some high pareto k values, so this is now the model comparison output.

When I use s(Date) it gives me divergent-transition warnings after warmup, which is why I used the polynomial

This is to account for the dyadic dependencies between male and female interactions.

Here is the loo output for model 2:

Is there overfitting? How can you tell?

loo output for model 1

loo output for Intercept

I am really interested in how you get your interpretation from these outputs

Does the high se_diff just mean there is a lot of uncertainty, so I need more data?

In model 2 the predictor variables are significant, but this model is not significantly different from the other models. So can I still pick the model with the significant predictors?


num: mean=19.5, sd=24.1 range=1:179
denom: 288 for all values

288 is how many 5 minute intervals are in a day, so the data resolution is every 5 minutes

How many?

I did not suggest removing it, I just mentioned that it can be the part that makes efficient elpd computation more difficult, but I can say more based on the loo output for the individual models.

I look at the number of observations (226 is shown in the first line), the effective number of parameters, p_loo (15.8), and the number of high Pareto k's. These are also discussed in the loo package glossary and the loo cross-validation FAQ.

All Pareto k’s are good, so there are no highly influential individual observations, which is good. This indicates that loo computation is reliable and the term (1|mm(male_id,female_id)) is not causing loo computation problems.
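The diagnostics described here can be pulled directly from a loo object; a sketch, assuming `fit` is one of the fitted brmsfit objects:

```r
library(loo)

loo1 <- loo(fit, save_psis = TRUE)
print(loo1)                  # elpd_loo, p_loo, and the Pareto k summary table

loo1$estimates["p_loo", ]    # effective number of parameters (estimate, SE)

# Count observations with problematic Pareto k (> 0.7)
sum(loo1$diagnostics$pareto_k > 0.7)
```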

p_loo for all models is much smaller than the number of observations, which indicates that the models are not very flexible and there is no problematic overfitting.

The number of observations is not very big, and it seems you have quite noisy data, and I guess you are using very wide priors, which can explain the increase in p_loo without much increase in elpd_loo, so you are likely overfitting just a little.

More data helps. It might help to understand what is going on by checking whether there are some observations with a big difference in elpd. See, e.g. Fig 10 in
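A minimal sketch of that per-observation check, assuming `loo1` and `loo2` are loo objects for two of the models fitted on the same data:

```r
# Pointwise elpd difference between two models; large outliers indicate
# individual observations driving the comparison
elpd1 <- loo1$pointwise[, "elpd_loo"]
elpd2 <- loo2$pointwise[, "elpd_loo"]

plot(elpd1 - elpd2,
     xlab = "observation", ylab = "elpd difference",
     main = "Pointwise elpd: model 1 minus model 2")
abline(h = 0, lty = 2)
```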

You can also look at the posteriors of the quantities of interest directly. They can sometimes tell more than loo comparisons.
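A sketch of looking at the posteriors directly, assuming `fit` is one of the fitted brmsfit objects and using the coefficient names from the thread's models:

```r
library(bayesplot)

# Numerical summaries of the coefficients of interest
posterior_summary(fit, variable = c("b_male_degree", "b_non_focal"))

# Marginal posterior densities with 95% intervals
mcmc_areas(fit, pars = c("b_male_degree", "b_non_focal"), prob = 0.95)
```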

Based on the values model 2 is likely to be best, but the expected difference is small and there is clearly non-zero probability that some other model would be better for the future data from the same process.

Have you looked at posterior predictive checking and LOO-PIT calibration values (Figs 6-9 in


It was 4 divergent transitions, but when I increased the adapt_delta there were none, so it worked.

So if p_loo is lower than 226, then it is overfitting? I also read that if p_loo is lower than the number of model parameters, then it is problematic.

What are these?

For observed values I did num/denom.

How would I interpret this?


Model Mean

predictive distribution

I’m curious, why do you change my text you quote to images?

I wrote:

p_loo for all models is much smaller than the number of observations, which indicates that the models are not very flexible and there is no problematic overfitting.

Your p_loos were around 11 to 15, which are much smaller than 226, and thus there is no overfitting.

Where did you read that? It can be problematic if p_loo is larger than the number of model parameters. This is what is written, e.g., in the loo package glossary.

Are you asking what is posterior or what are the quantities you are interested in?

Can you tell what is shown in the figures? E.g. show the code you used to generate them? Especially the first figure seems to have something wrong. Others look fine.


I think I figured out how to do it without images, sorry.

Ok, so I read the loo package glossary, and since p_loo < N (number of observations) and p_loo < p (parameters of the model), my model is well behaved.

Model interpretation: My model is well behaved, but since it is not significantly different from my other models because of the large se_diff, there is a lot of uncertainty, either because of low sample size, uninformative default priors, etc. Is this correct?

How do you determine if p_loo << p (is much smaller than the number of parameters)?

Ok I must have misinterpreted what I saw before.

Yeah I am asking how can the quantities of interest tell me more, for example my male_degree posterior. What would I be looking for?

Figure code:

#First Figure
# (the brm() call is shown as posted; its formula argument was omitted)
My_Model <- brm(...,
                data = Data,
                iter = 4000,
                chains = 5,
                init = "random",
                cores = parallel::detectCores(),
                family = beta_binomial(link = "logit"),
                control = list(adapt_delta = 0.99, max_treedepth = 15))

y <- Data$num / Data$denom

loo1 <- loo(My_Model, save_psis = TRUE, cores = 4)
psis1 <- loo1$psis_object
lw <- weights(psis1) # normalized log weights

# yrep (posterior predictive draws) was not defined in the posted code
ppc_loo_pit_overlay(y, yrep, lw = lw)

#Other figures:


The se_diff’s are not big and the elpd_diff’s are not big. It is possible that with more data the differences would stay similarly small. A low sample size makes it more difficult to learn the parameters well from the data and to see differences between the models. Weak priors on parameters may make the models with more parameters perform worse than they would with priors that provide a consistent prior predictive distribution no matter how many parameters there are (although such priors exist, unfortunately there is not yet an easy way to add them for your models in brms). You could try with proper priors and then use the priorsense package to test prior and likelihood sensitivity.

There is no strict threshold, e.g. p_loo < 0.4*p

What is the range of values with the most of the posterior mass? Is the most of the mass on positive values or negative values? What is the probability that the quantity is positive/negative? What is the probability that the magnitude of the quantity is practically interesting? What is the interpretation of those values? Post a plot and tell about the quantity and I can help to interpret what we can read from the posterior plot
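The quantities listed here can all be computed from the posterior draws. A sketch, using the interaction coefficient named later in the thread; the 0.1 magnitude threshold is an arbitrary placeholder that should be replaced by a practically meaningful value:

```r
library(posterior)

draws <- as_draws_df(fit)
b <- draws$`b_statusResident:male_degree`

quantile(b, c(0.025, 0.975))  # range holding most of the posterior mass
mean(b < 0)                   # posterior probability the effect is negative
mean(abs(b) > 0.1)            # probability the magnitude exceeds a chosen
                              # practically interesting threshold
```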

Ah, LOO-PIT can fail for discrete data, although it seems to be failing in a way that shouldn’t happen.

As p_loo << n, these are also quite safe, especially the histogram one.


(I used a polynomial function instead of a spline because the spline was giving me 6 divergent transitions.)
The main reasons I am using a polynomial function are, first, that I expected the association rates to be similar at the beginning and end of my sampling period, and second, that when I fit the model without a polynomial function there is a very high std error for the intercept.

I am mainly interested in the interaction term “b_statusResident:male_degree”.
The probability that this interaction term is negative is 0.9986667, and as you can see in the output above, the credible interval for statusResident:male_degree doesn’t contain 0.

So what does this mean, and what would you suggest I do? Did I calculate it correctly?

So it seems you can infer with high probability the sign of statusResident, male_degree, statusResident:male_degree, and phi_male_degree. To trust these, you need to check that the model predictions are well calibrated; they were quite reasonable, although there was something strange with the LOO-PIT plot.
polyDate21 and polyDate22 are likely to be highly correlated (indicated by much wider marginals than for the other coefficients), and you might want to look at the joint marginal for those (e.g. with mcmc_scatter()).
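A sketch of that joint-marginal check, assuming `fit` is the fitted brmsfit and that brms names the polynomial coefficients b_polyDate21 and b_polyDate22 as in the thread:

```r
library(bayesplot)

# Joint posterior of the two polynomial coefficients; a tilted, elongated
# cloud would indicate strong posterior correlation between them
mcmc_scatter(fit, pars = c("b_polyDate21", "b_polyDate22"), alpha = 0.2)
```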

6 is not much, and might go away with a small increase in adapt_delta option

I didn’t understand this

This is likely due to one of the spline basis functions and intercept terms being collinear, but it shouldn’t affect the other parts.

That’s good. It’s sufficient to report this as 0.999 (see Bayesian workflow book - Digits)

If you can share the example, you can make an issue in bayesplot git repo, and it would eventually be fixed.

There is no need to tag the person you are replying to, they will get notified just because of the reply. Tagging is useful when you start a new thread or when you want to notify someone who wasn’t yet part of the discussion.

Here is the scatterplot. It doesn’t seem correlated.

I increased the adapt_delta and then compared it to the intercept model, and now I believe it is significantly different from the Intercept model, or almost.


This is why I originally did the polynomial function.

No, I was referring to when I add only Date by itself (without a spline or polynomial function); what would be the reasoning for this?

Also when I did the spline it took away almost all the std error.

Where would I share this? On the Stan forum for bayesplots?

Wait so do you think this is a coding error or model error?

So what exactly does the LOO-PIT plot show? It says it compares the density of the computed LOO probability integral transform versus 100 simulated data sets from a standard uniform distribution.

What does this tell me about my model?

(The y here is the numerator/denominator)

Also just wondering, based on the diagnostics, would you trust this model?

Great. Then the interpretation is easier.

Yes, barely.

Great that you also showed the data. It seems there is a lot of variation, and the polynomial is not doing much.

Date 0 being far from the range of dates in the data?

You can create an issue at Issues · stan-dev/bayesplot · GitHub
or if you prefer, you can share the example here in the discourse

As the current plot is showing garbage, we can’t make any interpretation about what it says about your model.

Based on what I have seen so far, I would trust that the data are informative about the parameters and that there is no bad model misspecification, but the additional predictive accuracy is low as the distribution has a long tail.

Is there a way to estimate relative variable importance from a single model fit? For example for a model like y ~ x1 * s(x2) + x3 I’d like to estimate the importance of x2 which would be the combined importance of x2 and of the x1:x2 interaction effect. I could imagine an index like corr(partial linear predictor from all terms involving x2, complete linear predictor) but that may not perform well with certain collinearities.

I posted a similar question at regression - Relative variable importance/explained variation from a single model fit - Cross Validated