Interpreting and presenting effects in multilevel beta regression

I have a model that predicts fault prediction performance using the weighting function, the lines of code the project has at the revision and a varying intercept for each project:
(I found it a lot easier to work with the 0 + ...notation and the models didn’t seem to differ in loo performance)

 formula = EXAM ~ 0 + Weight + LOC + (1|Project),
 prior = c(
   prior(normal(0,1), class=Intercept),
   prior(normal(0,0.5), class=b),
   prior(cauchy(0,0.5), class=sd),
   prior(gamma(0.1, 0.1), class=phi)

There are 3 different weighting functions and 23 projects. LOC is centered and scaled to sd.
The question I want to answer is which weighting function produces the lowest (best) EXAM score and the differences in performance between the three weighting functions.
Here is what I have so far:
Using stanplot(m3.1, type="areas", pars="b_Weight") I get this plot for the coefficients:

It seems pretty clear from this plot, that there is almost no difference between the three weighting functions.
I can also use hypothesis from brms to check if two coefficients are the same but that again is on the logit scale so maybe the difference on the output scale is different?

  1. In case the coefficients are “the same” on the logit scale, is that sufficient to declare that there is no relevant difference between them or do I always have to analyze on the outcome scale?

But for the sake of better understanding, let’s assume that there is a difference and I would like to put numbers on it and maybe make a recommendation based on it.
I can calculate the difference between two weighting functions from the posterior sample like this:

inv_logit_scaled(post$w1) - inv_logit_scaled(post$w2)

I could then do that for all three weighting functions, add some 95% lines and plot them like this eg:

We can read from this, that the median (dotted line) difference between W2 and W1 is positive, but the 95% percentiles have a big overlap with 0. So when one would have to choose, w1 might be the better choice but ultimately it kind of doesn’t matter.

Up to this point, I have ignored the loc and varying intercepts for the 23 different projects. I am unsure how to handle them.
As the differences on the outcome scale are not linear with the differences on the logit scale I have to include the entire model when calculating the differences. So for each project, over the span of all loc.
This seems like a lot of work and very hard to summarize in a way that a reader will understand easily.
I found this thread that tackles a similar problem but lacks the varying intercepts so I am not sure how to apply this to my case.
Do I take the mean of all project intercepts or is that mean maybe just 0 by design?

  1. How do I include the varying project intercepts in the “around the mean” analysis of differences in performance?

  2. How would one go about calculating and presenting the differences away from the mean?

Since nobody answered, I will give it a try.

One IMHO good way to interpret more complex models is via their predictions. You could for example use the model to plot counterfactual predictions - how the model would expect the outcome to change for each individual actual project when varying the weight function. For example, in the figure below, each line connects counterfactual predictions (vertical axis) from one posterior sample for a single individual as I change a single categorical predictor (horizontal axis), other predictors stay the same as in the original data:

The vertical bars are then 50 and 95% quantiles of the predictions. The color indicates whether the outcome has risen or decreased between the neighbouring levels of the categorical variables and the green points (which are a bit of a clutter) show average outcomes for individuals with the given predictor level across the original dataset.

Some similar notes in my earlier post:

Alternative approach is to predict for a new, unseen project (drawn from the same population of projects) - you just make the data for prediction include a new project ID/name and set allow_new_levels = TRUE. Then you can predict for various weighting schemes and LOC and summarise - this gives you a grid of estimates.

Finally a minor correction (but maybe you are already aware of this, so sorry if this is obvious):

Not really. The plot is consistent with a difference of ~0.3 - this might be non-negligible difference for many applications so saying there is no difference is not warranted. Maybe the parameters are highly correlated and if you plot the posterior of the differences it is more narrow, but this plot does not warrant the conclusion you stated. (later you show that the model is not consistent with a difference larger than 0.015 on the outcome scale which might plausibly be negligible).

Hope that helps.

Thank you for your answer.

How do you reach that conclusion? My way of interpreting it was:
All Means are very close and the 50% intervals have almost perfect overlap. Based on that, there seems to be no difference between the weighting functions. They however seem to have an impact on the outcome.
Although it is a little tricky because the model has no global intercept.
Do you disagree with my reasoning and if so, can you explain?

I’ll add a plot and reasoning for a model using a global intercept, just in case to show the difference:
Using the model EXAM ~ 1 + Weight + LOC + (1|Project) gives this for the mcmc_areas:

Now one weighting function is part of the global intercept and the other two are put in contrast to it. Here I would say that both means are below 0 so there seems to be a difference between the intercept weighting function and the two in the plot but with the big overlap with 0, I expect it to be rather small and inconsistent. It seems however, that the top one is slightly better than the bottom one.

Here is a brushed up version of the posterior predict plot from before.

I used the posterior_predict function with the respective subsets of my data, one for each weighting function, calculated the differences and plotted the densities. Here the mean effect on the outcome scale is almost perfectly 0. This could mean “no effect” but it also reaches around 0.09 in the 95% interval so I could also say "effects between -0.09 and 0.09 are reasonably probable, which feels like saying we don’t really know a lot besides it is close to 0 and mostly under 0.1
How would you interpret the effect on the outcome scale? And are there any “rules” for this? It feels a little hand wavy to me.

Assuming the marginal posterior distributions are independent (which they aren’t), it is not implausible that one coefficient lies in (say) above the 0.75 quantile and the other below the 0.25 quantile (this has a probability of 0.0625 assuming independence, so it is not a stretch). To take into account the dependence, you would need to plot the posterior of the difference, which might be both narrower (correlation is positive) or wider (correlation is negative).

It is not clear to me what you want to achieve by omitting intercept - at the very least, it is usually harder to justify priors for a model without an intercept. Also note that if you really want a coefficient for each weighting function, it might make sense to have your model as 1 + (1 | Weight), preserving the intercept.

That is IMHO a correct interpretation.

I believe your best bet is through prediction. don’t think there are any rules as different prediction tasks correspond to different questions. So taking differences predictions for your actual data tells you (roughly) what differences would you expect if you repeated the exact same experiment. Taking counterfactual predictions with changing the weights for your actual data (as I suggested above) tells you what would you expect if the effect was fully causal and you manipulated your original data. You can think of other prediction tasks (say a new unobserved projects or taking several specific LOC values of interest) and measuring differences there would tell you different things. You alone need to decide which question/prediction task is the most relevant for you which is probably why it feels hand-wavy.

Does that make sense?

1 Like

One more thing: I just notice your priors are quite likely getting in the way of your inference, at least in the model without intercept. You have prior(normal(0,0.5), class=b), but your posteriors are centered below -1, where the prior probability is < 0.05. In other words, the prior pushes the coefficients strongly towards zero and may thus dampen possible differences (as values further from zero are pushed much more strongly because they have much lower prior probability)

I hadn’t thought about this. Thank you for the pointer!

Initially it felt like it would make interpretation easier, as it gave me a coefficient for each weighting function. But I agree, that a varying intercept would be the better solution. I ended up just using an intercept. Growing pains of understanding how to interpret models I would call it.

I think it does :)
Does the counterfactual plot you used earlier have a name (aka is that a “standard” thing) or could you maybe share the code for it?

That is due to the formula being the wrong one… The code actually is for a model with an intercept and I just changed the 1+… to 0+… instead of using the code for the model without an intercept. Good catch though :)

I don’t think it has a name - I totally made it up because I though it is a good way to summarize the data. Since you only have two factors entering the equation, it might be a bit more sensible to plot predictions over a 2D grid of those values. (in my case, there were dozens of predictors so this was not an option). Nevertheless, the code is below, but it is very much tailored to my specific case and not documented much, so not sure it will help you a lot:

effects_by_prediction <- function(fit, orig_data, var_name, var_values, var_labels = var_values,
                                  response_title, response_to_value_func, n_samples = 500, max_lines = 200, plot_type = "all") {

  data_for_prediction_all_values <- var_values %>% imap_dfr(function(var_value, var_value_index) {
    data_for_prediction <- orig_data
    data_for_prediction[, var_name] <- var_value
    data_for_prediction$var_value_index <- var_value_index

  prediction_array <- posterior_linpred(fit, newdata = data_for_prediction_all_values, nsamples = n_samples, transform = TRUE)
  prediction <- apply(prediction_array, MARGIN = c(1,2), FUN = response_to_value_func)

  predicted_data <- prediction %>% t() %>% as.tibble() %>%
    mutate(id = data_for_prediction_all_values$id,
           var_value = factor(data_for_prediction_all_values[[var_name]], levels = var_values, labels = var_labels),
           var_value_index = data_for_prediction_all_values$var_value_index) %>%
    gather("sample", "response", -id, -var_value, -var_value_index) %>%
    mutate(sample = factor(sample), id_sample = factor(paste(id, sample, sep = "__")))

  predicted_data_for_lines <- predicted_data %>%
    filter(id_sample %in% sample(unique(id_sample), max_lines))

  positive_label = "kladná"
  negative_label = "záporná"

  data_sign <- predicted_data_for_lines %>% filter(var_value_index < length(var_values)) %>%
    mutate(next_var_value_index = var_value_index + 1) %>%
    inner_join(predicted_data_for_lines %>% select(id_sample, response, var_value_index) %>% rename(next_response = response, next_var_value_index = var_value_index),
               by = c("id_sample" = "id_sample", "next_var_value_index" = "next_var_value_index")) %>%
    mutate(sign = if_else(response < next_response, positive_label,negative_label) %>%
             factor(levels = c(negative_label, positive_label))) %>%
    select(var_value_index, sign, id_sample)

  predicted_data_for_lines_sign <- predicted_data_for_lines %>%
    left_join(data_sign, by = c("var_value_index","id_sample")) %>%
    mutate(sign = if_else(, factor(positive_label, levels = c(negative_label,positive_label)),sign))

  orig_data$var_value = orig_data[[var_name]]
  if(is.numeric(orig_data$var_value)) {
    orig_data <- orig_data %>%
      rowwise() %>%
      mutate(var_value = var_values[which.min(abs(var_values-var_value))]) %>% #Find the nearest value

  orig_data_for_plot <- orig_data %>%
    mutate(var_value = factor(var_value, levels = var_values, labels = var_labels)) %>%
    group_by(var_value) %>%
    summarise(p_vetsi = mean(response_positive),
              pocet = length(response_positive))

  if(plot_type == "all") {
    sample_geom = geom_line(data = predicted_data_for_lines_sign,
                            mapping = aes(x = var_value, y = response, group = id_sample, color = sign),
                            inherit.aes = FALSE,
                            alpha = 0.2)
  } else {
    sample_geom = NULL

  if(plot_type == "all" || plot_type == "orig_estimate") {
    estimate_geom1 =  geom_linerange(aes(ymin = lower50, ymax = upper50), size = 2)
    estimate_geom2 =  geom_linerange(aes(ymin = lower, ymax = upper))
  } else {
    estimate_geom1 = NULL
    estimate_geom2 = NULL

  predicted_data %>%
    group_by(var_value) %>%
    summarise(Estimate = median(response),
              lower = quantile(response, 0.025),
              upper = quantile(response, 0.975),
              lower50 = quantile(response, 0.25),
              upper50 = quantile(response, 0.75)
    ) %>%
    ggplot(aes(x = var_value)) +
    sample_geom +
    geom_point(data = orig_data_for_plot, mapping = aes(y = p_vetsi, size = pocet), color = "#2ca25f",
               position = position_nudge(x = 0.05) ) +
    estimate_geom1 + estimate_geom2 +
    scale_color_discrete("Asociace", drop = FALSE) +
    scale_y_continuous(response_title, limits = c(0,1)) +
    scale_x_discrete(var_name) +
    scale_size_continuous(range = c(2,8)) +
    guides(color = guide_legend(override.aes = list(alpha=1))) +
    if(length(var_values) > 4 && max(nchar(var_labels) > 4)) {
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
    } else {


1 Like