Improving predictive accuracy of model

General Question: How can you improve a model after seeing (via the posterior predictive distribution), that it poorly recovers your data? Most of the recommendations I hear are to change the distribution of the likelihood. In this case (as I explain below) my y variable is on the proportion scale, so I think my choice of a beta distribution for y is vaild, especially considering how flexible the beta is.

Data/Model: I am fitting a multilevel beta regression model with brms (version 2.17.0) using continuous proportion data as an outcome that have been transformed (per Smithson & Verkuilen, 2006: y*(n - 1) + 0.5)/n to fit the open (0, 1) interval. 6% of the un-transformed data were zeros, and 0.6% were ones. The observations are the amount of daily energy derived from one species of plant by a foraging animal that generally eats many species plants throughout the day. Predictor variable x is a continuous variable ranging from about 0-14 (a metric on the percentage scale estimating forest productivity). The data are collected over about 15 years and all observations in a given month will have the same x variable. There are 114 id groups, which are individual animals. sex is a binary variable with 1 coded for females and the time_feed variable is the total amount of time spent feeding in a given day (scaled). Number of observations is over 5,700.

Here is my model structure:

beta_fit <- brm(formula = y ~ x + 
                         sex + 
                         I(time_feed * x) +
                         (x | id), 
       data = data,
       family ="beta",
       prior = c(prior(student_t(3, 0, 2.5), class = "Intercept"), 
                 prior(normal(0, 1), class = "b").
                 prior(gamma(0.01, 0.01, class = "phi")),
       chains = 4, 
       iter = 4000, 
       warmup = 1000,
       cores = 2, 
       seed = 1234,
       control = list(adapt_delta = 0.95))

The model converges, there are no alarming Pareto k values, but when looking at the posterior predictive distribution, it seems like the model predictions don’t recover the actual data very well, as seen here:

Seeking Advice: Any suggestions to getting a better fit? Or is it okay to move forward given that the general shape of the posterior predictive distribution follows my data, and that the direction and magnitude of the relationship are of most interest to me? Could my data just be too noisy? Here is a plot with 100 draws from the expectation of the posterior predictive distribution overlain on a scatterplot of the data (ignoring the interaction for now):

image

Thanks for your help!

You could also specify it differently. Why use I(time_feed * x) instead of just time_feed * x ? What about splines?

So you could also have used a zero-one-inflated beta model on the un-transformed data. This is a family in brms already. I’m not sure how different that would be, but you could try it.

Edit: for advice on using histograms in pp_check I realized I was looking at x in the plot. So deleted that.

If it’s multiple species of plant, why a beta regression rather than a Dirichlet regression? (I don’t even know if that’s supported by brms.)

That scatterplot is really showing how poor the linear relationship is between x and y, so you might not be able to do a lot better with a linear model. Does this look different for individual animals?

What doe the I(time_feed * x) do?

Did you really mean to include both a fixed x and a varying x | id (assuming that the latter means to have a slope that varies by id)?

I assume a global intercept is implicit, but with varying slopes, I’d have expected to see a varying intercept (plain id) term in regression formula is how I’m guessing that’d be coded.

Boundary values 0 and 1 are tricky as they’re usually considered out of support because they require exponentiating a 0 value (on the linear scale) or multiplying by log(0) on the log scale. Do you know how brms is finessing these so that they’re not out of support?

Thanks for your replies @Bob_Carpenter and @jd_c .

I have tried a few zoib models but am finding them quite clunky and hard to troubleshoot since they take so long to run. I could keep going with this but I am not super confident that it will help much. I got a similarly poor fit with a zoib model (albeit one with high r-hat values). I have come across “ordered beta regression” as a possible alternative that might be sleeker. As you mention, splines are also a possibility, but I have quite minimal experience with them.

As for the syntax for the interaction term: I reran it with the more conventional y ~ sex + time_feed * x + (x | id) but this didn’t seem to make too much of a difference either.

A Dirichlet regression does seem reasonable, but my focus is really on one particular species of plant. There are over 180 plant species in the diet so dirichlet might get messy. As far as I know this isn’t supported by brms either.

Perhaps I am thinking about this wrong, but my intention, to let slopes for x and intercepts vary by individual ‘id’. Maybe it is worth noting that even when trying a more simple model of just varying intercepts and one predictor y ~ x + (1|id) I still get a very similar posterior predictive distribution.

Here are the scatterplots of y and x for each individual.

As for the boundary issue, I wrote this function to squeeze the y into the open (0,1) interval, which was to my knowledge standard practice. With 0 or 1, I don’t think the model will run.

p_star <- function(proportion, n) {
    output <- (proportion*(n - 1) + 0.5)/n 
    return(output)
}

Thanks again. I would greatly appreciate any other feedback you have to offer.

Splines? Eye-balling your y vs x plot it seems it would be helpful

Thanks so much @avehtari

Per your and @jd_c 's suggestion, I gave splines a shot:

beta_spl <- brm(formula = y ~ s(x, k = 5) + (1 | id), 
       data = data,
       family ="beta",
       prior = c(prior(student_t(3, 0, 2.5), class = "Intercept"), 
                 prior(normal(0, 2), class = "b")),
       chains = 4, 
       iter = 4000, 
       warmup = 1000,
       cores = 2,  
       seed = 1234,
       control = list(adapt_delta = 0.95))

pp_check(beta_spl, type = "dens_overlay", ndraws = 80)

Unfortunately, it still doens’t look like there is much improvement in the predictive accuracy and the estimates at the high end of x look a little off.


I also got a zero-one-inflated beta working,

zoib <- brm(
  bf(
    y ~ x + (1 | id),  
    zoi ~ x + (1 | id),  
    coi ~ x + (1 | id) 
  ),
  family = zero_one_inflated_beta(), 
       data = data,
       prior = c(prior(student_t(3, 0, 2.5), class = "Intercept"), # default
                 prior(normal(0, 1), class = "b")),
       iter = 7000, 
       warmup = 2000,
       cores = 2,  # my comp has 2 cores
       seed = 1234)  

But still not much improvement (besides getting the 0s right)…

Could it be that the data just do not want to be fit to a beta distribution, or perhaps x and y are just not linearly related?

Instead of having the data as a proportion, is there a way to model y on its natural units (kilocalories) while including some sort of offset term (like total calories consumed in the day)?

On the other hand, perhaps the misfit isn’t too egregious and it would be safe to move forward, while being up front about the limitations of the model?

Thank you all for your generosity.

Will

Based on you second plot, spline is clearly modeling some non-linearity , but based on the first plot that is not yet affecting much. How did you choose k=5 which limits the fexibility of the spline? In theory wihtout that limit and that model, the spline should be able to almost perfectly.

I ran it again as :

beta_sp2 <- brm(formula = y ~ s(x) + (1 | id), 
       data = data,
       family ="beta",
       prior = c(prior(student_t(3, 0, 2.5), class = "Intercept"), 
                 prior(normal(0, 2), class = "b")),
       chains = 4, 
       iter = 4000, 
       warmup = 1000,
       cores = 2,  
       seed = 1234,
       control = list(adapt_delta = 0.97)) 
pp_check(beta_sp2, type = "dens_overlay", ndraws = 100)

and wound up with a very similar plot.
image

Is there something I am doing wrong?

Try specifying a higher k. Is x discrete? Are there only 14 values of x? If so, try k=14. Probably won’t help much, but I don’t think the default for k is a very flexible spline.

What exactly is the outcome? Does the outcome arise from kcal / kcal consumed? That’s the proportion that you are modeling? Is there some sort or rounding errors or something whereby the values of the outcome are inflated at certain values? If so, try viewing posterior predictive checks using histograms pp_check(beta_fit, type='hist')

I would try the spline with the zoib with a higher k and also look at pp_check with histograms.
Also, what happened to sex and time_feed in the model? You could also try a splines by sex for both like s(x, by=sex, k=140) + s(time_feed, by=sex, k=14) along with the varying intercept. You can even make the model more flexible. I always feel like this becomes a lot of curve fitting, so it always seems best to try to have some sort of generative scientific model in mind and then try to fit it, maybe using non-linear syntax in brms. Practically, I find that I rarely have everything I need to do this, but if you do, it would likely be best.

Thanks for offering some more suggestions @jd_c. To answer your questions: x is not discrete, but it might look that way because it was a metric calculated monthly. Every observation in a given month will have the same value of x. There are no rounding errors that are inflating certain values of y, which is the proportion of kcal derived from plant species of interest in a day/total kcal in a day.

I tried upping k to 20 in a regular beta model as well as a zoib and it still seems like there is still little improvement (plus it took 8 hours for the zoib to run). The rather poor fit is still evident in the density and histograms. (the following images are from the zoib model).



It seems like no matter how I look at it, the beta is a poor fit. Perhaps it would be worth asking again: Instead of modeling the data as a proportion, is there a way to model y on its natural units (kilocalories) while including some sort of offset term (like total calories consumed in the day)? Or on the other hand, is the fit not actually that bad and it would be acceptable to move forward, while being up front about the limitations of the model?

I’m truly perplexed as to why nothing is improving the fit.

kcal is a count, right? so you could use something like a Poisson or NB model and model the rate using brm(kcal | rate(tot_kcal_consumed) ~ with total kcal consumed as the denominator, perhaps? Like:

bf(kcal | rate(tot_kcal_consumed) ~ 1 + x + sex + time_feed + time_feed:sex + (x | id)) + poisson()

or same thing with negbinomial()

This is interesting, thanks for reporting these different results. Currently you were using a model assuming the same dispersion parameter phi everywhere. How about modeling the phi also as a function of x and id, which would add flexibility to the model
formula = bf(y ~ s(x) + (1 | id), phi ~ s(x) + (1 | id)))

I think with x + ... + x | id you get both a fixed effect for x and a random effect that varies by id. That is, the linear predictor for item n will be alpha * x[n] + beta[id[n]] * x[n] where you get a fixed effect slope alpha and a varying slope beta.

If you want varying slopes and varying intercepts, I think that’s be x | id + 1 | id. That’d give you alpha[id[n]] + beta[id[n]] * x[n].

If you just want a slope varying across IDs, that’d be x | id.

Someone else should verify who knows this regression better than me.

Plotting the raw data is a good first step if you can evaluate this.

In brms, (x|id) includes both varying intercepts and varying slopes, and is short for (1 + x | id)

Aha, I think this might be the issue. I reran the zoib with phi as a function of x and id and it is looking a bit better.

zoib_spl_phi <- brm(
  bf(
    y_untrans ~ s(x) + (1 | id),
    phi ~ s(x) + (1 | id),
    zoi ~ s(x) + (1 | id),  
    coi ~ s(x) + (1 | id) 
  ),
  family = zero_one_inflated_beta(), 
       data = data,
       prior = c(prior(student_t(3, 0, 2.5), class = "Intercept"), 
                 prior(normal(0, 1), class = "b")),
       iter = 7000, 
       warmup = 2000,
       cores = 2,  

       seed = 1234,
       save_pars = save_pars(all = TRUE),
  control = list(adapt_delta = 0.90))   


I did however get a warning that there were 5 divergent transitions (Rhat and ESS look ok though).

summary(zoib_spl_phi)

Smooth Terms: 
                Estimate Est.Error l-95% CI u-95% CI
sds(sX_1)         4.97      1.27     3.11     8.03
sds(phi_sX_1)     6.19      1.66     3.78    10.23
sds(zoi_sX_1)     5.61      2.15     2.35    10.74
sds(coi_sX_1)     2.00      1.82     0.08     6.63
                Rhat Bulk_ESS Tail_ESS
sds(sX_1)     1.00     5872     9832
sds(phi_sX_1) 1.00     8159    12065
sds(zoi_sX_1) 1.00     5884     7601
sds(coi_sX_1) 1.00    11520    10986

Group-Level Effects: 
~id (Number of levels: 114) 
                  Estimate Est.Error l-95% CI
sd(Intercept)         0.45      0.05     0.36
sd(phi_Intercept)     0.41      0.05     0.31
sd(zoi_Intercept)     0.97      0.16     0.68
sd(coi_Intercept)     3.34      0.80     2.07
                  u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)         0.56 1.00     5027     9084
sd(phi_Intercept)     0.53 1.00     5328     9540
sd(zoi_Intercept)     1.31 1.00     5372     9436
sd(coi_Intercept)     5.20 1.00     7727    12710

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI
Intercept        -0.69      0.06    -0.80    -0.58
phi_Intercept     1.09      0.06     0.98     1.20
zoi_Intercept    -3.19      0.17    -3.54    -2.88
coi_Intercept    -3.31      0.75    -4.98    -2.02
sX_1           -0.72      0.90    -2.49     1.03
phi_sX_1       -8.37      2.99   -14.49    -2.73
zoi_sX_1        0.62      4.70    -7.86    10.66
coi_sX_1      -10.99      7.39   -24.79     4.73
              Rhat Bulk_ESS Tail_ESS
Intercept     1.00     3280     6645
phi_Intercept 1.00     3897     7344
zoi_Intercept 1.00     6748    10405
coi_Intercept 1.00     9073    12244
sx_1        1.00    26854    15069
phi_sx_1    1.00    12709    14460
zoi_sx_1    1.00    10423    14116
coi_sx_1    1.00    11693    10298

I’m wondering if this has to do with the flat b priors. I’m not sure why there are 6 priors of class = “b” that were not assigned the normal(0,1) priors… Did I specify my priors wrong?

prior_summary(zoib_spl_phi) 

                prior     class      coef group resp
         normal(0, 1)         b                     
         normal(0, 1)         b    sX_1           
               (flat)         b                     
               (flat)         b    sX_1           
               (flat)         b                     
               (flat)         b    sX_1           
               (flat)         b                     
               (flat)         b    sX_1                  

Thanks so much for all of your help.

Great!

Given a large number of iterations and excellent ESS, these are not probably worrysome, but of course you can try increasing adapt_delta a bit or think more carefully of priors

And here you are thinking! Flat priors can be problematic, but unfortunately I don’t have enough experience with brms priors especially on splines.

1 Like

Cool. @will That doesn’t look bad.
You can also try adapt_delta closer to 1, like 0.99.
Are you sure you want normal(0, 1) priors on the spline coefficients? See here and particularly the link to the Tristan Mahr article for how splines work in brms.
Just curious, but did you ever try the count model with rate?

Thanks so much to all of you for taking the time to help me out. I really do appreciate it.

Setting adapt_delta to 0.98 did the trick and I am no longer getting the warning. I’m still not sure why it says there are flat beta priors, but I guess this isn’t an issue.

@jd_c Thanks for sharing the link. I don’t know what a better prior would be in this case. I am open to suggestions if you have any! As for the count models, they don’t work because my outcome values are not strictly integers, rather, they are the sums of calories eaten in a day, not counts. It is completely normal to have a value of something like 1,002.4kcal.

2 Likes

I’ve actually always left the flat priors on the spline coefficients but if wanted (other than brms default) set a prior on the ‘wiggliness’ of the spline by setting a prior on the sds() (in a similar way as setting a prior on the sd of varying slopes or intercepts).

Oh interesting. I may play around with the priors and if I find anything worth sharing I will let you know.