Beta regression with bounded predictor

shirdekel · November 12, 2024, 4:57am

I’d like to model data that has a bounded outcome variable and a bounded predictor. These are both “factor scores” that I would like to present as percentiles (for easier interpretation). Since percentiles are bounded by 0 and 1, I modelled the outcome using a zero-inflated beta distribution. However, the predictions imply that the predictor can go past 0 or 1.

df <-
  data.frame(
    predictor = rnorm(1000),
    outcome = rnorm(1000)
  )

transform_ecdf_x <- \(x) ecdf(df$predictor)(x)
transform_ecdf_y <- \(x) ecdf(df$outcome)(x)

df$predictor <- transform_ecdf_x(df$predictor)
df$outcome <- transform_ecdf_y(df$outcome)

model <-
  brms::brm(
    outcome ~ predictor,
    family = brms::zero_one_inflated_beta(),
    backend = "cmdstanr",
    data = df
  )

model |>
  marginaleffects::plot_predictions(condition = "predictor")

^{Created on 2024-11-12 with reprex v2.1.0}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.7.1
#>  system   aarch64, darwin23.0.0
#>  ui       unknown
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Sydney
#>  date     2024-11-12
#>  pandoc   2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package         * version     date (UTC) lib source
#>    abind             1.4-8       2024-09-12 [1] CRAN (R 4.3.2)
#>    backports         1.5.0       2024-05-23 [1] CRAN (R 4.3.2)
#>    bayesplot         1.11.1.9000 2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>  P bridgesampling    1.1-2       2021-04-16 [?] CRAN (R 4.3.2)
#>    brms              2.22.2      2024-10-24 [1] Github (paul-buerkner/brms@f024203)
#>  P Brobdingnag       1.2-9       2022-10-19 [?] CRAN (R 4.3.2)
#>    checkmate         2.3.2       2024-07-29 [1] CRAN (R 4.3.2)
#>    cli               3.6.3       2024-06-21 [1] CRAN (R 4.3.2)
#>    cmdstanr          0.8.1       2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>    coda              0.19-4.1    2024-01-31 [1] CRAN (R 4.3.2)
#>  P codetools         0.2-19      2023-02-01 [?] CRAN (R 4.3.2)
#>  P collapse          1.9.5       2023-04-07 [?] CRAN (R 4.3.2)
#>    colorspace        2.1-1       2024-07-26 [1] CRAN (R 4.3.2)
#>    curl              5.2.3       2024-09-20 [1] CRAN (R 4.3.2)
#>    data.table        1.16.2      2024-10-10 [1] CRAN (R 4.3.2)
#>    digest            0.6.37      2024-08-19 [1] CRAN (R 4.3.2)
#>    distributional    0.5.0       2024-09-17 [1] CRAN (R 4.3.2)
#>    dplyr             1.1.4       2023-11-17 [1] CRAN (R 4.3.2)
#>  P evaluate          0.21        2023-05-05 [?] CRAN (R 4.3.2)
#>    fansi             1.0.6       2023-12-08 [1] CRAN (R 4.3.2)
#>    farver            2.1.2       2024-05-13 [1] CRAN (R 4.3.2)
#>  P fastmap           1.1.1       2023-02-24 [?] CRAN (R 4.3.2)
#>  P fs                1.6.2       2023-04-25 [?] CRAN (R 4.3.2)
#>  P generics          0.1.3       2022-07-05 [?] CRAN (R 4.3.2)
#>    ggplot2           3.5.1       2024-04-23 [1] CRAN (R 4.3.2)
#>    glue              1.8.0       2024-09-30 [1] CRAN (R 4.3.2)
#>  P gridExtra         2.3         2017-09-09 [?] CRAN (R 4.3.2)
#>    gtable            0.3.6       2024-10-25 [1] CRAN (R 4.3.2)
#>  P highr             0.10        2022-12-22 [?] CRAN (R 4.3.2)
#>  P htmltools         0.5.5       2023-03-23 [?] CRAN (R 4.3.2)
#>  P inline            0.3.19      2021-05-31 [?] CRAN (R 4.3.2)
#>    insight           0.20.5      2024-10-02 [1] CRAN (R 4.3.2)
#>    jsonlite          1.8.9       2024-09-20 [1] CRAN (R 4.3.2)
#>  P knitr             1.43        2023-05-25 [?] CRAN (R 4.3.2)
#>  P labeling          0.4.3       2023-08-29 [?] CRAN (R 4.3.2)
#>  P lattice           0.21-8      2023-04-05 [?] CRAN (R 4.3.2)
#>    lifecycle         1.0.4       2023-11-07 [1] CRAN (R 4.3.2)
#>    loo               2.8.0.9000  2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>  P magrittr          2.0.3       2022-03-30 [?] CRAN (R 4.3.2)
#>  P marginaleffects   0.23.0      2024-10-05 [?] CRAN (R 4.3.2)
#>  P Matrix            1.5-4       2023-04-04 [?] CRAN (R 4.3.2)
#>    matrixStats       1.4.1       2024-09-08 [1] CRAN (R 4.3.2)
#>    mvtnorm           1.3-1       2024-09-03 [1] CRAN (R 4.3.2)
#>  P nlme              3.1-162     2023-01-31 [?] CRAN (R 4.3.2)
#>  P pillar            1.9.0       2023-03-22 [?] CRAN (R 4.3.2)
#>    pkgbuild          1.4.4       2024-03-17 [1] CRAN (R 4.3.2)
#>  P pkgconfig         2.0.3       2019-09-22 [?] CRAN (R 4.3.2)
#>    posterior         1.6.0       2024-07-03 [1] CRAN (R 4.3.2)
#>    processx          3.8.4       2024-03-16 [1] CRAN (R 4.3.2)
#>    ps                1.8.0       2024-09-12 [1] CRAN (R 4.3.2)
#>    purrr             1.0.2       2023-08-10 [1] CRAN (R 4.3.2)
#>  P QuickJSR          1.0.7       2023-10-15 [?] CRAN (R 4.3.2)
#>  P R.cache           0.16.0      2022-07-21 [?] RSPM (R 4.3.2)
#>  P R.methodsS3       1.8.2       2022-06-13 [?] RSPM (R 4.3.2)
#>  P R.oo              1.26.0      2024-01-24 [?] RSPM (R 4.3.2)
#>  P R.utils           2.12.3      2023-11-18 [?] RSPM (R 4.3.2)
#>  P R6                2.5.1       2021-08-19 [?] CRAN (R 4.3.2)
#>  P RColorBrewer      1.1-3       2022-04-03 [?] CRAN (R 4.3.2)
#>    Rcpp              1.0.13      2024-07-17 [1] CRAN (R 4.3.2)
#>  P RcppParallel      5.1.7       2023-02-27 [?] CRAN (R 4.3.2)
#>  P reprex            2.1.0       2024-01-11 [?] RSPM (R 4.3.2)
#>    rlang             1.1.4       2024-06-04 [1] CRAN (R 4.3.2)
#>  P rmarkdown         2.22        2023-06-01 [?] CRAN (R 4.3.2)
#>    rstan           * 2.32.5      2024-01-10 [2] CRAN (R 4.3.2)
#>    rstantools        2.4.0.9000  2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>    scales            1.3.0.9000  2024-10-30 [1] Github (r-lib/scales@4a95f27)
#>    sessioninfo       1.2.2       2021-12-06 [2] CRAN (R 4.3.2)
#>    StanHeaders     * 2.32.5      2024-01-10 [2] CRAN (R 4.3.2)
#>    stringi           1.8.4       2024-05-06 [1] CRAN (R 4.3.2)
#>    stringr           1.5.1       2023-11-14 [1] CRAN (R 4.3.2)
#>  P styler            1.10.2      2023-08-29 [?] RSPM (R 4.3.2)
#>    tensorA           0.36.2.1    2023-12-13 [1] CRAN (R 4.3.2)
#>  P tibble            3.2.1       2023-03-20 [?] CRAN (R 4.3.2)
#>    tidyselect        1.2.1       2024-03-11 [1] CRAN (R 4.3.2)
#>  P utf8              1.2.4       2023-10-22 [?] CRAN (R 4.3.2)
#>  P V8                4.3.0       2023-04-08 [?] CRAN (R 4.3.2)
#>    vctrs             0.6.5       2023-12-01 [1] CRAN (R 4.3.2)
#>    withr             3.0.2       2024-10-28 [1] CRAN (R 4.3.2)
#>  P xfun              0.39        2023-04-20 [?] CRAN (R 4.3.2)
#>  P xml2              1.3.3       2021-11-30 [?] CRAN (R 4.3.2)
#>  P yaml              2.3.7       2023-01-23 [?] CRAN (R 4.3.2)
#> 
#>  [1] /Users/sdekel002/Library/CloudStorage/OneDrive-pwc/Documents/analysis/ceo-28/renv/library/R-4.3/aarch64-apple-darwin23.0.0
#>  [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

sjp · November 13, 2024, 8:14pm

I’m confused. I don’t see anything plotted past 0 or 1 on either axis. Can you maybe highlight what your problem is with the plot exactly?

shirdekel · December 2, 2024, 3:37am

Right, but whereas for the outcome variable, values beyond 0-1 are impossible, I can easily get predictions for predictor values beyond 0-1. For instance:

model |>
  marginaleffects::predictions(
    newdata = marginaleffects::datagrid(predictor = 2)
  )
#> 
#>  predictor Estimate 2.5 % 97.5 %
#>          2    0.489 0.402  0.575
#> 
#> Type:  response 
#> Columns: rowid, estimate, conf.low, conf.high, predictor, outcome

sjp · December 2, 2024, 8:40am

Oh, I see. So your outcome variable is bounded between zero and one, so you used zero-one-inflated Beta regression to capture this property in your model. No matter what values of predictors you pass, your outcome variable will always be bounded.

In contrast, these models (and most models that I am familiar with) don’t make any assumptions about the distribution of your predictor variables. In this case your predictor is also bounded between zero and one, but the model doesn’t know that. And, since you are fitting straight lines in the model space, you can ask your model what the value of the outcome variable would be if the predictor had a value of two.

Below I code up a quick example of predicting race time in seconds by a person’s height in centimeters. Even though there is no height data that is negative, and theoretically we know that you can’t have a negative height, we can still predict what someone’s racetime might be if they had a height of negative forty centimeters. This is why the plot_predictions() function only visualizes the regression line for where you actually have data. There might be some instances where it makes sense to extrapolate your regression beyond where you presently have data, but most of the time I would be hesitant to assume the linear trend extends beyond where the model was informed by data.

library(marginaleffects)

set.seed(123)
height_cm <- rnorm(30, mean = 175, sd = 10)
time_s <- rnorm(30, mean = 30 + (height_cm*-0.1), sd = 1)

df <- data.frame(height_cm, time_s)

model <- lm(time_s ~ height_cm, data = df)

predictions(model,
            newdata = datagrid(height_cm = -40))

 height_cm Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
       -40     37.1       3.41 10.9   <0.001 88.9  30.4   43.7

Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, time_s, height_cm 
Type:  response

The only way I know of to let the model know the predictor is bounded is when you use monotonic variables to enter ordinal predictor variables into the regression.

shirdekel · December 3, 2024, 1:16am

Thanks for that. I’d be grateful if you (or anyone else stumbling upon this) have any resources that touch on this question.

sjp · December 3, 2024, 10:21am

Honestly, it just comes down to learning the mathematical form of the model you are using. When you do that, it’s easy to see why you can plug in any value for a predictor you want and get an answer (whether or not it is a reasonable answer is another story).

I’d recommend working through Statistical Rethinking. You can link the accompanying lectures and course materials here.

McElreath, R. (2018). Statistical rethinking: A Bayesian course with examples in R and Stan . Chapman and Hall/CRC.

Solomon · December 3, 2024, 4:38pm

This is the way.

Topic		Replies	Views
Zero-one-inflated beta regression with brms not finishing the chains Modeling techniques , fitting-issues , mcmc , brms	5	914	October 18, 2023
Interpretation of the a zero-one inflated beta regression model estimated using brms General	4	1444	November 15, 2021
How to use a Beta distribution for Bayesian meta-analysis with Brms? brms brms , meta-analysis	0	411	April 2, 2023
Beta family distribution and posterior predictive checks with continuous outcome variable Modeling brms	4	236	May 10, 2024
Multilevel (mixed effects) model with zero-inflated beta family in brms brms rstan	3	2709	April 15, 2019

Beta regression with bounded predictor

Related topics