Beta regression with bounded predictor

I’d like to model data that has a bounded outcome variable and a bounded predictor. These are both “factor scores” that I would like to present as percentiles (for easier interpretation). Since percentiles are bounded by 0 and 1, I modelled the outcome using a zero-inflated beta distribution. However, the predictions imply that the predictor can go past 0 or 1.

df <-
  data.frame(
    predictor = rnorm(1000),
    outcome = rnorm(1000)
  )

transform_ecdf_x <- \(x) ecdf(df$predictor)(x)
transform_ecdf_y <- \(x) ecdf(df$outcome)(x)

df$predictor <- transform_ecdf_x(df$predictor)
df$outcome <- transform_ecdf_y(df$outcome)

model <-
  brms::brm(
    outcome ~ predictor,
    family = brms::zero_one_inflated_beta(),
    backend = "cmdstanr",
    data = df
  )

model |>
  marginaleffects::plot_predictions(condition = "predictor")

Created on 2024-11-12 with reprex v2.1.0

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.7.1
#>  system   aarch64, darwin23.0.0
#>  ui       unknown
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Sydney
#>  date     2024-11-12
#>  pandoc   2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package         * version     date (UTC) lib source
#>    abind             1.4-8       2024-09-12 [1] CRAN (R 4.3.2)
#>    backports         1.5.0       2024-05-23 [1] CRAN (R 4.3.2)
#>    bayesplot         1.11.1.9000 2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>  P bridgesampling    1.1-2       2021-04-16 [?] CRAN (R 4.3.2)
#>    brms              2.22.2      2024-10-24 [1] Github (paul-buerkner/brms@f024203)
#>  P Brobdingnag       1.2-9       2022-10-19 [?] CRAN (R 4.3.2)
#>    checkmate         2.3.2       2024-07-29 [1] CRAN (R 4.3.2)
#>    cli               3.6.3       2024-06-21 [1] CRAN (R 4.3.2)
#>    cmdstanr          0.8.1       2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>    coda              0.19-4.1    2024-01-31 [1] CRAN (R 4.3.2)
#>  P codetools         0.2-19      2023-02-01 [?] CRAN (R 4.3.2)
#>  P collapse          1.9.5       2023-04-07 [?] CRAN (R 4.3.2)
#>    colorspace        2.1-1       2024-07-26 [1] CRAN (R 4.3.2)
#>    curl              5.2.3       2024-09-20 [1] CRAN (R 4.3.2)
#>    data.table        1.16.2      2024-10-10 [1] CRAN (R 4.3.2)
#>    digest            0.6.37      2024-08-19 [1] CRAN (R 4.3.2)
#>    distributional    0.5.0       2024-09-17 [1] CRAN (R 4.3.2)
#>    dplyr             1.1.4       2023-11-17 [1] CRAN (R 4.3.2)
#>  P evaluate          0.21        2023-05-05 [?] CRAN (R 4.3.2)
#>    fansi             1.0.6       2023-12-08 [1] CRAN (R 4.3.2)
#>    farver            2.1.2       2024-05-13 [1] CRAN (R 4.3.2)
#>  P fastmap           1.1.1       2023-02-24 [?] CRAN (R 4.3.2)
#>  P fs                1.6.2       2023-04-25 [?] CRAN (R 4.3.2)
#>  P generics          0.1.3       2022-07-05 [?] CRAN (R 4.3.2)
#>    ggplot2           3.5.1       2024-04-23 [1] CRAN (R 4.3.2)
#>    glue              1.8.0       2024-09-30 [1] CRAN (R 4.3.2)
#>  P gridExtra         2.3         2017-09-09 [?] CRAN (R 4.3.2)
#>    gtable            0.3.6       2024-10-25 [1] CRAN (R 4.3.2)
#>  P highr             0.10        2022-12-22 [?] CRAN (R 4.3.2)
#>  P htmltools         0.5.5       2023-03-23 [?] CRAN (R 4.3.2)
#>  P inline            0.3.19      2021-05-31 [?] CRAN (R 4.3.2)
#>    insight           0.20.5      2024-10-02 [1] CRAN (R 4.3.2)
#>    jsonlite          1.8.9       2024-09-20 [1] CRAN (R 4.3.2)
#>  P knitr             1.43        2023-05-25 [?] CRAN (R 4.3.2)
#>  P labeling          0.4.3       2023-08-29 [?] CRAN (R 4.3.2)
#>  P lattice           0.21-8      2023-04-05 [?] CRAN (R 4.3.2)
#>    lifecycle         1.0.4       2023-11-07 [1] CRAN (R 4.3.2)
#>    loo               2.8.0.9000  2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>  P magrittr          2.0.3       2022-03-30 [?] CRAN (R 4.3.2)
#>  P marginaleffects   0.23.0      2024-10-05 [?] CRAN (R 4.3.2)
#>  P Matrix            1.5-4       2023-04-04 [?] CRAN (R 4.3.2)
#>    matrixStats       1.4.1       2024-09-08 [1] CRAN (R 4.3.2)
#>    mvtnorm           1.3-1       2024-09-03 [1] CRAN (R 4.3.2)
#>  P nlme              3.1-162     2023-01-31 [?] CRAN (R 4.3.2)
#>  P pillar            1.9.0       2023-03-22 [?] CRAN (R 4.3.2)
#>    pkgbuild          1.4.4       2024-03-17 [1] CRAN (R 4.3.2)
#>  P pkgconfig         2.0.3       2019-09-22 [?] CRAN (R 4.3.2)
#>    posterior         1.6.0       2024-07-03 [1] CRAN (R 4.3.2)
#>    processx          3.8.4       2024-03-16 [1] CRAN (R 4.3.2)
#>    ps                1.8.0       2024-09-12 [1] CRAN (R 4.3.2)
#>    purrr             1.0.2       2023-08-10 [1] CRAN (R 4.3.2)
#>  P QuickJSR          1.0.7       2023-10-15 [?] CRAN (R 4.3.2)
#>  P R.cache           0.16.0      2022-07-21 [?] RSPM (R 4.3.2)
#>  P R.methodsS3       1.8.2       2022-06-13 [?] RSPM (R 4.3.2)
#>  P R.oo              1.26.0      2024-01-24 [?] RSPM (R 4.3.2)
#>  P R.utils           2.12.3      2023-11-18 [?] RSPM (R 4.3.2)
#>  P R6                2.5.1       2021-08-19 [?] CRAN (R 4.3.2)
#>  P RColorBrewer      1.1-3       2022-04-03 [?] CRAN (R 4.3.2)
#>    Rcpp              1.0.13      2024-07-17 [1] CRAN (R 4.3.2)
#>  P RcppParallel      5.1.7       2023-02-27 [?] CRAN (R 4.3.2)
#>  P reprex            2.1.0       2024-01-11 [?] RSPM (R 4.3.2)
#>    rlang             1.1.4       2024-06-04 [1] CRAN (R 4.3.2)
#>  P rmarkdown         2.22        2023-06-01 [?] CRAN (R 4.3.2)
#>    rstan           * 2.32.5      2024-01-10 [2] CRAN (R 4.3.2)
#>    rstantools        2.4.0.9000  2024-10-24 [1] https://stan-dev.r-universe.dev (R 4.3.2)
#>    scales            1.3.0.9000  2024-10-30 [1] Github (r-lib/scales@4a95f27)
#>    sessioninfo       1.2.2       2021-12-06 [2] CRAN (R 4.3.2)
#>    StanHeaders     * 2.32.5      2024-01-10 [2] CRAN (R 4.3.2)
#>    stringi           1.8.4       2024-05-06 [1] CRAN (R 4.3.2)
#>    stringr           1.5.1       2023-11-14 [1] CRAN (R 4.3.2)
#>  P styler            1.10.2      2023-08-29 [?] RSPM (R 4.3.2)
#>    tensorA           0.36.2.1    2023-12-13 [1] CRAN (R 4.3.2)
#>  P tibble            3.2.1       2023-03-20 [?] CRAN (R 4.3.2)
#>    tidyselect        1.2.1       2024-03-11 [1] CRAN (R 4.3.2)
#>  P utf8              1.2.4       2023-10-22 [?] CRAN (R 4.3.2)
#>  P V8                4.3.0       2023-04-08 [?] CRAN (R 4.3.2)
#>    vctrs             0.6.5       2023-12-01 [1] CRAN (R 4.3.2)
#>    withr             3.0.2       2024-10-28 [1] CRAN (R 4.3.2)
#>  P xfun              0.39        2023-04-20 [?] CRAN (R 4.3.2)
#>  P xml2              1.3.3       2021-11-30 [?] CRAN (R 4.3.2)
#>  P yaml              2.3.7       2023-01-23 [?] CRAN (R 4.3.2)
#> 
#>  [1] /Users/sdekel002/Library/CloudStorage/OneDrive-pwc/Documents/analysis/ceo-28/renv/library/R-4.3/aarch64-apple-darwin23.0.0
#>  [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

I’m confused. I don’t see anything plotted past 0 or 1 on either axis. Can you maybe highlight what your problem is with the plot exactly?

2 Likes

Right, but whereas for the outcome variable, values beyond 0-1 are impossible, I can easily get predictions for predictor values beyond 0-1. For instance:

model |>
  marginaleffects::predictions(
    newdata = marginaleffects::datagrid(predictor = 2)
  )
#> 
#>  predictor Estimate 2.5 % 97.5 %
#>          2    0.489 0.402  0.575
#> 
#> Type:  response 
#> Columns: rowid, estimate, conf.low, conf.high, predictor, outcome

Oh, I see. So your outcome variable is bounded between zero and one, so you used zero-one-inflated Beta regression to capture this property in your model. No matter what values of predictors you pass, your outcome variable will always be bounded.

In contrast, these models (and most models that I am familiar with) don’t make any assumptions about the distribution of your predictor variables. In this case your predictor is also bounded between zero and one, but the model doesn’t know that. And, since you are fitting straight lines in the model space, you can ask your model what the value of the outcome variable would be if the predictor had a value of two.

Below I code up a quick example of predicting race time in seconds by a person’s height in centimeters. Even though there is no height data that is negative, and theoretically we know that you can’t have a negative height, we can still predict what someone’s racetime might be if they had a height of negative forty centimeters. This is why the plot_predictions() function only visualizes the regression line for where you actually have data. There might be some instances where it makes sense to extrapolate your regression beyond where you presently have data, but most of the time I would be hesitant to assume the linear trend extends beyond where the model was informed by data.

library(marginaleffects)

set.seed(123)
height_cm <- rnorm(30, mean = 175, sd = 10)
time_s <- rnorm(30, mean = 30 + (height_cm*-0.1), sd = 1)

df <- data.frame(height_cm, time_s)

model <- lm(time_s ~ height_cm, data = df)

predictions(model,
            newdata = datagrid(height_cm = -40))

 height_cm Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
       -40     37.1       3.41 10.9   <0.001 88.9  30.4   43.7

Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, time_s, height_cm 
Type:  response 

The only way I know of to let the model know the predictor is bounded is when you use monotonic variables to enter ordinal predictor variables into the regression.

2 Likes

Thanks for that. I’d be grateful if you (or anyone else stumbling upon this) have any resources that touch on this question.

Honestly, it just comes down to learning the mathematical form of the model you are using. When you do that, it’s easy to see why you can plug in any value for a predictor you want and get an answer (whether or not it is a reasonable answer is another story).

I’d recommend working through Statistical Rethinking. You can link the accompanying lectures and course materials here.

McElreath, R. (2018). Statistical rethinking: A Bayesian course with examples in R and Stan . Chapman and Hall/CRC.

1 Like

This is the way.