Advice for modeling a dependent variable with a ceiling effect in brms

Hi,

As a long-time user of frequentist stats, I am trying to convert to Bayesian modeling, so I am quite new to the Bayesian approach and, therefore, to Stan. So I apologize if my question comes off as naive.

I am trying to model data with a roughly bimodal distribution: motor performance after stroke. Many of the stroke motor scales have a fairly strong ceiling effect, like below:


The model has three independent variables and one random effect. The formula is:
motor_score ~ lesion_load_z + lesion_side + lesion_volume_z + (1|site)

Motor score is scaled between 0 and 1, lesion load and lesion volume are z-scored and lesion side and site are categorical. Lesion side has 2 levels and site has 21 levels. I have 242 samples total.

The first approach that I tried is using a Beta glm with:

mod1 = brm(NORMED_MOTOR ~ Total_Percsub_Cramer_z + LesionSide + Lesion_Volume_z + (1|SITE), data=reduced_modeling_df, family=Beta(link="logit"), iter = 6000, chains = 6)

This model converges fine, but the fit isn’t great:

The second approach I tried is scaling (i.e., z-scoring) the outcome and then fitting a Gaussian mixture:

mix = mixture(gaussian, gaussian)
mod2 = brm(NORMED_MOTOR_z ~ Total_Percsub_Cramer_z + LesionSide + Lesion_Volume_z + (1|SITE), data=reduced_modeling_df, iter = 6000, chains = 6, family=mix)

This model has trouble converging, producing the following warnings:
1: There were 150 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
2: There were 180 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 3.25, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.


And the pp_check looks really weird, as the model is predicting outlandish values:
![gaus_mix_fit|690x490](upload://6JCm8IqO6yNECaaQxjWvFhVzgYB.png)



My intuition is that the beta distribution is the right approach, but I am not sure how to intelligently change the parameters of the model to improve the fit. I have tried adjusting priors, but nothing so far has had much effect. Are there parameters I can adjust to get the fit a bit closer? Any advice is very much appreciated!

Thanks!

Hello @willi3by , at face value the beta response model seems like a sensible choice here.

I presume that lesion load and lesion volume are approximately continuous, as you’ve described them as z-scores. Your model assumes that their effects are linear (straight lines) on the logit scale. That is a potentially strong assumption; is it warranted? I’d make some plots of model predictions against the data to visualize the form of the associations. Depending on what you see, you may then want to look at more flexible models of these effects, such as splines.

Hi @AWoodward,

Thanks so much for your reply! You are right: when I plot the predictions against these predictors, the associations, especially for lesion load, are nonlinear. I fit the model with the adjustments below:

mod2 <- brm(NORMED_MOTOR ~ s(Total_Percsub_Cramer_z, bs="cr", k=10) + s(Lesion_Volume_z, bs="cr", k=20) + (1|SITE), data=reduced_modeling_df, family=Beta(link="logit"), iter = 10000, chains = 6)

And got an improved fit but still not great:

I also notice that, when I plot the density of the predicted values, it looks like a “condensed” version of the density of the outcome (black = predicted, red = observed):

I think the different peaks are actually somewhat captured, but in the wrong locations and not spread over the whole distribution. The range of the predicted values is (0.31, 0.84), whereas the range of the true values is (0.01, 0.99). Is there a way to adjust the beta prior for the outcome to get a better “spread” over the domain and allow predictions over the whole range?

Please let me know if I can provide any more information and thanks again for the help!

I don’t think the prior for phi will be the problem here. I believe its default is quite weak (but you could do some form of prior predictive check to confirm this). Here, what I would try to do is express, in the most intuitive form possible, what it is that the model is not capturing. I would try:

  • residuals vs. observed and fitted response value
  • conditional effects style plots with posterior predictions (prediction intervals) across each of the covariates

I’d say the first thing to rule out is that the model for the location is misspecified (i.e., not flexible enough), which seems to be OK here. Then you can propose a more flexible observation model to capture whatever the conditional patterns are.
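For instance, one way to make the observation model more flexible in brms is distributional regression: modeling the precision parameter phi as a function of covariates as well, via bf(). A minimal sketch, with the formula and variable names assumed from your earlier posts:

```r
library(brms)

# Sketch only: let the beta precision phi vary with lesion load, so the
# spread of the response distribution can differ across its range.
# Data frame and variable names are taken from the earlier posts.
f <- bf(
  NORMED_MOTOR ~ s(Total_Percsub_Cramer_z) + LesionSide + Lesion_Volume_z + (1 | SITE),
  phi ~ s(Total_Percsub_Cramer_z)
)
mod_phi <- brm(f, data = reduced_modeling_df, family = Beta(link = "logit"))
```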

Looks like you have a meaningful number of observations at 0.01 and 0.99? That might also be a conceptual problem for this model.

Thanks again @AWoodward! Below are the results of these checks:

Residual vs Observed:

Residual vs Fitted:

Conditional Effects (pairs plot, total_percsub variable is lesion load measure):

The fit is steadily improving with each modification. I also set the prior on phi to exponential(1) with lb = 0, which seemed to help a bit as well:

To answer your last question, there are 4/242 observations at 0.01 and 25/242 at 0.99. I am still modifying the model a bit and open to any more advice you may have!
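For reference, in brms that prior can be written roughly like this (a sketch, not necessarily the exact call I used; note that phi is already constrained to be positive by default, matching the lb = 0 above):

```r
library(brms)

# Sketch: an exponential(1) prior on the beta precision parameter phi.
# Passed to the earlier brm() call via the `prior` argument.
pr <- prior(exponential(1), class = "phi")
# mod <- brm(..., prior = pr)
```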

Thanks again

In general I would use a semiparametric ordinal model such as the proportional odds model for such problems. Ordinal models easily handle bimodality, floor and ceiling effects, and detection limits.

The brms and rmsb packages will fit such models. The blrm function in the R rmsb package is more tuned for continuous Y and provides functions that make it easy to translate from the logit scale to the original Y scale, e.g., the Mean and Quantile functions. Restricted cubic splines can be used to relax linearity assumptions for X. A detailed vignette is here. Note that no binning of Y is used when fitting an ordinal model to continuous responses.

2 Likes

@harrelfe Thanks very much, Frank! I have successfully fit the model, and all the diagnostics seem to indicate that the fit is better. But I am having trouble running the posterior predictive visualization. When I try to use Predict on the model, I get the following error:

Error in reformulate(attr(termobj, "term.labels")[-dropx], response = if (keep.response) termobj[[2L]],  : 
  'termlabels' must be a character vector of length at least one
In addition: Warning message:
Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead. 

My model code is:

dd <- datadist(reduced_modeling_df)
options(datadist="dd")
f <- blrm(NORMED_MOTOR ~ rcs(Total_Percsub_Cramer, 5) + rcs(Lesion_Volume, 5) + LesionSide + rcs(AGE, 5) + cluster(SITE), data=reduced_modeling_df)

I have searched for a while to find the source of the error but I can’t seem to find it. Any advice here?

As an aside, I follow your work closely and have great respect for your work and knowledge. So I greatly appreciate your time in responding to my query.

UPDATE:
I fit an ordinal regression with brms using the cumulative(logit) family and the results are excellent!

I would still like to use the blrm function given the advantages you have stated, but I am still having issues getting Predict to work.

This fit looks really good.

How were you able to fit a continuous outcome with the cumulative family?

@JLC I was able to fit this outcome by simply converting it to an ordered factor in R with as.ordered().
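Concretely, the conversion and fit looked something like this (a sketch, with the data frame and variable names assumed from my earlier posts):

```r
library(brms)

# Sketch: convert the continuous motor score to an ordered factor, then
# fit a cumulative-logit (proportional odds) model. Each distinct score
# value becomes its own ordinal category.
reduced_modeling_df$motor_ord <- as.ordered(reduced_modeling_df$NORMED_MOTOR)
mod_ord <- brm(
  motor_ord ~ Total_Percsub_Cramer_z + LesionSide + Lesion_Volume_z + (1 | SITE),
  data = reduced_modeling_df, family = cumulative("logit")
)
```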

I believe this is the correct way to handle this situation, but ideally I would like to get the blrm modeling working as it probably handles conversion from ordered to continuous space more elegantly/efficiently.

Also, I’m sure most already know about this paper from Dr. Harrell’s group, but it is an excellent read:

1 Like

Please distill the problem to a very simple one using simulated data such that I can run your code and debug. Thanks.

You just have more intercepts when Y is continuous. In the rmsb package, execution time is roughly linear in the number of distinct Y values. An advantage of rmsb is that, as the number of categories increases, if you do not specify priors the posterior mode is very similar to the maximum likelihood estimate. I don’t think that is the case with brms.

1 Like

@harrelfe Sorry about that! Below is simulation code to reproduce the problem:

df <- data.frame(x = rnorm(100), y = rnorm(100))
dd = datadist(df)
options(datadist="dd")
f <- blrm(y~x, data = df)
Predict(f)

Error in reformulate(attr(termobj, "term.labels")[-dropx], response = if (keep.response) termobj[[2L]],  : 
  'termlabels' must be a character vector of length at least one

Environment info is:

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

rmsb_1.1-0  rms_6.8-1   Hmisc_5.1-0

Thanks for looking into it!

It worked for me with a version of rmsb that has a little bug fix since I put it on CRAN. Try installing rmsb-mac.tgz from Directory Tree and see if that works for you. Some installation instructions are here.

Thanks @harrelfe! I am having issues with the install. I ran the following command as stated in the instructions:

install.packages('https://hbiostat.org/R/bin/rmsb-mac.tgz', repos=NULL)

I am using a Mac with an M1 chip. Not sure if that makes a difference. I also tried downloading the tar.gz file onto my Desktop and installing from there but got the same error message. Is there something else I could try? Sorry if I am missing something obvious.

UPDATE: I tried two other solutions without success. First, I downloaded the package from the Directory Tree to my Desktop and tried installing with:

install.packages("rmsb-mac.tgz", type='source', repos=NULL)

Which gave the error:

Error in rawToChar(info) : 
  embedded nul in string: '30 mtime=1716759731.927280433\n57 LIBARCHIVE.xattr.com.apple.provenance=AQAAZflYdkuiqFw\n49 SCHILY.xattr.com.apple.provenance=\001\0\0e\xf9XvK\xa2\xa8\\\n'

Second, I tried downloading the latest version from Github, opening the Rproj, and loading/installing from there. The blrm function works but I get the same error I mentioned in my previous post when trying to call Predict:

Error in reformulate(attr(termobj, "term.labels")[-dropx], response = if (keep.response) termobj[[2L]],  : 
  'termlabels' must be a character vector of length at least one

Thanks again

I have been unable to find a fix for the install.packages issue on Mac, which I can duplicate. This is a fix:

cd /tmp
wget https://hbiostat.org/R/bin/rmsb-mac.tgz
sudo R CMD INSTALL rmsb-mac.tgz

As for the rmsb bug, I have been unable to duplicate it with the latest rms and rmsb package versions. Please run the failing test code in a new session to make sure no package conflicts are present, and include session info in that run. Also include the output of search() after the last library or require command is run.

1 Like

Thanks very much for your help @harrelfe! I updated R and RStudio, installed with the commands that you noted, and now everything is working as expected! I really appreciate you taking the time to help troubleshoot and your advice for modeling.

1 Like

Good! I suspect there was a local copy of a package that conflicted with a central system copy. I always set my .Rprofile to put all packages I install in /usr/local/rpackages.