How can one restrict smooths fit to data subsets to observed predictor variable ranges within those data subsets?

Hello, I am currently using brms to analyze an observational dataset that records how far people walk per day. I am interested in modeling the relationship between age and distance walked per day, broken down by gender (treating gender as a factor and fitting a smooth for each gender). One feature of the data is that the age range differs between genders: male ages span 3 to 68, while female ages span 2 to 84. I have found that when I fit a model to the full dataset, using ‘by’ to request separate smooths for males and females, and then use marginal_effects to view the posterior, the estimated smooth for males extends past the observed range of male ages, and (as can be expected) those estimates are wildly uncertain and unrealistic. This makes me wonder: is there a way to set up the model formula such that, for each level of the ‘by’ factor, the smooth is only estimated across the observed range of the predictor within that level?

The data are also structured by repeated observations of individuals and study locations, which I model with random effects.

Here is the model formula that I am using currently

bf(distance_walked ~ gender + s(age, by = gender) + (1 | person) + (1 | location),
   shape ~ gender + s(age, by = gender) + (1 | person) + (1 | location))

I’d be happy to provide a plot of the marginal_effects plot that is making me nervous or other information, but since my question doesn’t seem to be dependent on the features of a particular dataset I’ll post this now.

  • Operating System: macOS Sierra 10.12.6
  • brms Version: 2.8.0

I’m gonna quote the brms docs on marginal_effects not to be sarcastic but because I’m not super familiar with this and I want to make sure I don’t say anything wrong :D.

When creating marginal_effects for a particular predictor (or interaction of two predictors), one
has to choose the values of all other predictors to condition on. By default, the mean is used
for continuous variables and the reference category is used for factors, but you may change these
values via argument conditions. This also has an implication for the points argument: In the
created plots, only those points will be shown that correspond to the factor levels actually used in
the conditioning, in order not to create the false impression of bad model fit, where it is just due
to conditioning on certain factor levels.

I think marginal effects is plotting predictions from either ‘fitted’ or ‘predict’. Not sure which by default, but the x-axis in the plot is whatever you ask for the marginal effects on.

But to get a prediction for everything else, you need values for all the other variables, so brms uses the average of continuous values and the reference categories for factors (whatever that means).

So if the x-axis is age here, the question is what factor got used for the male/female variable? From reading the docs it’s possible female got used if it’s the reference class? I think you can do ‘points = TRUE’ to see which points are being used on the x-axis.

I think you can use the ‘conditions’ variable to specify whichever of male/female you specifically want to look at.
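For instance, assuming a model fitted from the formula in the first post (the object name fit and the level labels "male"/"female" are my guesses, not from the original post), something like this should show each gender's smooth in its own panel:

```r
library(brms)

# Hypothetical: 'fit' is the model fitted from the formula above;
# "male"/"female" are assumed to be the levels of 'gender' in the data.
conds <- data.frame(gender = c("male", "female"))

# One panel per row of 'conds'; points = TRUE overlays the raw data
# used for each panel, which makes the observed age ranges visible.
marginal_effects(fit, effects = "age", conditions = conds, points = TRUE)
```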

@bbbales2, thank you for your notes, but your reply focuses on visualizing the posterior, while I am actually concerned with fitting the model, not representing a model that has already been fit. I’m sorry if that wasn’t clear. My concern is that when you fit smooths to different levels of a categorical variable using the ‘by’ option, the range of the predictor (its minimum and maximum within each level) can be expected to vary across levels. That is true in my sample: female ages range from 2 to 84, and male ages from 3 to 68. When I fit the model and specify gender as the ‘by’ factor, the smooths that brms tries to fit extend across ages 2-84 for both males and females, and not surprisingly, the estimates of the outcome (distance walked) for males aged 67 and over are wildly inaccurate. This makes me doubt the entire smooth across all ages. So my question is: is it possible to specify, within each level, the range of the predictor over which the smooth should be fit? The default is troubling because, in some levels, it fits a smooth across predictor ranges that are not observed, leading to wildly inaccurate estimates.

Oh, I see.

@paul.buerkner can you look at this?

I don’t know if there is some sort of interaction between these two things in the fit. It seems like there shouldn’t be, but I don’t know how to prove that. You can investigate what is happening with make_stancode and make_standata, but Paul might have a clearer answer, or at least a pointer to where to look (which spline library, etc.).

Here’s an example model:


library(tidyverse)  # tibble, mutate, bind_rows, %>%
library(brms)

set.seed(123)  # for reproducible simulated data

# Group "a": x restricted to [0, 1]
df1 = tibble(x = seq(0, 1, length.out = 10),
             y = rnorm(10, 0, 0.1),
             which = "a")

# Group "b": x spans the wider range [-1, 1]
df2 = tibble(x = seq(-1, 1, length.out = 10)) %>%
  mutate(y = rnorm(10, x, 0.1),
         which = "b")

df = bind_rows(df1, df2)

fit = brm(y ~ s(x, by = which), df, cores = 2, chains = 2)

# Inspect the generated Stan code and data
make_stancode(y ~ s(x, by = which), df)
make_standata(y ~ s(x, by = which), df)
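The basis matrices end up in the Stan data under names like Zs_1_1 and Zs_2_1 (the exact names and dimensions depend on brms’s defaults), so you can inspect them directly:

```r
# Continuing from the toy data above
sdata <- make_standata(y ~ s(x, by = which), df)

# Penalized basis matrices for the two by-levels
dim(sdata$Zs_1_1)
dim(sdata$Zs_2_1)
```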

I am fairly certain that the s() splines of mgcv use a separate range per by-level and that what you see in marginal_effects is just a result of the visualization, not of the actual model fitting. Anything else would not be sensible in most cases, and I doubt Simon Wood (the author of mgcv) has made that kind of mistake.

Yeah that is what I was thinking as well. The intention seems to be that the two splines be separate.

It’s been difficult for me to figure out what’s going on, though. It seems like the xs for which == “a” do affect what basis functions you get for which == “b”.

For instance, if you run the above code and change the number of elements in df1, then the basis functions for the second group (data$Zs_2_1) do change.

This makes me think that the spline expansion chosen is somehow a function of the xs for both groups. I don’t understand how the basis is computed though.
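One way to poke at this directly is mgcv’s smoothCon(), which performs the basis construction that brms ultimately relies on. A minimal sketch on my own toy data, mirroring the two-group example above:

```r
library(mgcv)

# Two groups with different x ranges, as in the toy example above
df <- data.frame(
  x = c(seq(0, 1, length.out = 10), seq(-1, 1, length.out = 10)),
  which = factor(rep(c("a", "b"), each = 10))
)

# smoothCon() expands s(x, by = which) into one smooth object per
# factor level of 'which'
sm <- smoothCon(s(x, by = which), data = df)
length(sm)        # expected: one smooth per level
sm[[1]]$by.level  # level this smooth belongs to

# Each smooth's model-matrix rows are zeroed for the other group,
# so group "a"'s coefficients are only informed by group "a"'s data
all(sm[[1]]$X[df$which == "b", ] == 0)
```

At least in this sketch, each level gets its own smooth whose model-matrix rows are zeroed outside that level, while the basis setup sees the pooled x values, which would explain why changing df1 perturbs data$Zs_2_1.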

The brms code looks like it’s doing the intended thing.

@Brian_Wood, this might be a question for mgcv. As I understand it, brms just takes advantage of the stuff they use. I just don’t know enough about knots/splines to say what is happening for sure. It looks to me like something that goes into the basis selection for each group could be affected by all the data lumped together, but I don’t know enough about how these splines work to say for sure. If we assume the approximation is good, I think brms is fitting the right model.

Thank you @bbbales2 and @paul.buerkner, for your insights. To first respond to Paul’s note:

what you see in marginal_effects is just the result of visualization not of actual model fitting

To investigate whether this could be so, I created new plots from the default output of marginal_effects, modified so that they were restricted to show only the smooths across the observed ranges of the predictor variables within each level. If Paul’s hunch is correct, we’d expect these smooths to look plausible across those ranges, and I’m happy to say that indeed, they do look like good smooths that you’d expect if you were fitting these models to each level of the categorical variable separately. In fact, they seem indistinguishable from smooths fit to datasets that are split by the categorical variable prior to model fitting.
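In case it’s useful to others, here is roughly how such a check can be done: marginal_effects() returns a named list of data frames (with estimate__, lower__, and upper__ columns), so each smooth can be truncated to its level’s observed range before plotting. A sketch, where fit and dat are placeholders for the fitted model and the original data:

```r
library(dplyr)
library(ggplot2)

# Hypothetical: 'fit' is the fitted brms model, 'dat' the original data
me <- marginal_effects(fit, effects = "age:gender")[["age:gender"]]

# Observed age range within each gender
obs_range <- dat %>%
  group_by(gender) %>%
  summarise(min_age = min(age), max_age = max(age))

# Keep only predictions inside each level's observed range
me_trimmed <- me %>%
  inner_join(obs_range, by = "gender") %>%
  filter(age >= min_age, age <= max_age)

ggplot(me_trimmed, aes(age, estimate__, fill = gender)) +
  geom_ribbon(aes(ymin = lower__, ymax = upper__), alpha = 0.3) +
  geom_line(aes(colour = gender))
```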

Thanks for your suggestion, @bbbales2, to follow this up with a question to mgcv. I am now searching Cross Validated for mgcv-tagged questions.