# Cross-validation with group-specific variables

Dear Stan Users and Creators,

I am trying to solve the following modeling problem with brms and loo:

Consider a simple linear regression model, where I predict some numerical variable y with another numerical variable x. This is estimated in two groups. For simplicity, I assume that these are female (group = 1) and male respondents (group = 0). y, x and group are stored in an R data frame df.

In my model, I assume that the slope parameter is gender-specific, but that the intercept parameter is not. To estimate such a model in brms, I used the following syntax to first obtain two group-specific predictors:
df\$xm ← df\$x * (1 - df\$group)
df\$xf ← df\$x * (df\$group)

I then estimated the following model:
model_common ← brm(
formula = y ~ xm + xf,
data = df
)

As a next step, I would like to use cross-validation between the two groups. The basic idea is to obtain a check whether the posterior for the intercept parameter is stable over both groups. The obvious challenge is that the predictor variables are now group-specific. My solution would be to first estimate the following two models:

model_m ← brm(
formula = y ~ xm,
data = df[group == 0,]
)

model_f ← brm(
formula = y ~ xf,
data = df[group == 1,]
)

The next step would be to replace the posteriors for the intercept parameter in model_m and model_f with the posterior from model_common, and evaluate the resulting model using the raw data. My question is if there is an easier way to do this, e.g. using some functionality from the loo package? As far as I have seen, this is not the case, but maybe I am overlooking something.

It sounds like you want to compare a model where the intercept is allowed to vary by group to a model where the intercept is not allowed to vary. In your construction, this is as simple as comparing your global model for both groups to a model that additionally contains a fixed effect for gender.

There are notational shortcuts for writing this down in r’s formula synax, including `brms`. You want to compare the formula `y ~ x:group` to the formula `y ~ x*group`. For details, see lm function - RDocumentation or experiment for yourself with, e.g. `brms::brm(Sepal.Length ~ Sepal.Width:Species, data = iris)` and `brms::brm(Sepal.Length ~ Sepal.Width*Species, data = iris)`

1 Like

Dear Jacob,

thanks for this feedback, your solution is very close to the classical (frequentist) treatment of linear regression models in R, which uses likelihood ratio tests or information indices to compare models with and without interaction effects.

The reason why I was thinking of the more complicated approach with cross-validation is that in my actual application, I am working with a continuous covariate (instead of the categorical gender covariate) that corresponds to the time in a time series. In addition, there are regression parameters that are specific for the individual time points. I am therefore interested in the stability of a regression parameter over time using cross-validation. The example above was therefore somewhat simplified, in that I used a categorical instead of a continuous time variable.

Check out the `group` argument in `brms::kfold.brmsfit`. I think that should do what you want.

However, I’d still suggest that you might get the inference you seek more directly by writing down two models and comparing them using approximate leave-one-out CV. I don’t know the form of the time dependence that you are interested in modeling in the regression parameters, but it seems to me almost certain that you want to make a comparison between some model under the restriction that several parameters are equal versus a model without that restriction.

1 Like

Dear Jacob,

thank you for pointing me to brms::kfold.brmsfit. To my understanding, in my initial model this function would fit a regression model to group 0, and evaluate it using cross-validation in group 1, and vice versa. I did not consider this approach for my problem because both groups contain group-specific regression parameters (in my example, the slope), which would not be estimated if the model is only estimated in one group.

I think your second approach would solve my problem. I think a good setup would be to a) first estimate a model with a group-specific slope and an overall intercept parameter, and b) compare it against a model with a group-specific slope and intercept parameter. This architecture can be generalized to many ordered groups with group-specific parameters, which is basically at the core of my problem.