How to integrate 2 grouping factors in the model

zhuang · July 30, 2025, 10:54pm

I am learning modeling with the diamonds data from the tidyverse in R, and I am trying to model price with carat, cut, and color. I processed the data as follows:

library(tidyverse)
library(lme4)

diamonds2 <- diamonds |>
  group_by(cut) |>
  mutate(carat_mean_cut = mean(carat), carat_center_cut = carat - carat_mean_cut) |>
  ungroup() |>
  group_by(color) |>
  mutate(carat_mean_color = mean(carat), carat_center_color = carat - carat_mean_color) |>
  ungroup() |>
  # keep cut and color: the random-effects terms in lmer() group by them
  dplyr::select(price, cut, color, carat_mean_cut, carat_center_cut, carat_mean_color, carat_center_color)

lmer(price ~ carat_mean_cut + carat_center_cut + carat_mean_color + carat_center_color + (carat_center_cut + 1 | cut) + (carat_center_color + 1 | color), data = diamonds2)

I plan to run this model in Stan, but first I need to make sure it is valid, so I tried lmer() as a check. Clearly the model is not ideal: one predictor is dropped because it is perfectly collinear with the other predictors.
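The dropped predictor follows from how the columns are built: carat_mean_cut + carat_center_cut and carat_mean_color + carat_center_color both add back up to carat, so the four fixed-effect columns are linearly dependent. A minimal base R sketch of that check, using a small synthetic stand-in for the diamonds data (the toy data frame and its values are assumptions for illustration, not the real data):

```r
# Synthetic stand-in for the diamonds data (hypothetical values, base R only).
set.seed(1)
n <- 200
toy <- data.frame(
  carat = runif(n, 0.2, 3),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE)),
  color = factor(sample(c("D", "E", "F", "G"), n, replace = TRUE))
)

# Group means and within-group deviations, mirroring the dplyr pipeline.
# ave() returns the group mean for each row by default.
toy$carat_mean_cut     <- ave(toy$carat, toy$cut)
toy$carat_center_cut   <- toy$carat - toy$carat_mean_cut
toy$carat_mean_color   <- ave(toy$carat, toy$color)
toy$carat_center_color <- toy$carat - toy$carat_mean_color

# Both pairs sum back to carat, so this difference is zero up to rounding.
dependency <- with(toy, carat_mean_cut + carat_center_cut -
                        carat_mean_color - carat_center_color)
max_abs_dependency <- max(abs(dependency))

# Fixed-effects design matrix: intercept plus the four predictors.
X <- model.matrix(~ carat_mean_cut + carat_center_cut +
                    carat_mean_color + carat_center_color, data = toy)
design_rank <- qr(X)$rank

print(max_abs_dependency)
print(c(columns = ncol(X), rank = design_rank))
```

If the rank comes out one less than the number of columns, one coefficient cannot be estimated, which is consistent with lmer() dropping a predictor.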

When analyzed separately as grouping factors, cut and color both look good in Stan.

But I was wondering whether there is another way to construct the model that integrates both cut and color as grouping factors. Thank you.

caesoma · August 2, 2025, 6:59am

It’s great that you are getting into writing statistical models and want to use Stan for it. Learning R is useful for running them with rstan or cmdstanr, but while many Stan users use R, many others use Python and other languages. Others still, like me, use R less frequently, are used to the old-school base R syntax, and have to adjust to the pipe syntax that seems to be popular with newer R users.

Either way, the good thing about Stan (and MCMC in general) is that issues like collinearity of predictors will likely show up as correlated samples or issues with the sampling chain itself.

Without knowing further detail, we can say that a model may be valid, but the actual predictors used may be correlated, or otherwise the data may not allow confident inference.

Each factor looking good when analyzed separately wouldn’t be a good test of whether the predictors would work well together in a model: if they are correlated, using both could add little information. But once again, the sampling chain could give you more information about what is happening.

Whether there is another way to construct the model would really depend on what the issue is: how correlated the predictors are, and how that affects inference. For similar syntax, you could probably consider using brms (I vaguely remember there being a stan_glmer, but I have never used it).
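As a rough illustration of the brms route, here is one possible specification with cut and color as crossed grouping factors. It is a sketch under assumptions, not something settled in this thread: it uses a single carat fixed effect instead of the four centered predictors, and fit_crossed is a hypothetical helper name, not a brms function.

```r
# Crossed grouping factors: cut and color each get their own varying
# intercept and varying carat slope. Only one carat fixed effect is used,
# which avoids the linear dependence among the four centered predictors.
crossed_formula <- price ~ carat + (1 + carat | cut) + (1 + carat | color)

# Hypothetical helper (not part of brms): fits the model when called.
# Requires the brms package; sampling on the full data can be slow.
fit_crossed <- function(data, ...) {
  brms::brm(crossed_formula, data = data, family = gaussian(), ...)
}

# The same formula can also be passed to lme4 as a quick check:
# lme4::lmer(crossed_formula, data = ggplot2::diamonds)
```

A call such as fit_crossed(ggplot2::diamonds, chains = 4, cores = 4) would run the sampler. With only 5 levels of cut and 7 of color, the group-level standard deviations are estimated from very few groups, so the priors on them may matter and the chains are worth inspecting.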
