Regression, categorical variables with >2 levels

Jacob_Moore · May 14, 2024, 1:32pm

A bit of a novice question: what’s the right way to model categorical variables with 3+ levels in the context of regression models?

In the frequentist analog, statistical software would return a t-test for each individual level and an F-test for the categorical variable as a whole.

Suppose that there are four variables: height, weight, age and sport. Sport has four levels: baseball, football, soccer and basketball.
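For concreteness, the frequentist output described above can be reproduced in base R. This is only a sketch on simulated data: treating height as the outcome, the data frame name `athletes`, and every effect size below are assumptions made for illustration, not something stated in this thread.

```r
# Simulated data; all names and numbers are invented for illustration.
set.seed(1)
n <- 200
athletes <- data.frame(
  sport  = factor(sample(c("baseball", "basketball", "football", "soccer"),
                         n, replace = TRUE)),
  weight = rnorm(n, mean = 80, sd = 10),
  age    = round(runif(n, 18, 35))
)

# Assumed (hypothetical) shift in height for each sport
sport_shift <- c(baseball = 0, basketball = 12, football = 4, soccer = -1)
athletes$height <- 150 + 0.3 * athletes$weight +
  unname(sport_shift[as.character(athletes$sport)]) +
  rnorm(n, sd = 5)

fit_full    <- lm(height ~ weight + age + sport, data = athletes)
fit_reduced <- lm(height ~ weight + age, data = athletes)

# One t-test per non-reference level, each compared with the reference level
coef_table <- coef(summary(fit_full))
sport_rows <- grep("^sport", rownames(coef_table), value = TRUE)
print(coef_table[sport_rows, ])

# One F-test for sport as a whole: full model vs. model without sport
f_test <- anova(fit_reduced, fit_full)
print(f_test)
```

With four levels there are three sport coefficients, so the F-test has three numerator degrees of freedom.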

In Bayesian analysis, are we even interested in the question, “what’s the effect of sport?” (Without selecting any given sport.)

And as an aside, perhaps this is a good opportunity for multilevel modeling: a global coefficient for sport could be inferred, as well as one coefficient for each of baseball, football, …

javims · May 15, 2024, 9:14am

You have two main options.
The first is to assume that each sport has its own independent effect on the response, so you would create dummy variables for the sports. The second is to assume that the sports' effects are proportional to one another. In the latter case, you would recode your variable numerically.

E.g. if we are interested in inferring the effect of playing a sport on heart BPM, you can assume that the effect for football will be double that for basketball and four times that for badminton. The effect for basketball is then double that for badminton, and you have the assumptions needed to code your variable. In that case, the coding would be 1: badminton, 2: basketball, 4: football.

The first option is always safer; however, you lose the parsimony of the second, which is worth having if you have the field knowledge to justify it.
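A minimal base R sketch of the two codings, using a made-up five-observation vector. The numeric codes follow from the ratios assumed in the example (football twice basketball and four times badminton, giving 1, 2 and 4); the variable names `dummies`, `ratio` and `sport_num` are just illustrative.

```r
# Toy data, invented for illustration
sport <- factor(c("badminton", "basketball", "football", "football", "badminton"))

# Option 1: dummy (treatment) coding.
# model.matrix() creates one indicator column per non-reference level;
# badminton, the first level, is absorbed into the intercept.
dummies <- model.matrix(~ sport)
print(dummies)

# Option 2: a single numeric column that encodes the assumed 1:2:4 ratio.
# Only one slope is estimated, and the ratio between sports is fixed by us.
ratio <- c(badminton = 1, basketball = 2, football = 4)
sport_num <- unname(ratio[as.character(sport)])
print(sport_num)
```

Option 1 spends one coefficient per non-reference sport, while option 2 spends a single coefficient but is only as good as the assumed ratio.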

MarijnG · May 15, 2024, 9:18am

Hi Jacob,

Could you provide some more information regarding your outcome variable?

Regarding the categorical variable, you can just add it to the model. It will take one of the categories as the reference category, probably baseball, because it comes first alphabetically. You can then use your model to compare the means of the categories (adjusted for the other variables), for example with the emmeans package or the hypothesis function of brms.

Whether it is interesting to answer the question “what’s the effect of sport?” depends on your research questions, of course. What question do you aim to answer with this data?

emmeans: means <- emmeans(NAMEOFYOURMODEL, pairwise ~ sport)

Hypothesis function (brms names each coefficient as the variable name followed by the level, hence sportsoccer): q1 <- c(q1 = "Intercept > (Intercept + sportsoccer)")
q1_answer <- hypothesis(NAMEOFYOURMODEL, q1)
q1_answer
plot(q1_answer)

You can adjust the hypothesis function to compare different sports; the comparison will control for the other variables.
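If you would rather have a different reference category than the alphabetical default, base R's relevel() changes it before fitting. A small sketch, assuming the factor is called `sport`; as far as I know brms follows R's usual factor conventions here, but treat that as an assumption and check the coefficient names in your own model summary.

```r
sport <- factor(c("soccer", "baseball", "football", "basketball"))

# Default: levels are sorted alphabetically, so baseball is the reference
default_ref <- levels(sport)[1]
print(default_ref)

# Make soccer the reference instead
sport2 <- relevel(sport, ref = "soccer")
new_ref <- levels(sport2)[1]
print(new_ref)

# The remaining levels become the coefficient columns
coef_names <- colnames(model.matrix(~ sport2))
print(coef_names)
```

Each coefficient is then a comparison with soccer rather than with baseball.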
