Regression, categorical variables with >2 levels

A bit of a novice question: what’s the right way to model categorical variables with 3+ levels in a regression model?

In the frequentist analog, statistical software would return a t-test for each individual (non-reference) level and an F-test for the categorical variable as a whole.

Suppose that there are four variables, height, weight, age and sport; sport has four levels, baseball, football, soccer and basketball.

In Bayesian analysis, are we even interested in the question, “what’s the effect of sport?” (Without selecting any given sport.)

And as an aside, perhaps this is a good opportunity for multilevel modeling; a global coefficient for sport could be inferred, as well as one coefficient for each of baseball, football, …

You have two main options.
The first is to let each sport have its own independent effect on the response, which means creating dummy variables for the sports (one per level other than the reference, if the model has an intercept). The second is to assume that the sports' effects are proportional to one another, in which case you recode the variable as a single numeric score.

E.g., if we are interested in the effect of playing a sport on heart rate (BPM), you might assume that the effect of football is double that of basketball and four times that of badminton. It follows that the effect of basketball is double that of badminton, and together these give you the assumptions needed to code the variable. In that case the coding would be 1: badminton, 2: basketball, 4: football.

The first option is always safer; however, you lose the parsimony of the second, which is worth having when you have the domain knowledge to justify it.
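
To make the two codings concrete, here is a minimal base R sketch. The data are simulated and every name and number in it (`bpm`, `sport_score`, the effect sizes) is made up for illustration; it uses `lm` only to show the design, not as a recommendation over a Bayesian fit.

```r
# Simulated data: all names and numbers are hypothetical.
set.seed(1)
sports <- c("badminton", "basketball", "football")
d <- data.frame(
  sport = factor(sample(sports, 90, replace = TRUE), levels = sports)
)
true_effect <- c(badminton = 2, basketball = 4, football = 8)
d$bpm <- 60 + unname(true_effect[as.character(d$sport)]) + rnorm(90, sd = 3)

# Option 1: dummy (treatment) coding.
# Badminton is the reference; the other two levels each get a free coefficient.
X_dummy <- model.matrix(~ sport, data = d)
colnames(X_dummy)  # "(Intercept)" "sportbasketball" "sportfootball"
fit_dummy <- lm(bpm ~ sport, data = d)

# Option 2: one numeric score encoding the assumed 1 : 2 : 4 ratio.
score <- c(badminton = 1, basketball = 2, football = 4)
d$sport_score <- unname(score[as.character(d$sport)])
fit_score <- lm(bpm ~ sport_score, data = d)

# The dummy model spends one more parameter than the score model.
length(coef(fit_dummy))
length(coef(fit_score))
```

The score model is only as good as the assumed ratio. If the ratio is wrong, the single slope cannot recover the per-sport differences, whereas the dummy model can.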

Hi Jacob,

Could you provide some more information regarding your outcome variable?

Regarding the categorical variable, you can just add it to the model. One of the categories will be taken as the reference category, probably baseball, because it comes first alphabetically. You can then, based on your model, compare the means of the categories (adjusted for the other variables) using, for example, the emmeans package or the hypothesis function of brms.
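
If you want to check or change which category ends up as the reference, a small base R sketch like the following may help. The vector `sport` here is a toy stand-in for your own column, and picking soccer as the new reference is just an example.

```r
# Toy stand-in for the sport column (hypothetical data).
sport <- factor(c("soccer", "baseball", "football", "basketball", "soccer"))

# Factor levels are sorted alphabetically by default,
# so the first level (the reference) is baseball.
levels(sport)

# relevel() moves the chosen level to the front, making it the reference.
sport2 <- relevel(sport, ref = "soccer")
levels(sport2)

# The dummy columns now describe each sport relative to soccer.
colnames(model.matrix(~ sport2))
```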

Whether it is interesting to answer the question “what’s the effect of sport?” depends on your research questions, of course. What question do you aim to answer with this data?

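emmeans: means <- emmeans(NAMEOFYOURMODEL, pairwise ~ sport)

Hypothesis function (brms names the coefficient after the factor and its level, so soccer becomes sportsoccer if your variable is called sport):
q1 <- c(q1 = "Intercept > (Intercept + sportsoccer)")
q1_answer <- hypothesis(NAMEOFYOURMODEL, q1)
q1_answer
plot(q1_answer)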

You can adjust the hypothesis to compare different sports; the comparison will control for the other variables in the model.