Using categorical predictors

Hello!

I’m learning how to use RStan and I already did my first multiple linear regression and 2 point predictions. Now I’m trying to go a bit further by using categorical predictors. I transformed the categorical predictor into k-1 binary predictors (where k is the number of categories in that predictor), but a wild noob question appeared: what priors are recommended for binary predictors?

Also, how can I make RStan distinguish binary, integer, and continuous predictors in the same predictor matrix?

I will try to answer this in part but keep in mind that I am a novice with Stan and I may not get everything right.
Setting priors is a big topic and it is hard to cover every situation in a single post. In the context of linear regression and a binary predictor, keep in mind that the prior describes the size of the effect when the category coding goes from zero to one. If you are modeling the effect of a category on the height of an adult human measured in meters, a plausible effect would be much less than 1 in absolute value, so a standard normal prior would not be very informative at all. If you are modeling a categorical effect on adult human weight in kg, a coefficient of 10 might not be out of the question, and a standard normal prior in that case might be very strong, since it puts almost no mass on effects of that size.
Your domain knowledge should guide you towards what is plausible and what is not. You may not know whether a category has an effect but you probably can put some bounds on effects that go beyond surprising into the realm of unbelievable.
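To make the scale argument concrete, here is a rough worked version. The symbols ($\alpha$, $\beta$, $d_i$, $\varepsilon_i$) are notation assumed for illustration and do not come from the original question. With a 0/1 indicator $d_i$ and the other predictors held fixed, the coefficient is the difference between the two group means, and a standard normal prior puts most of its mass within 2 units of zero:

$$
\begin{aligned}
y_i &= \alpha + \beta d_i + \dots + \varepsilon_i, \\
\mathbb{E}[y_i \mid d_i = 1] - \mathbb{E}[y_i \mid d_i = 0] &= \beta, \\
\beta \sim \mathrm{normal}(0, 1) \;&\Rightarrow\; \Pr(|\beta| \le 1) \approx 0.68, \quad \Pr(|\beta| \le 2) \approx 0.95 .
\end{aligned}
$$

Read in meters, that prior treats a 1 m height difference between categories as unremarkable. Read in kg, it gives a difference of more than 2 kg only about a 5% prior probability.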
I hope that helps.

The prior distribution of the coefficients can be any continuous distribution, right? Of course these distributions and their bounds should make sense.

But how can I declare in Stan a matrix composed of binary, integer, and continuous variables (column vectors)?

I am not sure I understand your problem with the matrix. I think vectors and matrices are always real. Version 2.23 of the Reference Manual says:

"Vectors and matrices cannot be typed to return integer values. They are restricted to real values."

A column of a matrix may happen to contain integer values but the variable type is real. Are you encountering an error or is this a problem you are expecting?
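If it helps to keep the integer nature of a column explicit, one option is to pass it as an integer array and build the matrix inside Stan. This is only a sketch: the names d and x are hypothetical, and the array[N] int syntax assumes Stan 2.26 or later (older releases such as 2.23 wrote it as int<lower=0, upper=1> d[N];).

```stan
data {
  int<lower=1> N;
  // Hypothetical names: one 0/1 indicator and one continuous predictor.
  array[N] int<lower=0, upper=1> d;  // integer-typed, range-checked on input
  vector[N] x;                       // real-valued
}
transformed data {
  // Assemble the design matrix; the integers are promoted to reals here,
  // so X is an ordinary real matrix whose first column holds 0.0 or 1.0.
  matrix[N, 2] X = append_col(to_vector(d), x);
}
```

Either way the model sees reals. The only gain from the integer declaration is that Stan rejects values outside 0/1 when the data are read in.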

No problem at all with the software. I was just wondering whether I was doing it correctly, since I’m trying to use 1 categorical predictor (i.e. k-1 binary predictors), 3 continuous predictors, and 1 continuous response.

So would declaring a matrix for these 4 predictor variables and a vector for the 1 continuous response, as below, be enough?

data {
  int<lower=1> N;
  int<lower=1> K;
  vector[N] y;
  matrix[N,K] X;
}

Yes, that seems fine. I will admit to a nagging fear that I am forgetting something, but your data setup seems completely reasonable.
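For completeness, here is a minimal sketch of how the rest of the program could look with that data block. The parameter names (alpha, beta, sigma) and all three priors are assumptions for illustration, not something stated in this thread. The prior scales in particular should be chosen for the units of y. Note that K counts columns of X, so with k-1 dummy columns and 3 continuous predictors it would be (k-1) + 3.

```stan
data {
  int<lower=1> N;        // number of observations
  int<lower=1> K;        // number of columns in X (dummies + continuous)
  vector[N] y;           // continuous response
  matrix[N, K] X;        // design matrix; 0/1 columns are stored as reals
}
parameters {
  real alpha;            // intercept
  vector[K] beta;        // one coefficient per column of X
  real<lower=0> sigma;   // residual standard deviation
}
model {
  // Placeholder priors (assumption): these only make sense if y and the
  // continuous columns of X are on roughly unit scale; rescale otherwise.
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  // Linear regression likelihood.
  y ~ normal(alpha + X * beta, sigma);
}
```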

You always have to consider the prior in the context of the response variable and likelihood (see https://arxiv.org/abs/1708.07487). AFAIK, the type of the predictor (continuous or binary) generally matters less, as long as the predictors are roughly on the same scale (i.e. all continuous predictors are scaled/normalized). Still, you generally want the prior to reflect the values that the parameter could reasonably be expected to take. For example, in social science research, if you have a scaled continuous response and scaled continuous/binary predictors, it’s rare to see multiple regression coefficients larger than 0.2-0.3 in absolute value, so a normal(0, 1) prior could be considered weakly informative, in the sense that you’d be surprised by parameter values falling outside the (-2, 2) range.
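A hedged sketch of that scaled setup, done inside Stan so the normal(0, 1) priors apply to standardized variables. It is not a recommendation from the posters: it assumes the dummy and continuous columns arrive as two hypothetical matrices X_bin and X_con, leaves the 0/1 columns unscaled, and uses placeholder names and priors throughout.

```stan
data {
  int<lower=2> N;              // sd() needs at least two observations
  int<lower=1> K_bin;          // hypothetical: number of 0/1 dummy columns
  int<lower=1> K_con;          // hypothetical: number of continuous columns
  vector[N] y;
  matrix[N, K_bin] X_bin;
  matrix[N, K_con] X_con;
}
transformed data {
  // Standardize the response to mean 0, standard deviation 1.
  vector[N] y_std = (y - mean(y)) / sd(y);
  matrix[N, K_bin + K_con] X;
  // Dummy columns go in unchanged (an assumption; some people center them).
  for (k in 1:K_bin) {
    X[:, k] = X_bin[:, k];
  }
  // Continuous columns are standardized one at a time.
  // A constant column would have sd 0 and must be dropped beforehand.
  for (k in 1:K_con) {
    X[:, K_bin + k] = (X_con[:, k] - mean(X_con[:, k])) / sd(X_con[:, k]);
  }
}
parameters {
  real alpha;
  vector[K_bin + K_con] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);         // weakly informative on the standardized scale
  sigma ~ exponential(1);      // assumption, not from the thread
  y_std ~ normal(alpha + X * beta, sigma);
}
```

Coefficients from this version are in standard-deviation units of y, so they need to be multiplied by sd(y) to read them in the original units.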
