Representing categorical predictor variables

Suppose I have a regression problem y_{i} \sim N(\alpha + x_{i}\beta, \sigma), where for each sample i I observe some continuous y_{i} and x_{i}. If now each sample i also belongs to a group kk_{i} and I expect \beta to vary by group, am I allowed to write the model like this?


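data {
  int<lower=0> N;                // number of observations
  real y[N];                     // continuous outcome
  real x[N];                     // continuous predictor
  int<lower=1> K;                // number of groups
  int<lower=1, upper=K> kk[N];   // group index of each observation
}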
 
parameters {
  real alpha;
  real<lower=0> sigma;
  vector[K] beta;
}

model {
  for (i in 1:N)
    y[i] ~ normal(alpha + beta[kk[i]] * x[i], sigma);
}

What confuses me is that most people seem to use a design matrix with K-1 columns to represent a categorical variable with K levels, whereas here I have a K-dimensional vector \beta plus the intercept. I’ve tried it with some fake data and it seems to recover the original values, so is there anything obvious I’m missing (aside from things like computational speed)?

The K-1 is a result of resolving identifiability issues. There are a few good threads referenced in this issue that might help. I also found this vignette by @rtrangucci helpful.
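
To make the identifiability point a bit more concrete, here is a minimal Python sketch. It is not from any of the linked threads; the helper names and the toy numbers are made up purely for illustration. It compares two parameterizations: a global intercept plus K group intercepts, and the model from the question (one shared intercept, K group slopes). In both cases it moves a constant c from the group effects into the intercept and checks whether the predicted means change.

```python
# Toy data: hypothetical values chosen only for illustration.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
kk = [0, 0, 1, 1, 2, 2]  # 0-based group index of each observation (K = 3)


def predict_group_intercepts(alpha, gamma):
    """Mean of y when each group k has its own intercept offset gamma[k]."""
    return [alpha + gamma[k] for k in kk]


def predict_group_slopes(alpha, beta):
    """Mean of y for the model in the question: alpha + beta[kk[i]] * x[i]."""
    return [alpha + beta[k] * xi for k, xi in zip(kk, x)]


def shift(alpha, effects, c):
    """Move a constant c from the group effects into the global intercept."""
    return alpha + c, [e - c for e in effects]


alpha, effects, c = 1.0, [0.25, -0.5, 2.0], 0.75
alpha_s, effects_s = shift(alpha, effects, c)

# Global intercept + K group intercepts: the shift is invisible in the
# predictions, so alpha and the K offsets are not separately identified.
same_intercepts = (predict_group_intercepts(alpha, effects)
                   == predict_group_intercepts(alpha_s, effects_s))

# Shared intercept + K group slopes: the same shift changes the predictions,
# because beta[k] is multiplied by an x that varies within each group.
same_slopes = (predict_group_slopes(alpha, effects)
               == predict_group_slopes(alpha_s, effects_s))

print(same_intercepts, same_slopes)
```

In this toy setup the first comparison comes out equal and the second does not. For the group-intercept case, dropping one level (the K-1 coding) or imposing a sum-to-zero constraint is the usual way to remove the redundant direction.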

@martinmodrak wrote up a nice piece on non-identifiability a while back that might come in handy
https://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/

Thank you all for your replies! I expected it to have something to do with identifiability, but I find it really hard to see where the problem comes from. If I imagine just splitting the dataset by group, I would also get K different estimates for \beta.

Hi,

perhaps @Max_Mantei’s post will make things more concrete (it did help me):