Which of the following specifications for linear regression on categorical variables is more efficient?
If I one hot encode a categorical value
data {
int n;
int<lower=2> num_levels;
array[n] int<lower=1,upper=num_levels> level;
matrix[num_levels, 3] X;
vector[n] actual_value;
}
transformed data {
matrix[n, 3 + num_levels - 1] X_tot;
for (i in 1:n) {
X_tot[i, 1] = X[i, 1];
X_tot[i, 2] = X[i, 2];
X_tot[i, 3] = X[i, 3];
for (j in 1:num_levels - 1) {
if level[i] = j {
X_tot[3 + j] = 1.0;
} else {
X_tot[3 + j] = 0.0;
}
}
}
}
parameters {
vector[3 + num_levels - 1] beta;
real intercept;
real<lower=0> sigma;
}
transformed parameters {
vector[n] expected_value = X_tot * beta + intercept;
}
model {
beta ~ normal(0, 1);
sigma ~ student_t(3, 0, 1);
actual_value ~ normal(expected_value, sigma);
}
if I select the coefficient for a particular categorical level by indexing in a loop, the way hierarchical models are usually coded
data {
int n;
int<lower=2> num_levels;
array[n] int<lower=1,upper=num_levels> level;
matrix[num_levels, 3] X;
vector[n] actual_value;
}
parameters {
vector[3] beta;
vector[num_levels - 1] beta_levels;
real intercept;
real<lower=0> sigma;
}
transformed parameters {
vector[n] y = X * beta;
vector[n] expected_value;
for (i in 1:n) {
if (level[i] != num_levels) {
expected_value[i] = intercept + y[i] + beta_levels[level[i]];
} else {
expected_value[i] = intercept + y[i];
}
}
}
model {
beta ~ normal(0, 1);
beta_levels ~ normal(0, 1);
sigma ~ student_t(3, 0, 1);
actual_value ~ normal(expected_value, sigma);
}
My intuition is that the first representation is faster because it doesn’t have to allocate data structures for intermediate representations in every transformed parameters loop, instead just using one fast matrix multiplication. On the other hand, the second one would probably use less memory since it doesn’t have to build up a halfway-sparse matrix.
As an aside, I can’t think of a better way to build the data matrix than the nested loop. It seems kind of bad, but it’s in the transformed data block so whatever. Any ideas on a better way?