One-hot encoding or indexed coefficients

Which of the following specifications for linear regression on categorical variables is more efficient?

If I one-hot encode the categorical variable:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;    // one row of continuous covariates per observation

    vector[n] actual_value;
}
transformed data {
    matrix[n, 3 + num_levels - 1] X_tot;
    for (i in 1:n) {
        X_tot[i, 1] = X[i, 1];
        X_tot[i, 2] = X[i, 2];
        X_tot[i, 3] = X[i, 3];
        // one-hot dummies for the first num_levels - 1 levels (the last level is the baseline)
        for (j in 1:num_levels - 1) {
            if (level[i] == j) {
                X_tot[i, 3 + j] = 1.0;
            } else {
                X_tot[i, 3 + j] = 0.0;
            }
        }
    }
}
parameters {
    vector[3 + num_levels - 1] beta;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] expected_value = X_tot * beta + intercept;
}
model {
    beta ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

versus if I select the coefficient for a particular categorical level by indexing in a loop, the way hierarchical models are usually coded:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;

    vector[n] actual_value;
}
parameters {
    vector[3] beta;
    vector[num_levels - 1] beta_levels;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] y = X * beta;
    vector[n] expected_value;
    for (i in 1:n) {
        if (level[i] != num_levels) {
            expected_value[i] = intercept + y[i] + beta_levels[level[i]];
        } else {
            expected_value[i] = intercept + y[i];
        }
    }
}
model {
    beta ~ normal(0, 1);
    beta_levels ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

My intuition is that the first representation is faster because it uses a single fast matrix multiplication instead of allocating intermediate data structures on every pass through the transformed parameters loop. On the other hand, the second one probably uses less memory, since it doesn’t have to build up a halfway-sparse design matrix.
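(A side note that isn’t from the thread: if the dummy block of X_tot is large and mostly zeros, one way to keep the single matrix multiplication while cutting the memory cost is Stan’s compressed-row-storage functions. The csr_* functions are part of Stan, but the structure below is only a sketch and assumes X_tot is built exactly as in the first model.)

transformed data {
    // Sketch: store the mostly-zero design matrix in compressed row storage
    // so the multiply only touches nonzero entries.
    int nnz = num_elements(csr_extract_w(X_tot));
    vector[nnz] w = csr_extract_w(X_tot);
    array[nnz] int v = csr_extract_v(X_tot);
    array[n + 1] int u = csr_extract_u(X_tot);
}
transformed parameters {
    // equivalent to X_tot * beta + intercept, but skips the zeros
    vector[n] expected_value
        = csr_matrix_times_vector(n, 3 + num_levels - 1, w, v, u, beta) + intercept;
}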

As an aside, I can’t think of a better way to build the data matrix than the nested loop. It seems kind of bad, but it’s in the transformed data block so whatever. Any ideas on a better way?
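One possibility (just a sketch, not from the thread): build the dummy block as an all-zero matrix, set a single entry per row, and glue it onto X with append_col. That replaces the nested loop with one loop and one call:

transformed data {
    // one-hot dummies for the first num_levels - 1 levels (the last level is the baseline)
    matrix[n, num_levels - 1] dummies = rep_matrix(0, n, num_levels - 1);
    for (i in 1:n) {
        if (level[i] < num_levels) {
            dummies[i, level[i]] = 1;
        }
    }
    matrix[n, 3 + num_levels - 1] X_tot = append_col(X, dummies);
}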

Re: the second representation, the explicit if/else branches in transformed parameters collapse if we redefine beta_levels, padding it with an additional artificial beta-levels “parameter” that is constrained to always take the constant value zero.

Pinning a parameter to a constant value does not seem to be allowed by Stan, but adding an extra layer of indirection gives something that compiles:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;

    vector[n] actual_value;
}
parameters {
    vector[3] beta;
    vector[num_levels-1] beta_levels;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    // pad with a trailing zero so the last level serves as the reference level
    vector[num_levels] beta_levels_;
    beta_levels_[1:num_levels - 1] = beta_levels;
    beta_levels_[num_levels] = 0;

    vector[n] expected_value = intercept + (X * beta) + beta_levels_[level];
}
model {
    beta ~ normal(0, 1);
    beta_levels ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

I don’t have intuition about the relative performance. If you’ve got the time, implement them all and see what’s happening with a CPU profiler!
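Recent Stan releases also have built-in profiling (profile blocks), which attribute both forward and gradient time to labeled sections, so you may not even need an external CPU profiler. A minimal sketch against the last model (the label string is arbitrary; timings go to the CSV requested via CmdStan’s profile_file output option):

transformed parameters {
    vector[num_levels] beta_levels_;
    vector[n] expected_value;
    profile("expected_value") {
        // time just the linear-predictor computation
        beta_levels_[1:num_levels - 1] = beta_levels;
        beta_levels_[num_levels] = 0;
        expected_value = intercept + (X * beta) + beta_levels_[level];
    }
}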
