One-hot encoding or indexed coefficients

Which of the following specifications for linear regression on categorical variables is more efficient?

If I one-hot encode the categorical variable:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;    // one row of continuous covariates per observation

    vector[n] actual_value;
}
transformed data {
    matrix[n, 3 + num_levels - 1] X_tot;
    for (i in 1:n) {
        X_tot[i, 1] = X[i, 1];
        X_tot[i, 2] = X[i, 2];
        X_tot[i, 3] = X[i, 3];
        // one-hot dummies for the first num_levels - 1 levels (the last level is the baseline)
        for (j in 1:num_levels - 1) {
            if (level[i] == j) {
                X_tot[i, 3 + j] = 1.0;
            } else {
                X_tot[i, 3 + j] = 0.0;
            }
        }
    }
}
parameters {
    vector[3 + num_levels - 1] beta;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] expected_value = X_tot * beta + intercept;
}
model {
    beta ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

versus if I select the coefficient for a particular categorical level by indexing in a loop, the way hierarchical models are usually coded:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;

    vector[n] actual_value;
}
parameters {
    vector[3] beta;
    vector[num_levels - 1] beta_levels;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] y = X * beta;
    vector[n] expected_value;
    for (i in 1:n) {
        if (level[i] != num_levels) {
            expected_value[i] = intercept + y[i] + beta_levels[level[i]];
        } else {
            expected_value[i] = intercept + y[i];
        }
    }
}
model {
    beta ~ normal(0, 1);
    beta_levels ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

My intuition is that the first representation is faster because it uses a single fast matrix multiplication instead of allocating intermediate data structures on every pass through the transformed parameters loop. On the other hand, the second one probably uses less memory, since it doesn’t have to build up a halfway-sparse design matrix.
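(A side note that isn’t from the thread: if the dummy block of X_tot is large and mostly zeros, one way to keep the single matrix multiplication while cutting the memory cost is Stan’s compressed-row-storage functions. The csr_* functions are part of Stan, but the structure below is only a sketch and assumes X_tot is built exactly as in the first model.)

transformed data {
    // Sketch: store the mostly-zero design matrix in compressed row storage
    // so the multiply only touches nonzero entries.
    int nnz = num_elements(csr_extract_w(X_tot));
    vector[nnz] w = csr_extract_w(X_tot);
    array[nnz] int v = csr_extract_v(X_tot);
    array[n + 1] int u = csr_extract_u(X_tot);
}
transformed parameters {
    // equivalent to X_tot * beta + intercept, but skips the zeros
    vector[n] expected_value
        = csr_matrix_times_vector(n, 3 + num_levels - 1, w, v, u, beta) + intercept;
}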

As an aside, I can’t think of a better way to build the data matrix than the nested loop. It seems kind of bad, but it’s in the transformed data block so whatever. Any ideas on a better way?
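One possibility (just a sketch, not from the thread): build the dummy block as an all-zero matrix, set a single entry per row, and glue it onto X with append_col. That replaces the nested loop with one loop and one call:

transformed data {
    // one-hot dummies for the first num_levels - 1 levels (the last level is the baseline)
    matrix[n, num_levels - 1] dummies = rep_matrix(0, n, num_levels - 1);
    for (i in 1:n) {
        if (level[i] < num_levels) {
            dummies[i, level[i]] = 1;
        }
    }
    matrix[n, 3 + num_levels - 1] X_tot = append_col(X, dummies);
}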

Re: the second representation, the explicit if/else branches in transformed parameters collapse if we redefine beta_levels, padding it with an additional artificial beta-levels “parameter” that is constrained to always take the constant value zero.

Pinning a parameter to a constant value does not seem to be allowed by Stan, but adding an extra layer of indirection gives something that compiles:

data {
    int n;
    int<lower=2> num_levels;
    array[n] int<lower=1,upper=num_levels> level;

    matrix[n, 3] X;

    vector[n] actual_value;
}
parameters {
    vector[3] beta;
    vector[num_levels-1] beta_levels;
    real intercept;
    real<lower=0> sigma;
}
transformed parameters {
    // pad with a trailing zero so the last level serves as the reference level
    vector[num_levels] beta_levels_;
    beta_levels_[1:num_levels - 1] = beta_levels;
    beta_levels_[num_levels] = 0;

    vector[n] expected_value = intercept + (X * beta) + beta_levels_[level];
}
model {
    beta ~ normal(0, 1);
    beta_levels ~ normal(0, 1);
    sigma ~ student_t(3, 0, 1);
    actual_value ~ normal(expected_value, sigma);
}

I don’t have intuition about the relative performance. If you’ve got the time, implement them all and see what’s happening with a CPU profiler!
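Recent Stan releases also have built-in profiling (profile blocks), which attribute both forward and gradient time to labeled sections, so you may not even need an external CPU profiler. A minimal sketch against the last model (the label string is arbitrary; timings go to the CSV requested via CmdStan’s profile_file output option):

transformed parameters {
    vector[num_levels] beta_levels_;
    vector[n] expected_value;
    profile("expected_value") {
        // time just the linear-predictor computation
        beta_levels_[1:num_levels - 1] = beta_levels;
        beta_levels_[num_levels] = 0;
        expected_value = intercept + (X * beta) + beta_levels_[level];
    }
}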
