tl;dr: I’ve got a multivariable linear regression example working which uses multi-threading to slice the data. I’d like to try slicing the design matrix instead (because it’s very big) but I’m struggling to produce a working example. Any help would be hugely appreciated!
Hi there! I’m having some trouble getting a multi-threading set-up for a linear regression working where I slice the design matrix instead of the data (I’ve got ~30 predictors and >200,000 observations so it’s pretty hefty).
So I’ve got this basic set-up where I slice the data and which works fine (though only ~1.5x speed-up for 3 threads so if that seems like poor performance and you have comments on this code, please say!):
functions {
real partial_sum(array[] real y_slice,
int start, int end,
matrix x, real alpha, vector beta, real sigma) {
return normal_lpdf( y_slice | x[start:end,:] * beta + alpha, sigma);
}
}
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
array[N] real y; // outcome vector
int<lower=1> grainsize; // grainsize for threading
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta, sigma);
}
I was inspired by this previous post to think about possibly slicing other parts of the model, specifically the design matrix (and avoid copying over the whole huge matrix into each threaded process).
I’m struggling currently to get something that even compiles though - particularly struggling to get all the types right to make it actually work. Does anyone have suggestions for how to modify the code below?
functions{
real partial_sum(vector[] x_slice,
int start, int end,
vector y, vector beta, real sigma) {
int N = (end - start + 1);
vector[N] mu = to_matrix(x_slice) * beta;
return normal_lpdf( y[start:end] | mu, sigma);
}
}
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
int<lower=1> grainsize; // grainsize for threading
}
transformed data {
row_vector[N] x2[K]; //
for (i in 1:K) {
x2[i, ] = to_row_vector(x[, i]);
}
}
parameters {
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model{
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, x2, grainsize, y, beta, sigma);
}
Any help would be hugely appreciated! And thanks in advance :)
PS I’ve also played around with a formulation involving a loop in partial_sum
, which looks like the following. This model compiles and the samples run, but they produce absolute nonsense which I think must be because the loops not doing what I want it to (but again, don’t know what needs to change). I also assumed the loop would slow things down, hence wanting to go with the approach above!
functions{
real partial_sum(vector[] x_slice,
int start, int end,
vector y, vector beta, real sigma,
int K) {
int N = (end - start + 1);
vector[N] mu = rep_vector(0, N);
for (i in 1:K) {
mu += beta[i] * to_vector(x_slice[, i]);
}
return normal_lpdf(y[start:end] | mu, sigma);
}
}
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
int<lower=1> grainsize; // grainsize for threading
}
transformed data {
vector[N] x2[K]; //
for (i in 1:K) {
x2[i, ] = x[, i];
}
}
parameters {
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model{
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, x2, grainsize, y, beta, sigma, K);
}