 # Help with multi-threading a linear regression model and slicing the design matrix

tl;dr: I’ve got a multivariable linear regression example working which uses multi-threading to slice the data. I’d like to try slicing the design matrix instead (because it’s very big) but I’m struggling to produce a working example. Any help would be hugely appreciated!

Hi there! I’m having some trouble getting a multi-threading set-up for a linear regression working where I slice the design matrix instead of the data (I’ve got ~30 predictors and >200,000 observations so it’s pretty hefty).

So I’ve got this basic set-up where I slice the data and which works fine (though only ~1.5x speed-up for 3 threads so if that seems like poor performance and you have comments on this code, please say!):

``````functions {
real partial_sum(array[] real y_slice,
int start, int end,
matrix x, real alpha, vector beta, real sigma) {
return normal_lpdf( y_slice | x[start:end,:] * beta + alpha, sigma);
}
}

data {
int<lower=0> N;   // number of data items
int<lower=0> K;   // number of predictors
matrix[N, K] x;   // predictor matrix
array[N] real y;      // outcome vector
int<lower=1> grainsize; // grainsize for threading
}

parameters {
real alpha;           // intercept
vector[K] beta;       // coefficients for predictors
real<lower=0> sigma;  // error scale
}

model {
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta, sigma);
}
``````

I was inspired by this previous post to think about possibly slicing other parts of the model, specifically the design matrix (and avoid copying over the whole huge matrix into each threaded process).

I’m struggling currently to get something that even compiles though - particularly struggling to get all the types right to make it actually work. Does anyone have suggestions for how to modify the code below?

``````functions{
real partial_sum(vector[] x_slice,
int start, int end,
vector y, vector beta, real sigma) {
int N = (end - start + 1);
vector[N] mu = to_matrix(x_slice) * beta;
return normal_lpdf( y[start:end] | mu, sigma);
}
}

data {
int<lower=0> N;   // number of data items
int<lower=0> K;   // number of predictors
matrix[N, K] x;   // predictor matrix
vector[N] y;      // outcome vector
int<lower=1> grainsize; // grainsize for threading
}

transformed data {
row_vector[N] x2[K]; //
for (i in 1:K) {
x2[i, ] = to_row_vector(x[, i]);
}
}

parameters {
vector[K] beta;       // coefficients for predictors
real<lower=0> sigma;  // error scale
}

model{
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, x2, grainsize, y, beta, sigma);
}
``````

Any help would be hugely appreciated! And thanks in advance :)

PS I’ve also played around with a formulation involving a loop in `partial_sum`, which looks like the following. This model compiles and the samples run, but they produce absolute nonsense which I think must be because the loops not doing what I want it to (but again, don’t know what needs to change). I also assumed the loop would slow things down, hence wanting to go with the approach above!

``````functions{
real partial_sum(vector[] x_slice,
int start, int end,
vector y, vector beta, real sigma,
int K) {
int N = (end - start + 1);
vector[N] mu = rep_vector(0, N);
for (i in 1:K) {
mu += beta[i] * to_vector(x_slice[, i]);
}
return normal_lpdf(y[start:end] | mu, sigma);
}
}

data {
int<lower=0> N;   // number of data items
int<lower=0> K;   // number of predictors
matrix[N, K] x;   // predictor matrix
vector[N] y;      // outcome vector
int<lower=1> grainsize; // grainsize for threading
}

transformed data {
vector[N] x2[K]; //
for (i in 1:K) {
x2[i, ] = x[, i];
}
}

parameters {
vector[K] beta;       // coefficients for predictors
real<lower=0> sigma;  // error scale
}

model{
beta ~ normal(0, 1);
sigma ~ exponential(1);
target += reduce_sum(partial_sum, x2, grainsize, y, beta, sigma, K);
}``````

Hi there!

1 Like

Oh nevermind, you cant slice on beta in this model I think?

Hi @rok_cesnovar, thanks for responding!

Re your 1st comment, my take-away from Reduce Sum: A Minimal Example was that slicing on the data variable (`n_redcards` in the linked example) provided a significant speed-boost - have I misunderstood? (Note I do get a speed-boost when doing this, in the first code example I linked) - I agree that I don’t think it’ll be possible to slice on `beta` in this example.

Re your 2nd comment, are you suggesting to replace the creation of `mu` and the `normal_lpdf` call with `normal_id_glm`?

Note as soon as I posted the above, I saw I could change the `functions` block so that `x_slice` was declared as an array of row-vectors. The code now reads as follows:

``````functions{
real partial_sum(array[] row_vector x_slice,
int start, int end,
vector y, vector beta, real sigma) {
int N = (end - start + 1);
vector[N] mu = alpha + to_matrix(x_slice) * beta;
return normal_lpdf( y[start:end] | mu, sigma);
}
}
``````

The code now compiles but I’m not able to run anything, getting the following error:

``````Chain 1 Initialization between (-2, 2) failed after 100 attempts.
Chain 1  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Chain 1 Initialization failed.
Warning: Chain 1 finished unexpectedly!
``````

Any idea what the problem might be? I can try providing initial values myself, but I get the sense that this error message is masking a deeper issue - given it’s a super simple linear regression and I haven’t had any problems with the initial conditions on e.g. the other example where I slice `y` instead.

Also (and apologies if this is poor forum etiquette, I’m hoping not!) tagging @andrjohns and @wds15 here as they were incredibly helpful on the post this attempt was inspired by! :)

I should have explained myself a bit more. The speedup in the Redcard example doesnt really come from the fact that they sliced `n_redcards`. There isnt any parameters to slice there, much like in your model, so slicing over data has to do. I just wanted to clarify that data is not copied in reduce_sum for each thread. The copies are only done for parameters.

Yeah, I think this model should actually work for your case:

``````data {
int<lower=0> N;   // number of data items
int<lower=0> K;   // number of predictors
matrix[N, K] x;   // predictor matrix
vector[N] y;      // outcome vector
}

parameters {
vector[K] beta;       // coefficients for predictors
real<lower=0> sigma;  // error scale
}

model{
beta ~ normal(0, 1);
sigma ~ exponential(1);
y ~ normal_id_glm(x, 0, beta, sigma);
}
``````

The GLM functions can also then be further parallelized, but they provide a lot of speedup on their own as well.

1 Like

I’ve been using a simple multiple regression model with biggish fake data to test reduce_sum performance with different versions of cmdstan and different hardware. Following a suggestion that I think was made here I found that using a dummy variable to slice over works pretty well. Here’s the complete code:

``````functions {
real partial_sum(int[] ind, int start, int end, vector y, matrix X,
real a, vector b, real sigma) {
return normal_id_glm_lpdf(y[start:end] | X[start:end,], a, b, sigma);
}
}
data {
int<lower=0> N;
int<lower=0> M;
matrix[N, M] X;
vector[N] y;
}
transformed data {
int grainsize = 1;
int ind[N] = rep_array(1, N);
}
parameters {
real a;
real<lower=0> sigma;
vector[M] b;
}
model {
a ~ normal(0, 5);
sigma ~ normal(0, 10);
b ~ std_normal();

target += reduce_sum(partial_sum, ind, grainsize, y, X, a, b, sigma);
}
``````

For reference here’s the R code to generate some fake data.

``````fake_multi <- function(N=100000, M=50, alpha=1, sigma=0.25) {
X <- matrix(rnorm(N*M), N, M)
beta <- rnorm(M)
y <- as.vector(alpha + X %*% beta + rnorm(N, sd=sigma))
data_list <- list(N=N, M=M, X=X, y=y)
list(data_list=data_list, beta=beta)
}
``````

I deliberately made all this as geometrically uncomplicated as possible to get a pure test of threading performance. I was a little surprised that on a Linux PC with an old I7 processor with 4 physical cores and 8 threads I got a better than factor 2 speedup using 4 chains and 2 threads per chain.

2 Likes

Thanks again for all your help @rok_cesnovar - so where is the speed-up actually coming from then? The example I linked heavily implies it’s coming from slicing up `n_redcards` (I can’t see any other differences between the code that I would have expected to speed it up, but you’re far more familiar with Stan than I am). Do you mean the actual slicing of the data itself doesn’t give you a big speed-up, but instead it’s the division of the `normal_lpdf` calls across the threads providing the speed-boost? Or something else entirely?

Re the point about data not being copied in `reduce_sum` for each thread, I’d formed that impression from this comment and this one - have I misunderstood/missed something or are those comments not quite right?

The model works really, really well, and even better when I parallelise further! Is there a multivariate-normal equivalent of this (for a slightly more complex model I need to develop)?

Thanks again for all your help so far! Hugely appreciate it! :)

This is awesome, thank you very much! I’ll have a play around with implementing it myself, much appreciated!

Out of interest, are you noticing significant benefits in performance over slicing just one of the observations (`y`, as I’m doing above) or the predictors (`X`, as I was trying to do)?

Out of interest, are you noticing significant benefits in performance over slicing just one of the observations ( `y` , as I’m doing above) or the predictors ( `X` , as I was trying to do)?

It looks like I tried a couple different ways to code the same model formulation in the `partial_sum()` function with slicing on `y` and what I tried was significantly slower than the version I pasted in my last comment. I don’t have complete records of what I did though, which was anyway well over a year ago and using cmdstan 2.23.0.

This is what I was trying to say yes. I was just phrasing it badly.

Yay!

Unfortunately not.

Another clarification, the slicing of input arguments is most likely only going to have a noticeable impact when you’re slicing parameters, not data (parameters require more copying/processing)

1 Like