Linear, parallell regression


#61

Do you mean e.g. in a regression context, providing number of shards equal to number of observations? And then automatically combining them into the right size?

If that is possible, users might not need to worry about padding out missing values in shards. That will make map_rect quite a bit easier to use with real data.


#62

Something like that. If the users gives to stan a lot of small jobs, we should be able to pack these cleverly together. I have never done this, but it sounds like a doable task to me.


#63

So I re-ran the linear model with a more granular spacing of shards to try to characterise the curves better. Here is the table of results and a plot:

time_in_ms order shards time_linear ratio
3 4 50 42,401 0.0001
24,834 4 100 42,401 0.586
18,222 4 200 42,401 0.430
23,384 4 300 42,401 0.551
20,959 4 400 42,401 0.494
47,334 4 500 42,401 1.116
60,061 4 600 42,401 1.416
22,590 4 700 42,401 0.533
31,672 4 800 42,401 0.747
33,686 4 900 42,401 0.794
61,607 4 1,000 42,401 1.453
34,010 4 1,100 42,401 0.802
26,374 4 1,200 42,401 0.622
83,719 4 1,300 42,401 1.974
35,303 4 1,400 42,401 0.833
37,058 4 1,500 42,401 0.874
56,689 4 2,000 42,401 1.337
31,483 4 2,500 42,401 0.743
95,567 4 3,000 42,401 2.254
88,839 4 4,000 42,401 2.095
262,206 4 5,000 42,401 6.184
333,721 5 50 646,038 0.517
442,397 5 100 646,038 0.685
82,083 5 200 646,038 0.127
201,167 5 300 646,038 0.311
113,416 5 400 646,038 0.176
683,850 5 500 646,038 1.059
85,219 5 600 646,038 0.132
589,495 5 700 646,038 0.912
94,406 5 800 646,038 0.146
481,697 5 900 646,038 0.746
379,561 5 1,000 646,038 0.588
86,849 5 1,100 646,038 0.134
194,543 5 1,200 646,038 0.301
498,565 5 1,300 646,038 0.772
185,828 5 1,400 646,038 0.288
467,989 5 1,500 646,038 0.724
131,470 5 2,000 646,038 0.204
122,695 5 2,500 646,038 0.190
661,582 5 3,000 646,038 1.024
159,023 5 4,000 646,038 0.246
141,625 5 5,000 646,038 0.219

Time vs shards:

Ratio vs shards:
test
There is alot of variation it seems


#64

@Ole_Petter_Hansen - the tutorials from StanCon Helsinkin are online. The ODE tutorial included map_rect example:

Edit: on my laptop the demo ODE takes 499seconds in normal execution - 94 seconds in parallel.


#65

Thanks for reporting this!


#66

Ran the ODE on the desktop just now also:
Serial model: 290 sec
Map_rect using only 1 logical core: 227 sec
Map_rect using all logical cores (24): 51 sec


#67

So guys, I’ve had a go at my own map_rect understanding the code a bit better. Alas I don’t understand it quite well enough. I have a multivariate normal linear model and since I want to work it on a big dataset with lots of X’s and Y’s, so I’m trying to map_rect it. Here is the model along with a script to generate some toy data and a script to run it:
gen_test_data_y2_x3.R (532 Bytes)

run_test_model.R (378 Bytes)

linear_multivariate_new.stan (1.1 KB)

The first point of confusion - it has a for loop in the model block and the transformed data block already so I was unsure which one to apply map_rect to. Because I thought it might be easier I tried to apply it to the model block. I also decided to simply call the map_rect for every row of data to avoid dealing with indexing and because for loops lend themselves to it. However, I’m now stuck on how to define x_r and x_i and confused. My best effort is here:

functions {
    vector mult_norm_par(vector mu_L, vector theta, real[] x_r, int[] x_i) {
        vector[NvarsY] = mu_L[1];
        matrix[NvarsY, NvarsY] L_Sigma = mu_L[2];
    
        real lp = multi_normal_cholesky_lpdf(y | mu, L_Sigma)
        return [lp]';
    }
}
data{
    int<lower=0> NvarsX; // num independent variables
    int<lower=0> NvarsY; // num dependent variables
    int<lower=0> N;     //Total number of rows

    matrix[N, NvarsX] x;
    vector[NvarsY] y [N];
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma; //Residual SD of outcome variables

    vector[NvarsY] Beta0;         // Intercepts
    vector[NvarsX] Beta [NvarsY]; // Slopes
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = Beta0[dv] + dot_product(to_vector(Beta[dv, 1:NvarsX]), x[i, 1:NvarsX]);
        }
    }
    
    // vars for map_rect
    vector theta[N];
    // x_r ??? confused
    // x_i ??? confused
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ cauchy(0,5);

    //for (i in 1:N){
    //    y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    //}
    target +=sum(map_rect(  mult_norm_par,
                            append_row(mu, L_Sigma)),
	                        theta,
		                    x_r,
		                    x_i) );
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(L_Omega, L_sigma);
}

linear_multivariate_map_rect.stan (1.7 KB)

Any help appreciated!


#68

Well I tried to fill in some of the missing bits but I’m having PARSER EXPECTED compile errors at line 4. Can’t figure out the solution.

functions {
    vector mult_norm_par(vector mu_L, vector theta, real[] x_r, int[] x_i) {
        int NvarsY = x_i[2];
        real[NvarsY] mu = mu_L[1];
        matrix[NvarsY, NvarsY] L_Sigma = mu_L[2];

        real lp = multi_normal_cholesky_lpdf(x_r | mu, L_Sigma)
        return [lp]';
    }
}
data{
    int<lower=0> NvarsX; // num independent variables
    int<lower=0> NvarsY; // num dependent variables
    int<lower=0> N;     //Total number of rows

    matrix[N, NvarsX] x;
    vector[NvarsY] y [N];
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma; //Residual SD of outcome variables

    vector[NvarsY] Beta0;         // Intercepts
    vector[NvarsX] Beta [NvarsY]; // Slopes
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = Beta0[dv] + dot_product(to_vector(Beta[dv, 1:NvarsX]), x[i, 1:NvarsX]);
        }
    }
    
    // vars for map_rect
    vector theta[N];
    real x_r[N, NvarsX];
    int x_i[N, 2];
    {
        for (n in 1:N) {
          x_r[n] = to_array_1d(y[n]);
          x_i[n, 1] = n;
          x_i[n, 2] = NvarsY;
        }
    }
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ cauchy(0,5);

    //for (i in 1:N){
    //    y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    //}
    target +=sum(map_rect( mult_norm_par, append_row(mu, L_Sigma)), theta, x_r, x_i) );
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(L_Omega, L_sigma);
}

#69

This line needs a ;


#70

Thanks @stijn! Fixed that now but it doesn’t fix my error :(


#71

I don’t think that’ll ever be the case unless the model is very simple. The samplers don’t do much and the time to calculate the log density always dominates almot all of the computation time.

I wouldn’t expect that to work well.

To keep the processors busy, communication time can’t dominate. So 100K shards for 100K data points is not ideal. Also, just the number of CPU shards may not be ideal if they take different amounts of time to execute, as everything’s synchronized on getting the last result back.

There may also be page sizes having to do with memory communication, too.

There will be lots of hardware specific issues, like how fast memory and transfer is and how fast the CPUs are. With modern CPUs, if the whole CPU isn’t under a lot of load, they overclock them relative to multi-core performance (I think it’s called “turbo boost” :-))

Each level up you have to go the data (memory is beyond L3), means more latency getting into the CPU. I’m not sure what the ratio of transfer speeds are, but the modern CPUs are good at streaming data through if it’s memory local in memory. If we start paging in and out of cache in a way that’s not streaming memory local, then that slows down processors, too.

It’s going to depend on the model in terms of how much computation is done per shard. It’s keeping all the pipelines full that’s the key part. CPUs can’t do anything without data to work on.

Are those plots just for single runs?

We’re applying for some Amazon AWS grants to see just how well we can scale for various models. And we have some algorithm evals to do on hundreds of models, to boot.

Nice. Thanks for reporting. This is the use case we really built MPI for—large computations relative to the amount of data going back and forth.


#72

Yeah single runs. I have some deadlines coming up so I didnt’ get change to do multiple runs but I can do so if you wish later in the month.

No problem. I think I said somewhere above - I’m a novice at stan and more-so at map_rect - I’ve have difficulty to follow the models. But I have a 12core/24thread desktop that I’m generally happy to run some benchmarks on if its helpful to the effort and I’m not using it for work stuff.
I’m trying to learn this stuff but as I work on it alone its a tough learning curve. i’d be grateful for any advice re my model in post 67! :)


#73

So I’ve made some changes but I’m still running into problems with passing the variables into the mult_norm_par function. My code now looks like this:

functions {
    vector mult_norm_par(vector mu_L, vector theta, real[] x_r, int[] x_i) {
        int NvarsY = x_i[2];
        vector[NvarsY] mu = mu_L[1];
        matrix[NvarsY, NvarsY] L_Sigma = mu_L[2];

        real lp = multi_normal_cholesky_lpdf(x_r | mu, L_Sigma);
        return [lp]';
    }
}
data{
    int<lower=0> NvarsX; // num independent variables
    int<lower=0> NvarsY; // num dependent variables
    int<lower=0> N;     //Total number of rows

    matrix[N, NvarsX] x;
    vector[NvarsY] y [N];
}

parameters{
    cholesky_factor_corr[NvarsY] L_Omega; //For the correlations between outcome variables
    vector<lower=0>[NvarsY] L_sigma; //Residual SD of outcome variables

    vector[NvarsY] Beta0;         // Intercepts
    vector[NvarsX] Beta [NvarsY]; // Slopes
}

transformed parameters{
    vector[NvarsY] mu[N];
    matrix[NvarsY, NvarsY] L_Sigma;

    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);

    for (i in 1:N){
        for (dv in 1:NvarsY){
            mu[i, dv] = Beta0[dv] + dot_product(to_vector(Beta[dv, 1:NvarsX]), x[i, 1:NvarsX]);
        }
    }
    
    // vars for map_rect
    vector theta[N];
    real x_r[N, NvarsX];
    int x_i[N, 2];
    {
        for (n in 1:N) {
          x_r[n] = to_array_1d(y[n]);
          x_i[n, 1] = n;
          x_i[n, 2] = NvarsY;
        }
    }
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ cauchy(0,5);

    //for (i in 1:N){
    //    y[i, 1:NvarsY] ~ multi_normal_cholesky(mu[i, 1:NvarsY], L_Sigma);
    //}
    target +=sum(map_rect( mult_norm_par, append_row(to_vector(mu[, 1:NvarsY]), L_Sigma)), theta, x_r, x_i) );
}

generated quantities{
    matrix[NvarsY, NvarsY] Omega;
    matrix[NvarsY, NvarsY] Sigma;
    Omega = multiply_lower_tri_self_transpose(L_Omega);
    Sigma = quad_form_diag(L_Omega, L_sigma);
}

When I run this I’m getting the error:

variable definition base type mismatch, variable declared as base type: vector variable definition has base: real error in ‘examples/map_rect/linear_multivariate_map_rect.stan’ at line 4, column 36


 2:     vector mult_norm_par(vector mu_L, vector theta, real[] x_r, int[] x_i) {
 3:         int NvarsY = x_i[2];
 4:         vector[NvarsY] mu = mu_L[1];
                                       ^
 5:         matrix[NvarsY, NvarsY] L_Sigma = mu_L[2];

I’'ve tried a few variations, but each time I’m running into a problem unpacking elements of mu_L


#74

mu_L is declared to be a vector. So mu_L[1] is a real value. You are trying to assign it to mu, which is delcared to be a vector. That’s not going to work. Same for the matrix. I don’t know what your intent is, though, so it’s hard to advise further.


#75

Hi Bob, thanks for your reply.

I’m trying to pass 1 row of mu from the transformed parameters block, and the L_Sigma for the multi_normal_choleskpy_lpdf function in the map_rect function. I’m trying to do this by calling map_rect with append_row(to_vector(mu[, 1:NvarsY]), L_Sigma)) assigned as the mu_L variable in the mult_norm_par function. I was trying to emulate the approach taken by Ole in his earlier code: Linear, parallell regression

Have I misunderstood something?


#76

I can’t speak to that, but the code you pasted has the problem I mentioned—you can’t assign mu_L[1] (a scalar) to mu (a vector). Stan requires vectors to be assigned to vectors, scalars to scalars, etc.


#77

Ok. So then my point of misunderstanding here is the mu_L[1] is a scalar. I thought it should be a vector. So I’m calling making assigning mu_L as append_row(to_vector(mu[, 1:NvarsY]), L_Sigma)).

Is mu_L[1] not equal to the first element of that-i.e.: to_vector(mu[, 1:NvarsY]). This is what I imagined I was doing in any case.


#78

mu_L[1] is the first element of the vector mu_L. When you declare vector mu_L, it’s saying that mu_L is a vector. Did you mean to delcare mu_L as an array of vectors? That’s vector[].


#79

Aha! Yes ok that was my point of confusion so! Thank you very much. So that fixed the error on that line.

If I might get your help to clarify this point too think I will be ok -> now I’m getting an error on the next line:
matrix[NvarsY, NvarsY] L_Sigma = mu_L[2];

So this is throwing an error because I am attempting to pass a matrix to a vector (this error I was expecting). L_sigma is a triangular matrix. How can I collapse that to a vector to pass into map_rect, and then reconstruct into the correct triangular matrix ?

I can guess that I could do to_vector(L_sigma) in the call to map_rect, but how then do I correctly reconstruct L_sigma from a vector in my user defined function? NvarsY, the dimensions of L_sigma, is already passed to the function. I don’t know two things here:

  1. If I do to_vector does that collapse a matrix by row or by column?
  2. If I extract that vector from `mu_L[2]’, how to I reshape it back into L_sigma ?

#80

Actually - I think I figured it out @Bob_Carpenter. I found the function reference manual and learned that to_vector is column-wise, and I can use the function to_matrix to rebuild a matrix, columnwise.

Thanks for your help. I’m off to figure out the next error on my own if I can :)