Fitting the same model to many datasets

I have a use case where I’m fitting a simple linear regression model to many datasets. The code below works, but for fitting 1,000s or 10,000s of datasets I’d like to optimize it to make better use of, e.g., a multi-core CPU or a GPU. In particular, I’d appreciate any advice/tricks for optimizing the for-loop over datasets. Thanks!

data {
    int<lower=0> N_y; // Few 100s observations
    int<lower=0> N_x; // Few 10s predictors
    int<lower=0> N_datasets; // Few 1,000s or 10,000s datasets
    array[N_datasets] vector[N_y] y;
    array[N_datasets] matrix[N_y, N_x] X;
}
parameters {
    array[N_datasets] real alpha;
    array[N_datasets] vector[N_x] beta;
    array[N_datasets] real<lower=0> sigma;
}
model {
    for (i in 1:N_datasets) {
        y[i] ~ normal(alpha[i] + X[i] * beta[i], sigma[i]);
    }
}

Hi and welcome. As far as I know, that would have to be done at the interface level, so that the Stan program works like any other function within it. I have used Python’s multiprocessing library and the map function to do exactly that: run the same GLM for different datasets. I have also set up multiple parallel runs using CmdStan by writing a shell script with all the analyses and running them in a terminal in the background (or using a job manager like Slurm).
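For illustration, a minimal sketch of the multiprocessing pattern (assuming CmdStanPy; model.stan and the datasets list of Stan data dicts are placeholders):

from multiprocessing import Pool
from cmdstanpy import CmdStanModel

# Compile once up front so the workers only run the sampler
model = CmdStanModel(stan_file='model.stan')

def fit_one(data):
    # data is a single dataset as a Stan data dict
    return model.sample(data=data, chains=4)

if __name__ == '__main__':
    with Pool(4) as pool:  # e.g. one worker per CPU core
        fits = pool.map(fit_one, datasets)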


You could redefine that for-loop with matrix algebra.

For running over multiple datasets, you could set up your script to run in parallel within a single Python run, or write a Python script with a small CLI that fits one specific dataset and extracts and saves what you need, and then call that script from Python with subprocess and a thread pool (see the sketch below).

I recommend pre-compiling the Stan model.

Similar stuff can be done with R.
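A rough sketch of that CLI pattern (untested; fit_one.py, the data/output file names, and n_datasets are placeholders for illustration):

# fit_one.py: fit one dataset given on the command line
import argparse
from cmdstanpy import CmdStanModel

parser = argparse.ArgumentParser()
parser.add_argument('data_file')  # JSON file holding one Stan data dict
parser.add_argument('out_dir')    # where this fit writes its CSVs
args = parser.parse_args()

fit = CmdStanModel(stan_file='model.stan').sample(data=args.data_file, output_dir=args.out_dir)
fit.summary().to_csv(args.out_dir + '/summary.csv')  # extract and save what you need

# driver.py: one subprocess call per dataset, throttled by a thread pool
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(i):
    subprocess.run(['python', 'fit_one.py', f'data_{i}.json', f'out_{i}'], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_one, range(n_datasets)))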

Thanks very much for your answers @caesoma and @ahartikainen! So it sounds like the best approach is probably to have the Stan program process a single dataset, and then manage the parallelism over datasets externally in Python or R?

On your point about redefining the for-loop with matrices, @ahartikainen: my understanding is that the X[i] * beta[i] piece is already doing a matrix multiplication (with each X[i] a matrix and each beta[i] a column vector). I’d love to know if the loop can be redefined in a more efficient way, but being a Stan novice I was struggling to see how, given that what I’m really trying to do is a matrix multiplication over the last 2 indices of the 3-dimensional X (a special case handled by e.g. numpy.matmul).
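For concreteness, the batched behaviour I mean in NumPy (the shapes here just mirror the sizes in the model above):

import numpy as np

X = np.random.rand(1000, 200, 20)   # (N_datasets, N_y, N_x)
beta = np.random.rand(1000, 20, 1)  # (N_datasets, N_x, 1)
mu = np.matmul(X, beta)             # (1000, 200, 1): one X[i] @ beta[i] per dataset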

Yes, that’s it. Since an external program is needed to call Stan in any case, when the datasets are fit independently it’s easiest to use the interface’s tools to run them in parallel.

I think @ahartikainen was referring to redefining the loop as a matrix operation. In some interfaces that could come with built-in parallelization of the entry-wise multiplication, which would be similar to what you already implemented within Stan, and maybe less cumbersome than using Python multiprocessing’s map or starmap, for instance.

The X * beta matrix operation itself is already optimized through the numeric libraries. If an individual run is computationally intensive it may be worth looking into speeding up the program, but I doubt that would be comparable to the 10- or 100-fold (or however many CPUs you have available) speed-up from parallelization.

Hi @caesoma, sorry to open this topic again, but I was wondering if you would please elaborate on how you used Python’s multiprocessing with cmdstanpy?

EDIT: The error described below seems to occur only when output_dir is specified.

For example, the code below seems to work fine, and I can manually inspect the generated CSV files, etc.

from multiprocessing import Pool
from cmdstanpy import CmdStanModel

def sample_model(stan_file=None, exe_file=None, data=None):
    # each worker samples one dataset with the pre-compiled model
    return CmdStanModel(stan_file=stan_file, exe_file=exe_file).sample(
        data=data, chains=4, parallel_chains=1, validate_csv=False, output_dir='tmp')

with Pool(n_processes) as p:
    res = p.starmap(sample_model, [('model.stan', 'model', data) for data in datasets])

However, when I set validate_csv=True, I get the following error:

ValueError: line 47: expecting metric, found:
	 "# Step size = 0.034296"

I have verified that I don’t have this error if I sample from the model normally without using map.

The cmdstanpy documentation seems to suggest that validate_csv is optional and can be turned off (for example to speed up processing of large samples), but it seems that functions like .stan_variable() don’t work without it because .column_names is empty.

Is there some other way to access the variables when validate_csv=False, or do you have any other suggestions to resolve the validate_csv error?

Thanks!


Interesting, it sounds like the classes are holding references to the same object?

An update to the above:

The issue seems to occur only when output_dir is set when sampling. It works when this is left as the default.


Ah, ok, then it’s trying to read files written by another process.

I recommend either defining a unique output path for each process or letting cmdstanpy pick one automatically.
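For example, a sketch of the unique-path fix (untested; the run_id argument is just illustrative):

import os
from multiprocessing import Pool
from cmdstanpy import CmdStanModel

def sample_model(stan_file, exe_file, data, run_id):
    # a separate output directory per fit, so processes never read each other's CSVs
    out_dir = os.path.join('tmp', f'run_{run_id}')
    os.makedirs(out_dir, exist_ok=True)
    return CmdStanModel(stan_file=stan_file, exe_file=exe_file).sample(
        data=data, chains=4, parallel_chains=1, output_dir=out_dir)

with Pool(n_processes) as p:
    res = p.starmap(sample_model, [('model.stan', 'model', d, i) for i, d in enumerate(datasets)])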


That makes sense, thanks!