Fitting the same model to many datasets

I have a use case where I’m fitting a simple linear regression model to many datasets. The code below works, but for fitting 1,000s or 10,000s of datasets I’d like to optimize it to make better use of, e.g., a multi-core CPU or a GPU. In particular, I’d appreciate any advice/tricks for optimizing the for-loop over datasets. Thanks!

data {
    int<lower=0> N_y; // Few 100s observations
    int<lower=0> N_x; // Few 10s predictors
    int<lower=0> N_datasets; // Few 1,000s or 10,000s datasets
    array[N_datasets] vector[N_y] y;
    array[N_datasets] matrix[N_y, N_x] X;
}
parameters {
    array[N_datasets] real alpha;
    array[N_datasets] vector[N_x] beta;
    array[N_datasets] real<lower=0> sigma;
}
model {
    for (i in 1:N_datasets) {
        y[i] ~ normal(alpha[i] + X[i] * beta[i], sigma[i]);
    }
}

Hi and welcome. As far as I know, that would have to be done at the interface level, so that the Stan program works like any other function within it. I have used Python’s multiprocessing library and the map function to do exactly that: run the same GLM for different datasets. I have also set up multiple parallel runs using CmdStan by writing a shell script with all the analyses and running them in a terminal in the background (or using a job manager like Slurm).
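For illustration, a minimal sketch of the multiprocessing pattern (assuming CmdStanPy; model.stan and the datasets list of Stan data dicts are placeholders):

from multiprocessing import Pool
from cmdstanpy import CmdStanModel

# Compile once up front so the workers only run the sampler
model = CmdStanModel(stan_file='model.stan')

def fit_one(data):
    # data is a single dataset as a Stan data dict
    return model.sample(data=data, chains=4)

if __name__ == '__main__':
    with Pool(4) as pool:  # e.g. one worker per CPU core
        fits = pool.map(fit_one, datasets)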


You could redefine that for-loop with matrix algebra.

For running over multiple datasets, you could set up your script to run in parallel within a single Python run, or write a Python script with a small CLI that fits one specific dataset and extracts and saves what you need, and then call that script from Python with subprocess and a thread pool (see the sketch below).

I recommend pre-compiling the Stan model.

Similar stuff can be done with R.
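A rough sketch of that CLI pattern (untested; fit_one.py, the data/output file names, and n_datasets are placeholders for illustration):

# fit_one.py: fit one dataset given on the command line
import argparse
from cmdstanpy import CmdStanModel

parser = argparse.ArgumentParser()
parser.add_argument('data_file')  # JSON file holding one Stan data dict
parser.add_argument('out_dir')    # where this fit writes its CSVs
args = parser.parse_args()

fit = CmdStanModel(stan_file='model.stan').sample(data=args.data_file, output_dir=args.out_dir)
fit.summary().to_csv(args.out_dir + '/summary.csv')  # extract and save what you need

# driver.py: one subprocess call per dataset, throttled by a thread pool
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(i):
    subprocess.run(['python', 'fit_one.py', f'data_{i}.json', f'out_{i}'], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_one, range(n_datasets)))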

Thanks very much for your answers @caesoma and @ahartikainen! So it sounds like the best approach is probably to have the Stan program process a single dataset, and then manage the parallelism over datasets externally in Python or R?

On your point about redefining the for-loop with matrices, @ahartikainen: my understanding is that the X[i] * beta[i] piece is already doing a matrix multiplication (with each X[i] a matrix and each beta[i] a column vector). I’d love to know if the loop can be redefined in a more efficient way, but being a Stan novice I was struggling to see how, given that what I’m really trying to do is a matrix multiplication over the last 2 indices of the 3-dimensional X (a special case handled by e.g. numpy.matmul).
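For concreteness, the batched behaviour I mean in NumPy (the shapes here just mirror the sizes in the model above):

import numpy as np

X = np.random.rand(1000, 200, 20)   # (N_datasets, N_y, N_x)
beta = np.random.rand(1000, 20, 1)  # (N_datasets, N_x, 1)
mu = np.matmul(X, beta)             # (1000, 200, 1): one X[i] @ beta[i] per dataset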

Yes, that’s it. Since an external program is needed to call Stan in any case, when the datasets are fit independently it’s easiest to use the interface’s tools to run them in parallel.

I think @ahartikainen was referring to redefining the loop as a matrix operation. In some interfaces that could come with built-in parallelization of the entry-wise multiplication, which would be similar to what you already implemented within Stan, and maybe less cumbersome than using Python multiprocessing’s map or starmap, for instance.

The X * beta matrix operation itself is already optimized through the numeric libraries. If an individual run is computationally intensive it may be worth looking into speeding up the program, but I doubt that would be comparable to the 10- or 100-fold (or however many CPUs you have available) speed-up from parallelization.

Hi @caesoma, sorry to open this topic again, but I was wondering if you would please elaborate on how you used Python’s multiprocessing with cmdstanpy?

EDIT: The error described below seems to occur only when output_dir is specified.

For example, the code below seems to work fine, and I can manually inspect the generated CSV files, etc.

from multiprocessing import Pool
from cmdstanpy import CmdStanModel

def sample_model(stan_file=None, exe_file=None, data=None):
    # each worker samples one dataset with the pre-compiled model
    return CmdStanModel(stan_file=stan_file, exe_file=exe_file).sample(
        data=data, chains=4, parallel_chains=1, validate_csv=False, output_dir='tmp')

with Pool(n_processes) as p:
    res = p.starmap(sample_model, [('model.stan', 'model', data) for data in datasets])

However, when I set validate_csv=True, I get the following error:

ValueError: line 47: expecting metric, found:
	 "# Step size = 0.034296"

I have verified that I don’t have this error if I sample from the model normally without using map.

The cmdstanpy documentation seems to suggest that validate_csv is optional and can be turned off (for example to speed up processing of large samples), but it seems that functions like .stan_variable() don’t work without it because .column_names is empty.

Is there some other way to access the variables when validate_csv=False, or do you have any other suggestions to resolve the validate_csv error?

Thanks!


Interesting, it sounds like the classes are holding references to the same object?

An update to the above:

The issue seems to occur only when output_dir is set when sampling. It works when this is left as the default.


Ah, ok, then it’s trying to read files written by another process.

I recommend either defining a unique output path for each process or letting cmdstanpy pick one automatically.
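For example, a sketch of the unique-path fix (untested; the run_id argument is just illustrative):

import os
from multiprocessing import Pool
from cmdstanpy import CmdStanModel

def sample_model(stan_file, exe_file, data, run_id):
    # a separate output directory per fit, so processes never read each other's CSVs
    out_dir = os.path.join('tmp', f'run_{run_id}')
    os.makedirs(out_dir, exist_ok=True)
    return CmdStanModel(stan_file=stan_file, exe_file=exe_file).sample(
        data=data, chains=4, parallel_chains=1, output_dir=out_dir)

with Pool(n_processes) as p:
    res = p.starmap(sample_model, [('model.stan', 'model', d, i) for i, d in enumerate(datasets)])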


That makes sense, thanks!