Catastrophic performance drop with map_rect

lucasgautheron · May 7, 2023, 5:30pm

I’m modelling hundreds of time-series (each with thousands of observations) using a Hidden Markov Model, using a strategy described here.
However, this strategy is not efficient enough to process my whole dataset, and therefore I tried to parallelise the processing of each time-series, using stan’s map_rect.
Unfortunately, this has led to a dramatic performance drop which defeats this strategy altogether.
For instance, using an unrealistically short sample of data (2 time-series of 59 bins each), and pystan 3.7.0) the parallelised version is 40 times slower (2000 seconds vs 50 seconds !). CmdStan 2.32 shows similar performance.

Here are my questions:

Did I make any mistake in my attempt to parallelize the model (see below for the code before/after parallelization)? Am I using map_rect incorrectly?
Is there anything I could do to significantly improve the performance of the parallelized version?
If not, is there any alternative? (I had imagined that using 48 cores, parallelization could have yielded a x10 speedup that could have made my problem tractable)

Here are the unparallelised and parallelised versions of the code, for comparison:

Model 1 (unparallelised)

data {
    int<lower=1> n_children;
    int<lower=1> n_recs;
    int<lower=1> n_bins;
    int<lower=1> bins[n_recs];
    int a[n_recs,n_bins];
    int c[n_recs,n_bins];
    
    real<lower=0> age[n_recs];
}

parameters {
    matrix<lower=0,upper=1>[2,2] k[n_recs];
    real<lower=1> T_up;
    real<lower=1> T_down;
    real<lower=0> mu_chi;
    real<lower=0> mu_adu;
    
    vector<lower=0>[n_recs] epsilon_chi;
    vector<lower=0>[n_recs] epsilon_adu;
}

transformed parameters {
    matrix<lower=0,upper=1>[4,4] A;
    vector[4] logalpha[n_recs,n_bins];
    
    for (s1c in 1:2) {
        for (s2c in 1:2) {
            for (s1a in 1:2) {
                for (s2a in 1:2) {
                    int i = 1+(s1c-1)+2*(s1a-1);
                    int j = 1+(s2c-1)+2*(s2a-1);
                    
                    if (s1c == 1 && s2c == 2) {
                        A[i,j] = 1/T_up;
                    }
                    else if (s1c == 2 && s2c == 1) {
                        A[i,j] = 1/T_down;
                    }
                    else if (s1c == s2c && s1c==1) {
                        A[i,j] = 1-1/T_down;
                    }
                    else if (s1c == s2c && s1c==2) {
                        A[i,j] = 1-1/T_up;
                    }

                    if (s1a == 1 && s2a == 2) {
                        A[i,j] *= 1/T_up;
                    }
                    else if (s1a == 2 && s2a == 1) {
                        A[i,j] *= 1/T_down;
                    }
                    else if (s1a == s2a && s1a==1) {
                        A[i,j] *= 1-1/T_down;
                    }
                    else if (s1a == s2a && s1a==2) {
                        A[i,j] *= 1-1/T_up;
                    }
                }
            }
        }
    }

    {
        real accumulator[4];
        
        for (i in 1:n_recs) {
            logalpha[i, 1] = rep_vector(0,4);
            for (t in 2:n_bins) {
                for (s1 in 1:4) { // s1 = state at time t
                    for (s2 in 1:4) { // s2 = state at time t-1
                        accumulator[s2] = logalpha[i, t-1, s2] + log(A[s2,s1]);
                        accumulator[s2] += poisson_lpmf(c[i,t] | ((s1-1)%2)*(k[i,1,1]*c[i,t-1] + k[i,2,1]*a[i,t-1] + mu_chi) + epsilon_chi[i]);
                        accumulator[s2] += poisson_lpmf(a[i,t] | floor((s1-1)/2.0)*(k[i,2,2]*a[i,t-1] + k[i,1,2]*c[i,t-1] + mu_adu) + epsilon_adu[i]);
                    }
                    logalpha[i, t, s1] = log_sum_exp(accumulator);
                }
            }
        }
        
    }
}

model {
    for (i in 1:n_recs) {
        target += log_sum_exp(logalpha[i,n_bins]);
    }
    
    T_up ~ gamma(2, 0.01);
    T_down ~ gamma(2, 0.01);
    
    for (i in 1:2) {
        for (j in 1:2) {
            k[i,j] ~ uniform(0,1);
        }
    }
    mu_chi ~ normal(0,1);
    mu_adu ~ normal(0,1);

    epsilon_chi ~ normal(0,1);
    epsilon_adu ~ normal(0,1);
}

Model 2 (with map_rect)

functions {
    vector la(vector thetas, vector phis, array[] real x, array[] int y) {
        int n_bins = size(y)%/%2;
        array [4] real accumulator;
        vector[n_bins*4] lgalpha;

        lgalpha[1:4] = rep_vector(0, 4);
        
        for (t in 2:n_bins) {
            for (s1 in 1:4) { // s1 = state at time t
                for (s2 in 1:4) { // s2 = state at time t-1
                    accumulator[s2] = lgalpha[(t-2)*4+s2] + log(thetas[2+(s1-1)*4+s2]);
                    accumulator[s2] += poisson_lpmf(y[(t-1)*2+1] | ((s1-1)%2)*(phis[3]*y[(t-2)*2+1] + phis[5]*y[(t-2)*2+2] + thetas[1]) + phis[1]);
                    accumulator[s2] += poisson_lpmf(y[(t-1)*2+2] | floor((s1-1)/2.0)*(phis[6]*y[(t-2)*2+2] + phis[4]*y[(t-2)*2+1] + thetas[2]) + phis[2]);
                }
                lgalpha[(t-1)*4+s1] = log_sum_exp(accumulator);
            }
        }
        return lgalpha;
    }
}
data {
    int<lower=1> n_children;
    int<lower=1> n_recs;
    int<lower=1> n_bins;
    array [n_recs] int <lower=1> bins;
    array [n_recs, n_bins] int a;
    array [n_recs, n_bins] int c;
    
    array [n_recs] real<lower=0> age;
}

transformed data {
    array [n_recs,n_bins*2] int vocs;
    array [n_recs,1] real x;

    for (i in 1:n_recs) {
        x[i,1] = age[i];
        for (t in 1:n_bins) {
            vocs[i,(t-1)*2+1] = c[i,t];
            vocs[i,(t-1)*2+2] = a[i,t];
        }
    }
}

parameters {
    array [n_recs] matrix<lower=0,upper=1>[2,2] k;
    real<lower=1> T_up;
    real<lower=1> T_down;
    real<lower=0> mu_chi;
    real<lower=0> mu_adu;
    
    vector<lower=0>[n_recs] epsilon_chi;
    vector<lower=0>[n_recs] epsilon_adu;
}

transformed parameters {
    matrix<lower=0,upper=1>[4,4] A;
    vector[n_recs*n_bins*4] logalpha;
    
    for (s1c in 1:2) {
        for (s2c in 1:2) {
            for (s1a in 1:2) {
                for (s2a in 1:2) {
                    int i = 1+(s1c-1)+2*(s1a-1);
                    int j = 1+(s2c-1)+2*(s2a-1);
                    
                    if (s1c == 1 && s2c == 2) {
                        A[i,j] = 1/T_up;
                    }
                    else if (s1c == 2 && s2c == 1) {
                        A[i,j] = 1/T_down;
                    }
                    else if (s1c == s2c && s1c==1) {
                        A[i,j] = 1-1/T_down;
                    }
                    else if (s1c == s2c && s1c==2) {
                        A[i,j] = 1-1/T_up;
                    }

                    if (s1a == 1 && s2a == 2) {
                        A[i,j] *= 1/T_up;
                    }
                    else if (s1a == 2 && s2a == 1) {
                        A[i,j] *= 1/T_down;
                    }
                    else if (s1a == s2a && s1a==1) {
                        A[i,j] *= 1-1/T_down;
                    }
                    else if (s1a == s2a && s1a==2) {
                        A[i,j] *= 1-1/T_up;
                    }
                }
            }
        }
    }


    vector[18] thetas;
    thetas[1] = mu_chi;
    thetas[2] = mu_adu;
    for (s1 in 1:4) {
        for (s2 in 1:4) {
            thetas[2+(s1-1)*4+s2] = A[s1,s2];
        }
    }

    array[n_recs] vector[6] phis;
    for (i in 1:n_recs) {
        phis[i] = [epsilon_chi[i],epsilon_adu[i],k[i,1,1],k[i,1,2],k[i,2,1],k[i,2,2]]';
    }

    logalpha = map_rect(la, thetas, phis, x, vocs);
}

model {
    for (i in 1:n_recs) {
        int j = (i*n_bins-1)*4;
        target += log_sum_exp(logalpha[j+1:j+4]);
    }
    
    T_up ~ gamma(2, 0.01);
    T_down ~ gamma(2, 0.01);
    
    for (i in 1:2) {
        for (j in 1:2) {
            k[i,j] ~ uniform(0,1);
        }
    }
    mu_chi ~ normal(0,1);
    mu_adu ~ normal(0,1);

    epsilon_chi ~ normal(0,1);
    epsilon_adu ~ normal(0,1);
}

Thank you very much for your help!

Lucas

lucasgautheron · May 9, 2023, 1:06pm

Dear everyone,

Here is an update:

The implementation of map_rect seems semantically correct, since it yields the same results as the former implementation.
I’ve done some profiling for the un-parallelised version and the parallelised version (see below).
markov-chain is the time spent computing the contribution of each time-series to the log-likelihood basically; markov-chain-accumulator is how long was spent calculating accumulator as expressed in the models above; and markov-chain-lse is how long was spent computing the log_sum_exp of accumulator.

Here are the results:

Without map_rect:

hmm-20230509140032-profile.csv (428 Bytes)

With map_rect:

hmm_parallel-20230509140330-profile.csv (950 Bytes)

Does this make sense to anyone?
In particular
- why such an extreme difference in the amount of autodiff calls between the two versions?
- what is the meaning of the extreme unbalance between forward time and reverse time in the case of the model using map_rect?

Thank you for your help!

stevebronder · May 9, 2023, 2:54pm

I can’t comment on map_rect besides that its a bit older and using the MPI backend. Have you tried using reduce_sum instead? Tmk map_rect is preferred when working with a cluster, but if you are working on just a single computer and want to use threads reduce_sum should be friendlier.

lucasgautheron · May 9, 2023, 9:06pm

Thank you! Using reduce_sum seems to have solved the performance issue.
Unfortunately the computation of \log \alpha cannot be re-used with this strategy (if I’m not mistaken), and I do need \log \alpha (for some extra calculations in the generated quantities block which I have not shared). I guess it is alright if I just redo these calculations again post-hoc, considering the gains from parallelisation.

NB: i’m still interested about the cause of the issue with map_rect!

stevebronder · May 9, 2023, 10:06pm

map_rect has a lot more overhead and is meant for doing operations over a cluster of computers so imo it’s most likely that it’s just not the right tool for the job.

Code duplication is unfortunate, though it should not be too slow since the generated quantities are just working with doubles instead of autodiff types. Once you have all your draws it should be pretty reasonable to generate the quantities you want in a R or Python (which then should be easy to do in parallel).

Topic		Replies	Views
Map_rect in the Stan language (MPI is coming) Developers	21	2677	July 14, 2020
Linear, parallell regression CmdStan	89	5245	September 20, 2018
Multiple calls to map_rect Modeling performance	4	524	May 7, 2019
Map_rect causing substantial slowdown; trying to understand how to fix Modeling cmdstanr , paralellization	13	884	July 24, 2020
Using map_rect with multiple Gaussian processes (sampling time) Modeling paralellization	6	457	February 26, 2021

Catastrophic performance drop with map_rect

Related topics