Reduce_sum parallelisation issue

Hi, first of all thanks for all your work.
I have a question regarding within-chain parallelisation using the function reduce_sum, because I see really strange behaviour: the model with parallelisation is much slower than the model without it when run with cmdstanr.
My model is a liability threshold model that aims to estimate variance components using individuals clustered in families. Liabilities are simulated from a truncated multivariate normal distribution, so each individual has an underlying liability distribution, which slows the program down. The likelihood is calculated by summing the contribution of each family rather than of each individual. The model works fine, but I wanted to improve speed through within-chain parallelisation (for large sample sizes and to run simulations) by summing the partial sums from the families; however, I get the opposite result. I post the code below; do you have any idea what causes this behaviour?


// log-likelihood based on a truncated multivariate normal distribution.
// families are sliced so that the computation is parallelised.
functions {
  real partial_sum(int[] fam_slice, int start, int end,
                   vector X, matrix Sigma, int[] ni, int[] firstfam,
                   vector lb, vector ub, vector lb_ind, vector ub_ind,
                   vector u, int NFAM, int minff, int maxff, int minni, int maxni) {
    real likelihood = 0;
    // loop over the sliced families: extract the data for each family and
    // accumulate its contribution to the log likelihood
    for (k in start:end) {
      int nik = ni[k];
      int firstfamk = firstfam[k];
      matrix[nik, nik] Sigmakc = cholesky_decompose(block(Sigma, firstfamk, firstfamk, nik, nik));
      vector[nik] Xk = segment(X, firstfamk, nik);
      vector[nik] lbk = segment(lb, firstfamk, nik);
      vector[nik] ubk = segment(ub, firstfamk, nik);
      vector[nik] lb_indk = segment(lb_ind, firstfamk, nik);
      vector[nik] ub_indk = segment(ub_ind, firstfamk, nik);
      vector[nik] uk = segment(u, firstfamk, nik);

      // likelihood computation for family k
      int M = rows(uk);
      vector[M] z;
      real prob = 0;
      for (m in 1:M) {
        if (lb_indk[m] == 0 && ub_indk[m] == 0) {
          z[m] = inv_Phi(uk[m]);
        } else {
          int km1 = m - 1;
          real v;
          real z_star;
          real logd;
          row_vector[2] log_ustar = [negative_infinity(), 0];
          real constrain = Xk[m] + ((m > 1) ? Sigmakc[m, 1:km1] * head(z, km1) : 0);
          if (lb_indk[m] == 1) log_ustar[1] = normal_lcdf((lbk[m] - constrain) / Sigmakc[m, m] | 0.0, 1.0);
          if (ub_indk[m] == 1) log_ustar[2] = normal_lcdf((ubk[m] - constrain) / Sigmakc[m, m] | 0.0, 1.0);
          logd = log_diff_exp(log_ustar[2], log_ustar[1]);
          v    = exp(log_sum_exp(log_ustar[1], log(uk[m]) + logd));
          z[m] = inv_Phi(v);
          prob += logd;
        }
      }
      likelihood += prob;
    }
    return likelihood;
  }
}
..... // data omitted
// parameters: variance components and the individual uniform variates needed to compute liabilities
parameters {
  real<lower=0, upper=1> heritability;
  real<lower=0, upper=1-heritability> sharedenvironment;
  vector<lower=0, upper=1>[N] u;
}
model {
  matrix[N, N] Sigma;
  Sigma = heritability * pedigree_matrix
          + sharedenvironment * household_matrix
          + (1 - heritability - sharedenvironment) * identity_matrix;
  target += reduce_sum(partial_sum, fam, grainsize, X, Sigma, ni, firstfam,
                       lb, ub, lb_ind, ub_ind, u, NFAM, minff, maxff, minni, maxni);
  heritability ~ beta(12, 12);
  sharedenvironment ~ beta(6, 14);
}


Thanks for any help

Andrea

How is the speed of the model if you use the partial sum function to calculate the entire log likelihood in a single go without going via reduce_sum?

Maybe the partial sum function is not written in an efficient way?

What does the serial version of the model that runs fast look like?

You can also try profiling to find out where things take time, check the manuals for that feature.
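For reference, a minimal sketch of what profiling could look like here with Stan's profile blocks (the labels are arbitrary placeholders); cmdstanr then reports the per-section timings, e.g. via the fitted object's profiles() method:

model {
  matrix[N, N] Sigma;
  profile("sigma") {
    Sigma = heritability * pedigree_matrix
            + sharedenvironment * household_matrix
            + (1 - heritability - sharedenvironment) * identity_matrix;
  }
  profile("likelihood") {
    target += reduce_sum(partial_sum, fam, grainsize, X, Sigma, ni, firstfam,
                         lb, ub, lb_ind, ub_ind, u, NFAM, minff, maxff, minni, maxni);
  }
  profile("priors") {
    heritability ~ beta(12, 12);
    sharedenvironment ~ beta(6, 14);
  }
}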

97 seconds are needed when I use partial_sum alone (no reduce_sum) for 100 iterations in 1 chain.
408 seconds are needed when I use reduce_sum for the same 100 iterations using 100 cores within 1 chain.

The partial sum function is written basically the same way in both "scenarios"; the only thing that changes is that the first loop in the function runs over "k in 1:NFAM" in the first scenario and over "k in start:end" in the second (so that only the sliced families are considered). The function may be inefficient in absolute terms, but I would not expect this relative inefficiency compared to using reduce_sum. Consider that I have only 25 families, so it is really strange, since the parallelisation should handle the 25 partial likelihoods separately and make it faster (if my interpretation is correct). Maybe I'm doing something wrong in the partial sum and it does not slice the data as I wish?
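To make the two scenarios concrete, this is roughly how the two calls differ in the model block (a sketch assuming the serial scenario reuses the same function signature):

// scenario 1 (serial): a single call over all families, no reduce_sum
target += partial_sum(fam, 1, NFAM, X, Sigma, ni, firstfam,
                      lb, ub, lb_ind, ub_ind, u, NFAM, minff, maxff, minni, maxni);

// scenario 2 (parallel): reduce_sum slices fam and sums the per-family contributions
target += reduce_sum(partial_sum, fam, grainsize, X, Sigma, ni, firstfam,
                     lb, ub, lb_ind, ub_ind, u, NFAM, minff, maxff, minni, maxni);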

I've also thought about profiling, but basically the only calculations happen inside the partial sum, so I assume that reduce_sum does not like something there; I just can't figure out what, since I am simply slicing each of the 25 families.

Regarding the question "What does the serial version of the model that runs fast look like?": it is actually the same as scenario 1, which takes 97 seconds, where the log likelihood is calculated in a single go.

I realize it's not easy to catch the issue; I just wanted to understand whether the slicing I'm doing makes sense, since the time with reduce_sum is 4x longer, which is really unexpected.

Thanks

Please start with a few cores first! Have a look at the vignette on within-chain parallelization from the R package brms to figure out how many cores to use.

Also, if you parallelize over 25 families, why do you use 100 cores? Maybe I misread something.

You are correct, I've used 100, but I thought it would simply use 25 by detecting the number of families. If that's not the case I'll try with 25. I haven't seen the brms vignette and I'm going to take a look now, thanks.

I've read the vignette and I've tried with 1, 2, 5 and 10 cores, but paradoxically it's faster with only 1 core; I don't understand how that can happen. Perhaps it does not work well with a multivariate normal. It's a pity, but I don't really understand why it should not work.

Maybe try reduce_sum_static, set the grainsize to 5 and use 5 cores as a test? Then the 25 families will be split into 5 chunks... or even make the grainsize 30 and use 1 core to see if that is any good (no parallelization would be happening), just as a sanity check.
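A minimal sketch of that test, reusing the call from the original post and only swapping in reduce_sum_static with a fixed grainsize:

// grainsize = 5: with 25 families this gives 5 chunks of 5 families each
target += reduce_sum_static(partial_sum, fam, 5, X, Sigma, ni, firstfam,
                            lb, ub, lb_ind, ub_ind, u, NFAM, minff, maxff, minni, maxni);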

Also check whether the likelihood really is the most expensive part (it should be)... profiling should still help you pin down what's going on.

Thank you again. I've got these results, which should give an overview. I've profiled the Sigma matrix (in the model block), reduce_sum_static and the priors. All these results are obtained using reduce_sum_static:

  1. grainsize = 25, cores=1. Total Time = 69s. Time for reduce_sum_static = 19s, Time for Sigma = 46s, Time for Prior = 0.005s
  2. grainsize = 5, cores=5. Total Time = 139s. Time for reduce_sum_static = 78s, Time for Sigma = 56s, Time for Prior = 0.01s
  3. grainsize = 1, cores=25. Total Time = 223s. Time for reduce_sum_static = 147s, Time for Sigma = 71s, Time for Prior = 0.01s

So it really seems that parallelisation is not only failing to help but is making things (much) slower. Moreover, with no parallelisation as in 1) I get the same time as the model without reduce_sum, so probably something goes wrong with reduce_sum when it detects more than 1 core?

Thanks for your time

What are you reporting as total time? Is it wall time or user time? Wall time is what you want, as it reports what actually elapsed on the clock. User time reports the total CPU time spent.

It is wall time, I guess; it's what is reported by cmdstanr (and by the profiling for the other components).

Could it be that the cores are not being used correctly?

I use:
model ā† cmdstan_model(ā€œmodel.stanā€, cpp_options = list(stan_threads = TRUE))

and then in model$sample(..., threads_per_chain = 25).

Just to be sure that it is not a trivial error.

Looking at the code once more suggests that Sigma is huge and that it is part of the shared arguments. This creates significant overhead. Could you rewrite this so that you slice over Sigma? That would require some reshaping, of course.
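One possible shape for this (a sketch only, assuming the pedigree and household matrices can be passed to partial_sum as data arguments together with the two scalar variance components, instead of the full N x N Sigma): build the family-specific covariance inside the loop over families, so the shared arguments no longer contain a huge matrix of parameters.

// inside the loop over families in partial_sum, replacing block(Sigma, ...);
// heritability and sharedenvironment would be scalar parameter arguments,
// pedigree_matrix and household_matrix data arguments (hypothetical signature)
matrix[nik, nik] Sigmak
  = heritability * block(pedigree_matrix, firstfamk, firstfamk, nik, nik)
    + sharedenvironment * block(household_matrix, firstfamk, firstfamk, nik, nik)
    + (1 - heritability - sharedenvironment) * diag_matrix(rep_vector(1, nik));
matrix[nik, nik] Sigmakc = cholesky_decompose(Sigmak);

Since data arguments carry no gradients, the only parameter-dependent shared arguments would then be two scalars rather than an N x N matrix.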


I've managed to create the family-specific Sigma directly, without building the full Sigma first, and it improved speed massively! Even without parallelisation it now does 500 iterations in 10 seconds, and with parallelisation in 6 seconds. Thanks a lot for your time, you've been incredibly helpful and efficient. I'm very happy.

Kind regards

Andrea
