Improving the performance of a model with lots of iteration

justinrporter · December 19, 2021, 10:51pm

HI all,

I’ll preface by saying I’m pretty new to MCMC and brand new to Stan.

I’m working on a multicompartment pharmacokinetic model that need to take as input 1) arbitrary, sparse drug doses over time and fit them to 2) arbitrary, sparse drug measurements. It needs to do this simultaneously for (ideally) up to thousands of patients, hundreds of drug administrations, and maybe a dozen serum level measurements. Usually the events span something like 5-50 days.

The kinetics are linear, so it could be represented as an ODE, but my current (working) implementation treats it as discrete, fitting a rate matrix and multiplying to get the concentration at every minute.

The full model is attached (example data here), but the slow step is (no surprise) this:

  // an array of rate matrices (one per patient)
  array[N_pts] matrix[3, 3] K;
  // 2D array (patients x time) of concentrations in each compartment
  array[N_pts, T] row_vector[3] y_pred; 
  // in-coming concentrations to each compartment at each time
  array[N_pts, T] row_vector[3] C_in;

  for (p in 1:N_pts){
    for (t in 2:T) {
      y_pred[p, t] = (y_pred[p, t-1] * K[p]) + C_in[p, t-1];
    }
  }

Some options I thought of:

trying to vectorize this computation across patients (is there some specialized function for this?)
skipping times there aren’t doses or measurements with matrix exponentiation
some other way of inlining/avoiding the copying that happens in the inner loop (looking at the C++, theres a bunch of deepcopy statements which seems like it’d be pretty slow.
actually modeling it as an ODE. This makes me nervous as 1) ODE solvers seem complicated/slow 2) I never took diff eq in college and therefore have a complex.

I’m hopeful there some obvious/commonly accepted answer here; guidance very appreciated!

rok_cesnovar · December 20, 2021, 11:18am

The deep copies happen because you are assigning to a variable that is used on the right-hand side.

For example if you are doing A = A * B there needs to be a deep copy, because changing chunks of A during the multiply would produce the wrong result.

justinrporter · December 20, 2021, 2:41pm

Thanks for responding!

you are assigning to a variable that is used on the right-hand side.

Interesting. Of course, I’m actually doing A[t] = A[t-1] * B so there is not a need to copy A[t-1] since it is unaffected by the assignment. Although I can see how that might not be obvious to the (pre?)compiler the way I’ve written it.

I saw this construction in the Stan manual for AR1 models:

model {
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}

And tried to adapt it as follows:

    for (p in 1:N_pts){
       y_pred[p, 2:T] += (y_pred[p, 1:(T-1)] * rep_array(K[p], T-1));
    }

But this is rejected on the basis that:

[...]
    73:        y_pred[p, 2:T] += (y_pred[p, 1:(T-1)] * rep_array(K[p], T-1));
                                  ^
[...]
Ill-typed arguments supplied to infix operator *. Available signatures:
[...]
Instead supplied arguments of incompatible type: array[] row_vector, array[] matrix.

Is there a way to multiply an array of matrices by an array of vectors?

Advice is very appreciated!

Thanks!

jsocolar · December 20, 2021, 4:09pm

Just want to double check: do I understand correctly that you are modeling multiple timeseries as long as 50 days at minute-level resolution via repeated matrix multiplication? So a single timeseries involves 72000 matrix multiplications?

justinrporter · December 20, 2021, 4:25pm

Just want to double check: do I understand correctly that you are modeling multiple timeseries as long as 50 days at minute-level resolution via repeated matrix multiplication? So a single timeseries involves 72000 matrix multiplications?

The 50 days might be a bit exaggerated (maybe more like 25 days) but basically the answer to your question is yes, that’s what I’m doing.

It sounds it’s not what I should be doing, though and I’m hoping that’s where y’all can help. :)

Edit:
The vast majority of multiplications could be replaced by a single K^[big number], but it would require adding a bunch of branching and maybe break some vectorization opportunities. Since the “events” (drug administrations, serum level measurements) don’t occur uniformly within or between time series.

Alternatively I could switch over to treating it like an ODE, but I don’t have a good sense of how many matrix multiplications are worth an ODE solver…

Edit edit:
I also expect/am prepared to need MPI on a large cluster (+/- GPUs), but would like to stave it off as long as possible.

justinrporter · December 21, 2021, 8:00pm

Okay. So I took your suggestion of switching to matrix power to iterate the matrix. I reformulated the input data in terms of “events” and used matrix_power to skip from event to event. The inner loop looks like this now:

  profile("rollout") {
    for (e in 1:N_events){
      int p = event_pt[e];
      c_pred[e] = (c_prev[p] * matrix_power(K[p], event_t[e]-t_prev[p])) + c_in[e];

      c_prev[p] = c_pred[e];
      t_prev[p] = event_t[e];
    }
  }

On two patients with 14 events per patient (one dose and 13 observations), we get:

Testing matrix exponentiation vs. repeated multiplication

Using `matrix_power` in the inner loop

 Elapsed Time: 47.49 seconds (Warm-up)
               46.731 seconds (Sampling)
               94.221 seconds (Total)

name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
rollout,140320908879680,86.6827,31.1306,55.5522,12200936,59895504,138647,1

I haven’t gotten to being slick about re-using matrix_power results as you suggested because it’s pretty complicated and wanted to report back with this result first.

Using raw matrix multiplication in the inner loop

 Elapsed Time: 148.353 seconds (Warm-up)
               162.844 seconds (Sampling)
               311.197 seconds (Total)

name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
rollout,139920360060736,234.348,144.421,89.9264,1158389532,3475168596,160087,1

(I also tried manually removing the deepcopy from the C++ and that improved it about 10%).

Obviously, this isn’t quite as dramatic as I was hoping. And it makes me skeptical that it’s just a function of the number of matrix multiplications I’m doing.

As a comparison, the implementation with matrix multiplications in numpyro took 28 seconds.

What is actually slow?

Matrix multiplication itself on modern CPUs is really fast, right? Almost always it seems like if your task is slow, the problem is that your matrix math isn’t being done correctly (like you’ve got stray copies or whatever happening).

Obviously, Stan is also doing a lot more than just multiplying matrices (in particular tracking gradients seems like it’d be nontrivial), but that obscures from me why this is slow (and how numpyro is able to beat Stan by a factor of >10 for the iterated matrix case).

Eigendecomposition?

One thought I had is that

c_t+j = K^jc_{t} = (S^{-1}\Lambda S)^jc_{t} = S^{-1}\Lambda^j Sc_{t}

if S is the matrix of eigenvectors of K and \Lambda is the diagonal matrix of eigenvalues of K. So,

\Lambda = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix} ^ j = \begin{pmatrix} \lambda_1^j & 0 & 0 \\ 0 & \lambda_2^j & 0 \\ 0 & 0 & \lambda_3^j \end{pmatrix}

So \Lambda^j should be really fast to compute if you have an eigenvector eigenvalue decomposition of K. But I can only find eigenvectors_sym (for symmetric matrices). Maybe this formulation would avoid some of the complexity in the gradients or something?

Vectorization

The loop

    for (p in 1:N_pts){
      for (t in 2:T) {
        y_pred[p, t] += (y_pred[p, t-1] * K[p]);
      }
    }

ought to be vectorized as

for (t in 2:T) {
  y_pred[:, t] += (y_pred[:, t-1] * K[:]);
}

or

for (p in 1:N_pts){
  y_pred[p, 2:T] += (y_pred[p, 1:(T-1)] * rep_array(K[p], T-1));
}

Right? Neither of those code blocks compile though, of course…

Thoughts? Thanks for all your help guys!

jsocolar · December 21, 2021, 8:55pm

Couple of comments:

As you’ve surmised, the matrix_power function should be radically more efficient if you can find a way to replace big chunks of repeated multiplications. I don’t know what algorithm Stan uses, or anything about matrix algebra algorithms generally, but it seems that you should be able to do it in \mathcal{O}(\log_2{}(k)) multiplications by treating it like a sports bracket and just doing one multiplication per round.
If the transition matrix is the same in multiple timeseries, or in multiple parts of a single timeseries between different pairs of observations, then you can re-use multiplications. (This will be especially true if the incoming doses are truly sparse; in timesteps with an arbitrary dose the transition matrix changes arbitrarily). For example, if I have a timeseries with measurements at 0, 11, and 30 minutes, and another one with measurements at 0 and 15 and 19 minutes, then all I really need is the eleventh, fifteenth, nineteenth, and thirtieth powers of the matrix, but the thirtieth is just the product of the eleventh and the nineteenth. And to get the 15th, I can start from the eleventh and then just do matrix_power with exponent 4 to make up the difference.
So you can build the transition matrices that you need to hop straight from one observation (or end of a dose) to the next observation (or beginning of a dose) iteratively by re-using the transition matrices for shorter transitions and multiplying them together until you don’t have a small enough matrix pre-computed and you can finish things off with a matrix_power.
If you’re clever with your indexing (sort the data by increasing time-since-last-dose) then you can build these transition matrices quickly and efficiently. Then you’re left with dealing with the transitions involving doses and the indexing problem of keeping track which timestep-with-a-dose you need to multiply by which transition matrix to obtain which timestep-prior-to-a-dose or timestep-with-a-measurement.
Lastly, if you think that you might be able to get away with coarsened temporal resolution, then there’s an obvious savings to be had.
Lastly #2, I don’t know much about ODE solvers, but I do suspect that they might be your friend on this particular problem. Tagging @wds15

justinrporter · December 23, 2021, 5:37pm

Not sure how my post got threaded in before your post that I was responding to, but here we are.

Upon reflection, I think that with the matrix_power stuff, each series is running fast enough. What I’m interested in is the parallelization.

Each chain is pretty much embarrassingly parallel once K has been chosen, so it seems like I should be able to use map_rect to split the matrix multiplications and aggregation of the target log density for each trajectory independenly and then reduce it at the end.

The thing is, since K has type array[N_pts] matrix[3, 3], how can it fit into a the function signature required for map_rect, which requires a bunch of arrays of reals, ints, or a vector?

vector f(vector phi, vector theta,
         data array[] real x_r, data array[] int x_i);

AWoodward · December 27, 2021, 12:10pm

Hello, regarding the PK: do you have enough knowledge of the system to suggest how much time you need to model around each sample? If for example you knew that both the absorption rate and the elimination rate were relatively fast compared to the inter-dosing or inter-sampling interval, then you may be able to reduce your problem to a sequence of independent dosing events per subject. Similarly if you are using steady-state conditions then you could encode that rather than modelling the whole dosing sequence.

For a system of this size I think the ODE implementation might be too slow without reducing it somehow. I have a two-compartment model with two analytes (joint model for both responses) that takes about 12 hours to sample sufficiently for 10 subjects with dense data. Sampling for sparse systems could be a lot slower, especially without strong priors and good initial conditions to keep the ODE solver on track.

justinrporter · December 30, 2021, 7:07pm

do you have enough knowledge of the system to suggest how much time you need to model around each sample? […] you may be able to reduce your problem.

I’m not totally sure I understand what you’re proposing–just simplifying K to include fewer parameters? I think this is a great idea (which I have to say I had not thought of) but I think unfortunately not. We are interested in the statistical relationship between single-dose and steady-state pharmacokinetics, and the model has to accept an arbitrary dosing interval…

AWoodward · January 9, 2022, 6:45am

Sorry if that was unclear. I’m not suggesting a different model, and in fact with a sparse system the appropriate model would probably need to be selected in advance.

If there are many more doses administered than observations made (in the post you mention dozens of observations from hundreds of doses), then many of the doses will be redundant because they are not close enough to an observation to have much influence. Modeling the entire dosing sequence for some subject then may be excessively expensive (as for the ODE) or demand a complicated implementation (many matrix operations). So you could look for opportunities to break up the time series into a series of pieces for each subject, rather than modelling the whole sequence.

andrjohns · January 9, 2022, 8:54am

I believe you can pack the matrix multiplications/operations into fewer (but larger) operations, which will be dramatically more amenable to gpu-acceleration.

Looking at the initial code that you posted:

  for (p in 1:N_pts){
    for (t in 2:T) {
      y_pred[p, t] = (y_pred[p, t-1] * K[p]) + C_in[p, t-1];
    }
  }

And noting that y_pred[p,1] is initialised to zeros:

    for (p in 1:N_pts){
      y_pred[p, 1] = [0, 0, 0];
    }

Then the operation solely depends on K and C_in. If we consider just 5 timepoints for one individual as an example, this operation expands to:

y_5 = c_1k^3+c_2k^2+c_3k+c_4

Which follows the general form:

y_t = c_1k^{t-2}+\cdot\cdot\cdot+c_{t-1}K^{t-t)}; t \ge 2

You could then restructure your data to consider every individual at a given time:

  // an array of rate matrices (one per patient)
  matrix[N_pts * 3, 3] K;
  // 2D array (patients x time) of concentrations in each compartment
  array[T] row_vector[N_pts * 3] y_pred; 
  // in-coming concentrations to each compartment at each time
  array[T] row_vector[N_pts * 3] C_in;

And implement the above general form as a loop over T:

    y_pred[1] = rep_row_vector(0, N_pts * 3);
    for (t in 2:T) {
      y_pred[t] = y_pred[t-1] + C_in[t-1] * matrix_power(K, t-2);
    }

Noting that matrix_power(matrix, 0) returns an identity matrix

andrjohns · January 9, 2022, 10:44am

Ack, just realised that the matrix being passed to matrix_power isn’t square. What you could do instead is define a function that performs the multiplication in blocks:

functions {
  matrix matrix_multi_blocked(matrix A, matrix orig) {
    matrix[rows(A), cols(A)] rtn;
    int cols_a = cols(A);
    int i = 1;
    while(i < rows(A)) {
      rtn[i:(i + cols_a - 1)] = A[i:(i + cols_a - 1)] * orig[i:(i + cols_a - 1)];
      i += cols_a;
    }
    return rtn;
  }
}

Then implement the loop over T as:

  y_pred[1] = rep_row_vector(0, N_pts * 3);
  y_pred[2] = C_in[1];
  y_pred[3] = y_pred[2] + C_in[2] * K;

  matrix[N_pts * 3, 3] K_power = K;
    for (t in 4:T) {
      K_power = matrix_multi_blocked(K_power, K);
      y_pred[t] = y_pred[t-1] + C_in[t-1] * K_power;
    }

It’s a bit clunky, but might get the job done

justinrporter · January 13, 2022, 8:20pm

Yes! I think this is definitely a path to pursue.

Thank you for your thoughtful, thorough response!! I will definitely give your solution a try, hopefully in the next few days.

In the meantime, I am a little confused why the rate matrices need to be represented as matrix[N_pts * 3, 3] K (which results in the need for the fancy matrix_multi_blocked method you wrote), and can’t be array[N_pts] matrix[3, 3] K.

Specifically, I feel like I ought to be able to write something like the “Slicing for Efficiency” section for AR1 models suggests, but for an array of matrices. Maybe something like:

y_pred[:, 2:T] = y_pred[:, 1:T] * K[:, 1:T] + C_in[1:T]

It seems like the operator * isn’t defined for array[row_vector] and array[matrix], but I can’t think of a reason it shouldn’t be? Am I missing something? Surely I’m not the first person to want this.

Anyway thanks everyone for your suggestions!

andrjohns · January 14, 2022, 12:58am

By stacking the matrices row-wise, rather than using an array of matrices, you can then use a single matrix multiplication.

Stan uses the c++ type std::vector to store arrays of data, which unfortunately means that each matrix within the array is stored in a different space in memory. Because the array elements are stored separately (not contiguously), we can’t treat them as a single large matrix for multiplication, and will always have to loop through the elements. In other words, there would be no performance benefit of an array[matrix] * array[matrix] signature because it would always have to be a loop (or similar).

We’ve discussed an approach to replacing this in this math issue, but it’s a pretty significant amount of work and unlikely to happen any time soon

Topic		Replies	Views
Slow hierarchical ODE model Modeling rstan	3	529	January 25, 2023
Fitting ODE models: best/efficient practices? General ode	85	5462	March 29, 2021
Code takes too long to run, despite small dataset size Modeling	7	689	April 11, 2022
Code review of slow hierarchical (ode) model Modeling fitting-issues , performance , epidemiology	3	782	November 19, 2020
Very slow MCMC sampling in stan_jm() due to large number of individuals Modeling fitting-issues , performance , mixed-model , rstanarm , paralellization	6	1210	April 30, 2021