Okay, so I took your suggestion of switching to matrix powers to iterate the matrix. I reformulated the input data in terms of “events” and used `matrix_power` to skip from event to event. The inner loop now looks like this:

```
profile("rollout") {
  for (e in 1:N_events) {
    int p = event_pt[e];
    c_pred[e] = (c_prev[p] * matrix_power(K[p], event_t[e] - t_prev[p])) + c_in[e];
    c_prev[p] = c_pred[e];
    t_prev[p] = event_t[e];
  }
}
```
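For anyone following along, here's a minimal numpy mirror of that loop (not Stan) — the shapes and event data below are invented toy values, just to make the event-skipping indexing concrete:

```python
import numpy as np

# Sketch: the same event rollout in numpy, jumping from one event time to
# the next with a single matrix power. All toy data below is made up.
rng = np.random.default_rng(0)
d = 3                                          # number of states (assumed)
N_pts, N_events = 2, 6
K = [np.eye(d) * 0.9 for _ in range(N_pts)]    # one transition matrix per patient
event_pt = [0, 1, 0, 1, 0, 1]                  # which patient each event belongs to
event_t  = [1, 1, 4, 3, 9, 8]                  # event times, nondecreasing per patient
c_in = rng.random((N_events, d)) * 0.1         # input (e.g. dose) at each event

c_prev = np.zeros((N_pts, d))
t_prev = np.zeros(N_pts, dtype=int)
c_pred = np.zeros((N_events, d))

for e in range(N_events):
    p = event_pt[e]
    # skip straight from the previous event with one matrix power
    step = np.linalg.matrix_power(K[p], event_t[e] - t_prev[p])
    c_pred[e] = c_prev[p] @ step + c_in[e]     # row-vector convention, as in the Stan code
    c_prev[p] = c_pred[e]
    t_prev[p] = event_t[e]
```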

On two patients with 14 events per patient (one dose and 13 observations), we get:

# Testing matrix exponentiation vs. repeated multiplication

### Using `matrix_power` in the inner loop

```
Elapsed Time: 47.49 seconds (Warm-up)
46.731 seconds (Sampling)
94.221 seconds (Total)
name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
rollout,140320908879680,86.6827,31.1306,55.5522,12200936,59895504,138647,1
```

I haven’t gotten around to being slick about re-using `matrix_power` results as you suggested, because it’s pretty complicated and I wanted to report back with this result first.

### Using raw matrix multiplication in the inner loop

```
Elapsed Time: 148.353 seconds (Warm-up)
162.844 seconds (Sampling)
311.197 seconds (Total)
name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
rollout,139920360060736,234.348,144.421,89.9264,1158389532,3475168596,160087,1
```

(I also tried manually removing the deepcopy from the generated C++, which improved things by about 10%.)

Obviously, this isn’t quite as dramatic an improvement as I was hoping for, and it makes me skeptical that the cost is just a function of the number of matrix multiplications I’m doing.

As a comparison, the implementation with matrix multiplications in numpyro took 28 seconds.

## What is actually slow?

Matrix multiplication itself on modern CPUs is *really* fast, right? Almost always, it seems like if your task is slow, the problem is that your matrix math isn’t being done efficiently (e.g. you’ve got stray copies happening somewhere).

Obviously, Stan is also doing a lot more than just multiplying matrices (in particular, tracking gradients seems nontrivial), but that makes it hard for me to see why this is slow (and how numpyro is able to beat Stan by a factor of >10 on the iterated-matrix case).

## Eigendecomposition?

One thought I had is that

c_{t+j} = K^j c_t = (S^{-1} \Lambda S)^j c_t = S^{-1} \Lambda^j S c_t

where S is the matrix of eigenvectors of K and \Lambda is the diagonal matrix of eigenvalues of K. So,

\Lambda^j = \begin{pmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{pmatrix}^j = \begin{pmatrix}
\lambda_1^j & 0 & 0 \\
0 & \lambda_2^j & 0 \\
0 & 0 & \lambda_3^j
\end{pmatrix}

So \Lambda^j should be really fast to compute if you have an eigenvector/eigenvalue decomposition of K. But I can only find `eigenvectors_sym` (for symmetric matrices). Maybe this formulation would avoid some of the complexity in the gradients or something?
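To convince myself the identity holds even for a nonsymmetric K, here’s a quick numpy sketch (note numpy’s `eig` uses the convention K = V \Lambda V^{-1}, with eigenvectors as the *columns* of V, and the eigenvalues can come out complex):

```python
import numpy as np

# Sketch: K^j via eigendecomposition vs. repeated multiplication,
# for a random (nonsymmetric, almost surely diagonalizable) K.
rng = np.random.default_rng(0)
K = rng.random((3, 3))

# numpy convention: K = V @ diag(lam) @ inv(V); lam and V are generally
# complex for nonsymmetric K.
lam, V = np.linalg.eig(K)
V_inv = np.linalg.inv(V)

j = 7
# Only the diagonal gets raised to the j-th power -- the cheap part.
K_j = V @ np.diag(lam**j) @ V_inv

ref = np.linalg.matrix_power(K, j)
print(np.allclose(K_j, ref))  # imaginary parts vanish up to round-off
```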

## Vectorization

The loop

```
for (p in 1:N_pts) {
  for (t in 2:T) {
    y_pred[p, t] += (y_pred[p, t-1] * K[p]);
  }
}
```

ought to be vectorized as

```
for (t in 2:T) {
  y_pred[:, t] += (y_pred[:, t-1] * K[:]);
}
```

or

```
for (p in 1:N_pts) {
  y_pred[p, 2:T] += (y_pred[p, 1:(T-1)] * rep_array(K[p], T-1));
}
```

Right? Neither of those code blocks compiles, though, of course…
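In numpy, the batching I have in mind looks like the sketch below (invented shapes, not my real model) — this is roughly the kind of thing numpyro/JAX can do with `vmap`/`einsum`: the time loop stays, but all patients advance in one batched multiply per step.

```python
import numpy as np

# Sketch: batch the patient loop out of the rollout. Shapes are invented;
# only the time loop remains, and every patient advances in one einsum.
rng = np.random.default_rng(1)
N_pts, T, d = 4, 10, 3
K = rng.random((N_pts, d, d)) * 0.1    # one transition matrix per patient
y = rng.random((N_pts, T, d)) * 0.01   # pre-filled with inputs, as in the Stan loop
y0 = y.copy()                          # keep the raw inputs for checking

for t in range(1, T):
    # row-vector convention, matching `y_pred[p, t-1] * K[p]` above
    y[:, t, :] += np.einsum('pi,pij->pj', y[:, t-1, :], K)
```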

Thoughts? Thanks for all your help guys!