Understanding autodiff's impact on efficiency

I’ve a Kalman filter implementation in Stan that seems to cause a large autodiff expression tree. The memory usage of the process running the program increase about 5 times when the Kalman filter updating is turned on compared to using a fixed Kalman filter.

Is my assumption correct that memory usage grows with the size of the autodiff expression tree? Is there a way of seeing the expression tree or some other way of profiling its impact on performance?

Below is picture from the Wikipedia page on Kalman filters. Parameters enters the system as elements of the F, B, Q and u. If I understand the autodiff correctly, it will build up an expression tree tracking the parameters through all that matrix algebra in the following update steps? What kind of operations generally grows the expression tree? The P and S are covariance matrices, so it should be possible do some kind of decomposition, does such of use symmetry generally benefit the autodiff?

A workflow that works resonable well is to: a) first run optimization to find parameter modes; b) calculate the Kalman filter using the parameter modes; c) then run HMC with a fixed Kalman filter. However, the Kalman filter will no longer be optimal when the HMC is far from the parameter modes, which seems to shrink the parameter space that gets explored. Is it possible (and does it make sense) to fit a Kalman filter at start of each iterations, but keep it as fixed data in following leapfrog steps?

Are you using gaussian_dlm_lpdf or your own implementation of the Kalman filter?

I’m using an own implementation in Stan (not C++), as I need time varying matrices. I looked at the gaussian_dlm_obs_lpdf code, the filter math is the same as in my implementation just different notation.

The gaussian_dlm_obs_lpdf function should save a bit of autodiff compared to Stan code with equivalent math, but it is not as optimized as some of the other lpdf functions and is restricted to the most basic case. If that is insufficient for you, consider writing a C++ function that generalizes it.

I see that generalizing the gaussian_dlm_obs_lpdf to time-varying matrices is on the TODO list. But, I’ve too limited knowledge in C++ and the autodiff stuff to understand how to write efficient code in C++. I’ve hunch that the large memory usage has more to do Kalman filter updating math than which language it is written in.

Does large memory usage imply a large autodiff expression tree?

Loosely yes, but the C++ functions in the Stan Math library tend to do the derivatives analytically when possible in which case less is put into the autodiff expression tree. My guess is that someone has worked out updating equations for the gradient of the Kalman filter, in which case you could compute the values and gradients with doubles and attach the latter to the former with the precomputed_gradients function in Stan Math.

Ok, thanks for the input.

https://doi.org/10.1007/s00180-012-0352-y shows how to derive the partial derivatives. There seems to be an efficient matrix fraction decomposition method in the special case of fixed matrices. However, the general method with time-varying matrices seems very computational intensive, so I’m not sure it would do any better than autodiff.

Would it be worth testing out a matrix decomposition based filtering method? Fewer matrix elements should make it easier for the autodiff?

Still trying to get some metrics on performance, to better understand where things can be improved. I estimate the time to do the likelihood computation during sampling as:

t_ll_tot = sum(get_num_leapfrog_per_iteration(stan_fitobject)) * t_ll_avg

where t_ll_avg is the average time it takes to compute the likelihood function.

With 500 iterations (whereof 250 warmups) and the Kalman filter updating turned off I get total timing of 200 seconds for sampling. The average leapfrogs per iterations is 14, which seems to indicate 60 seconds spent in the likelihood function. Is it correct to assume that roughly 200-60 = 140 seconds are spent on “Stan overhead”? Or is the likelihood function computed more often than sum(get_num_leapfrog_per_iteration(stan_fitobject)) indicates?

How would a function signature for a Kalman filter doing it own gradients calculation look like?

the existing one looks like this: gaussian_dlm_obs_lpdf(matrix y | matrix F, matrix G, matrix V, matrix W, vector m0, matrix C0). But the parameters for which the gradients would need to be calculated are within the matrix elements: F(\theta), G(\theta), etc. Would you need to do something like kalman_filter_lpdf(matrix y | real[] theta, real[] data, ...), where … would need to hold information about how the matrices are constructed?