So two slightly conflicting suggestions:
- Try to avoid declaring variables to store the result of an intermediate computation wherever possible. (Apparently more declared variables can cause a substantial slowdown for auto-differentiation.)
- Try to look for redundant computations and, instead of repeating them, store the result in a declared variable and refer to it wherever it’s needed.
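To make the second suggestion concrete, here is a minimal sketch in plain Python (not the modeling language under discussion; the function names and the toy normal log-likelihood are made up for illustration). The first version recomputes terms that do not change across iterations, the second computes them once and reuses them:

```python
import math

def log_lik_redundant(y, x, beta, sigma):
    # Hypothetical example: log(sigma) and the normalizing constant
    # are recomputed on every pass through the loop.
    total = 0.0
    for yi, xi in zip(y, x):
        total += (-math.log(sigma) - 0.5 * math.log(2 * math.pi)
                  - 0.5 * ((yi - beta * xi) / sigma) ** 2)
    return total

def log_lik_cached(y, x, beta, sigma):
    # The loop-invariant pieces are computed once and stored.
    log_norm = -math.log(sigma) - 0.5 * math.log(2 * math.pi)
    inv_sigma = 1.0 / sigma
    total = 0.0
    for yi, xi in zip(y, x):
        z = (yi - beta * xi) * inv_sigma
        total += log_norm - 0.5 * z * z
    return total
```

Both functions return the same value up to floating-point rounding. Whether the stored variables pay for themselves under auto-differentiation is exactly the trade-off between the two suggestions, so it is probably worth profiling both ways.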
And presumably you’re aware of the various parallelization options?