Let’s assume we have a target density, \pi(\theta), defined in a Stan file. Let’s assume that \theta is in 1000 dimensions. We are developing an algorithm that uses a projection, \theta=\theta_0+\phi\delta, where \phi might be in 5 dimensions (so \delta would then be 5×1000).
To integrate this algorithm with Stan, we want to compile the Stan file once such that, at each stage in the algorithm, we can then evaluate \pi(\theta_0+\phi\delta) and \frac{d\pi(\theta_0+\phi\delta)}{d\phi} for a given \theta_0, \delta and \phi without calculating any derivatives in 1000 dimensions (which we assume would be computationally wasteful).
Our proposed solution is to automate the generation of a second Stan file that declares \theta_0 and \delta as 'Data' and \phi as 'Parameters', compile it once, and then use the copy constructor to reinitialise the resulting model object (so that we can 'change' \theta_0 and \delta without recompiling).
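For concreteness, the generated file might look something like this (a minimal sketch only; we've put a standard-normal stand-in where the real \pi(\theta) would go, and the names are just illustrative):

```stan
data {
  int<lower=1> D;            // dimension of theta, e.g. 1000
  int<lower=1> K;            // dimension of phi, e.g. 5
  vector[D] theta_0;         // current base point (fixed for this fit)
  matrix[K, D] delta;        // projection directions (fixed for this fit)
}
parameters {
  row_vector[K] phi;         // the only quantity Stan differentiates with respect to
}
transformed parameters {
  vector[D] theta = theta_0 + (phi * delta)';   // theta = theta_0 + phi * delta
}
model {
  target += std_normal_lpdf(theta);   // stand-in for the original log density log pi(theta)
}
```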
We hope that makes sense and are interested in comments on this approach.
This seems to be a sensible approach, because running HMC on a 5-dimensional space, rather than a 1,000-dimensional one, is typically easier.
That said, if all you care about is computing the gradient, I don’t expect the proposed scheme will be faster than computing the gradient with respect to \theta. This is because Stan uses reverse-mode automatic differentiation, so whether the gradient has 5 or 1,000 elements, we compute it using one reverse-mode sweep.
The question becomes: does the proposed approach make that one sweep cheaper? Even when differentiating with respect to \phi, the sweep computes
w = \partial \log \pi(\theta) / \partial \theta,
which is your 1,000-dimensional gradient. You then get your 5-dimensional gradient by computing the codirectional derivative
w \cdot \frac{\partial \theta}{\partial \phi},
where the partial derivative here is a Jacobian matrix (which we do not compute explicitly). So computing the gradient with respect to \phi is actually slightly more expensive than computing it with respect to \theta. But cheaper to store…
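To spell it out with the dimensions from the original post (\phi in 5 dimensions, \delta being 5×1000): since \theta = \theta_0 + \phi\delta, we have \partial\theta_j/\partial\phi_i = \delta_{ij}, so \nabla_\phi \log\pi(\theta) = \delta\,w. The full 1,000-dimensional w is still formed during the reverse sweep; the only extra work is the small matrix-vector product \delta w.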
@charlesm93: thank you but I think this means I need to read up on exactly how reverse-mode automatic differentiation works. I imagined that ‘Data’ would behave like a constant, though I can imagine that \phi appearing in an expression in 1000 places would be problematic.
What if \delta was chosen to emulate Gibbs-like moves such that \phi only appeared in place of 5 elements of \theta?
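(For example, each row of \delta could be a unit basis vector e_{j_k}, so that \theta agrees with \theta_0 everywhere except the 5 selected coordinates, where \theta_{j_k}=\theta_{0,j_k}+\phi_k.)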
If things are set to data, then computation usually becomes a lot cheaper. It’s a question of the final AD tree size, which you can check quickly with the profiling facility.
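For example (a sketch, assuming a recent Stan version with profile blocks, and reusing the names from the generated-file sketch above), wrapping the relevant statements lets you read off timing and AD stack sizes per section:

```stan
model {
  vector[D] theta;
  profile("projection") {
    theta = theta_0 + (phi * delta)';   // builds AD nodes for all D elements of theta
  }
  profile("log-density") {
    target += std_normal_lpdf(theta);   // stand-in for the original log density
  }
}
```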
Right. But in this particular example, you still need to declare \theta as a var type, since it depends on \phi, which is var. So the AD tree should be about the same size. If you declared \theta_0 and \delta as var, then you would get a much larger AD tree and inefficient differentiation, but I don’t believe that was on the table to begin with.
Ok. I think I can see why we are talking at cross purposes. If we had, for example, \pi(\theta)=\exp(\theta^Ty) and wrote it as \pi(\theta)=\exp((\theta_0+\delta\phi^T)^Ty), then we’d have to calculate \theta_0+\delta\phi^T every time we calculated logprob. However, if we wrote it as \pi(\theta)=\exp(\theta_0^Ty+\phi(\delta^Ty)), we could compute \theta_0^Ty and \delta^Ty once up front (assuming post-edit I got the maths right). Spotting that automatically might be possible but seems challenging. Make sense?
This makes sense if \log \pi(\theta) and its gradient are evaluated many times for varying values of \phi. In that case, you’d compute \theta_0^T y and \delta^T y once (so in transformed data) and calculations involving the parameters would be performed many times, e.g. once per step the leapfrog integrator takes if you’re running HMC.
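Something like this sketch, using the \exp(\theta^Ty) example above (here \delta is taken to be K×D as in the original post, so the precomputed vector is \delta y rather than \delta^Ty; the target is linear in \phi and hence improper, but it shows where each computation lives):

```stan
data {
  int<lower=1> D;
  int<lower=1> K;
  vector[D] y;
  vector[D] theta_0;
  matrix[K, D] delta;
}
transformed data {
  real theta0_y = dot_product(theta_0, y);   // theta_0' * y, computed once, no AD
  vector[K] delta_y = delta * y;             // delta * y, computed once, no AD
}
parameters {
  row_vector[K] phi;
}
model {
  target += theta0_y + phi * delta_y;        // theta' * y without ever forming theta
}
```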
So a single evaluation and differentiation of \log \pi(\theta) is not cheaper, but many evaluations and gradient calculations taken together do indeed become cheaper.
Yes, I’d assumed we’d run NUTS for fixed \theta_0^Ty and \delta^Ty, so we want to calculate \log \pi(\theta) for many values of \phi (given each fixed \theta_0^Ty and \delta^Ty).
@charlesm93: Am I right to think Stan wouldn’t automatically “spot” that it will aid runtime to exploit the fact that (\theta_0 + \delta\phi^T)^Ty=\theta_0^Ty + \phi (\delta^Ty)? I suspect we’d need to do that code optimisation externally (assuming that we’d want the user to only see a definition which considers \pi(\theta), and then automate the creation of a Stan file that reparameterises in terms of \phi). That’s fine, but I’d rather not figure out how to do that if I don’t need to!
I don’t think Stan would spot this automatically. Stan pretty much runs what you write. I could be wrong but I’m fairly confident you’d want to use the transformed data block.