Surprising memory usage with eigendecomposition

#1

I have made a simple model that constructs a symmetric Jacobian for a linear ODE system and applies eigenvalues_sym and eigenvectors_sym, for a 164-dimensional system. The expressions are very simple, yet memory usage saturates immediately: top shows 44 GB allocated on a 16 GB system, even though the input Rdump file is only 400 KB.

Are there known memory issues with these functions? For reference, I’ve posted the code on this gist.

Edit: I checked the scaling of memory usage with problem size, and I’m sure I can’t use this as-is. Since I only need the leading eigenvalues and eigenvectors, I will find another way to compute them.
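For what it’s worth, if only the leading eigenpair is needed, power iteration sidesteps the autodiffed eigensolver entirely, since it’s just repeated matrix-vector products. A NumPy sketch (the function name and tolerances here are my own, not Stan code):

```python
import numpy as np

def leading_eigenpair(A, max_iters=10000, tol=1e-12):
    """Power iteration: largest-magnitude eigenpair of a symmetric matrix A."""
    n = A.shape[0]
    v = np.ones(n) / np.sqrt(n)      # starting vector
    lam = 0.0
    for _ in range(max_iters):
        w = A @ v
        lam_new = v @ w              # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam_new, v
```

Written with a fixed number of matrix-vector products, the same idea should autodiff cheaply, since each step is ordinary matrix algebra.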

0 Likes

#2

Not a bad idea.

Those functions are autodiffed now, though I’m surprised they blow up that badly.

If you scan through this, it shows you how to do eigenvalues_sym: http://htmlpreview.github.io/?https://github.com/bbbales2/stancon_2018/blob/master/rus.html

I don’t think the derivatives are totally right if you have duplicate eigenvalues and such, but that should get you started.

Sounds like you’d need to implement the eigenvector derivatives yourself too. I liked this derivation: http://gifi.stat.ucla.edu/speed/deleeuw_R_07c.pdf

0 Likes

#3

Thanks for the links! I see you have N=10 using the built-in eigensolver. I also did some scaling tests, and that size works fine. It’s really above N=50 that I see excessive memory usage, which now makes sense looking at the implementation.

By comparison, the Python autograd package implements the gradient of eigh (the equivalent of eigenvectors_sym) explicitly:

```python
import autograd.numpy as anp

def T(x):
    return anp.swapaxes(x, -1, -2)   # transpose of the last two axes

_dot = anp.matmul                    # (batched) matrix product

def grad_eigh(ans, x, UPLO='L'):
    """Gradient for eigenvalues and vectors of a symmetric matrix."""
    N = x.shape[-1]
    w, v = ans              # Eigenvalues, eigenvectors.
    def vjp(g):
        wg, vg = g          # Gradient w.r.t. eigenvalues, eigenvectors.
        w_repeated = anp.repeat(w[..., anp.newaxis], N, axis=-1)
        # F[i, j] = 1 / (w[j] - w[i]) off the diagonal, 0 on the diagonal.
        off_diag = anp.ones((N, N)) - anp.eye(N)
        F = off_diag / (T(w_repeated) - w_repeated + anp.eye(N))
        # Eigenvalue term plus eigenvector term, mapped back to matrix space.
        return _dot(v * wg[..., anp.newaxis, :] + _dot(v, F * _dot(T(v), vg)), T(v))
    return vjp
```

which comes from this paper and corresponds to eq. (12) in your second link.
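As a quick sanity check on that formula, here’s a plain-NumPy re-implementation (my own code, not autograd’s). For a symmetric matrix, the gradient of the sum of the eigenvalues is the identity matrix, so pulling back wg = 1, vg = 0 through the vjp should return I:

```python
import numpy as np

def T(a):
    return np.swapaxes(a, -1, -2)

def eigh_vjp(w, v, wg, vg):
    """Same formula as grad_eigh above, in plain NumPy."""
    N = w.shape[-1]
    w_rep = np.repeat(w[..., np.newaxis], N, axis=-1)
    off_diag = np.ones((N, N)) - np.eye(N)
    F = off_diag / (T(w_rep) - w_rep + np.eye(N))
    return (v * wg[..., np.newaxis, :] + v @ (F * (T(v) @ vg))) @ T(v)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                       # random symmetric test matrix
w, v = np.linalg.eigh(A)
g = eigh_vjp(w, v, np.ones(5), np.zeros((5, 5)))
print(np.allclose(g, np.eye(5)))        # True: d(sum of eigenvalues)/dA = I
```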

Would this be a useful PR to Stan’s math library?

Edit: the derivative formulas make sense to me now; I’m just not sure how easy they would be to implement in Stan.

1 Like

#4

Sorry for the radio silence. I screwed up and wasn’t looking at the unread tab to find unread messages :/.

To answer the usefulness question: it’d be super important for eigenvalue problems, though I don’t know how common these are in models. Maybe they’d be more common if they were faster. I think the eigenvalues_sym function and such are primarily used in the Riemannian stuff, and there it’s higher-order autodiff (maybe that explains why this never got a fast reverse-mode implementation?).

The detail I haven’t investigated here is duplicate eigenvalues. I dunno if the derivatives fall apart there or what.
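The thing I’d worry about concretely: the eigenvector part of the vjp above divides by the eigenvalue gaps w_j − w_i, so the F factor grows without bound as two eigenvalues coalesce. A made-up NumPy illustration:

```python
import numpy as np

# Two nearly-duplicate eigenvalues with a gap of 1e-12.
w = np.array([1.0, 1.0 + 1e-12, 3.0])
N = len(w)
w_rep = np.repeat(w[:, np.newaxis], N, axis=1)
off_diag = np.ones((N, N)) - np.eye(N)
F = off_diag / (w_rep.T - w_rep + np.eye(N))  # F[i, j] = 1 / (w[j] - w[i])
print(np.abs(F).max())   # ~1e12: the eigenvector vjp blows up as the gap closes
```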

0 Likes

#5

This paper contains all of the information. I’m not really sure what to do: in practice I’d expect non-simple eigenvalues to be fairly rare (just because of numerical error), so it’s not completely clear to me how much we have to worry about this.

Anyway, the simple case would be easy to program up explicitly (the formulas are in the Giles paper).
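For reference, here is my transcription of the symmetric-case adjoint from the Giles paper (valid only for distinct eigenvalues; conventions may differ by a symmetrization of the result). With $A = U \Lambda U^T$:

$$
\bar{A} = U\left(\bar{\Lambda} + F \circ \left(U^T \bar{U}\right)\right) U^T,
\qquad
F_{ij} =
\begin{cases}
(\lambda_j - \lambda_i)^{-1}, & i \neq j,\\
0, & i = j,
\end{cases}
$$

where $\circ$ is the elementwise product. This is exactly what the autograd snippet earlier in the thread computes.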

Maybe someone like @betanalpha has feelings? At least about how we can test how important the non-simple case is.

2 Likes

#6

My understanding is that in the case of degenerate, but real, eigenvalues, the typical eigendecomposition algorithms will just return an arbitrary set of orthogonal vectors spanning the degenerate subspace, and the derivatives will then be with respect to that arbitrary basis but otherwise well-defined.

In the Riemannian code I don’t use stan::math::eigenvalue_sym but rather use the Eigen calls directly and then implement the gradients as described in https://arxiv.org/pdf/1212.4693.pdf. These might not be applicable to the general case, however, since I compute the gradients of the quadratic form and log determinant directly, and the degeneracy might not propagate through those operations.
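For context, the two derivatives mentioned there follow from standard matrix-calculus identities: the derivative of log det A in a symmetric direction D is tr(A^-1 D), and the derivative of x^T A^-1 x is -x^T A^-1 D A^-1 x. A quick NumPy finite-difference check of both (my own sketch, not the Riemannian code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)      # symmetric positive-definite "metric"
D = rng.standard_normal((n, n))
D = (D + D.T) / 2                # symmetric perturbation direction dA/dt
x = rng.standard_normal(n)
Ainv = np.linalg.inv(A)

# Analytic directional derivatives at t = 0.
g_logdet = np.trace(Ainv @ D)            # d/dt log det(A + t D)
g_quad = -x @ Ainv @ D @ Ainv @ x        # d/dt x^T (A + t D)^{-1} x

# Central finite differences for comparison.
eps = 1e-6
fd_logdet = (np.linalg.slogdet(A + eps * D)[1]
             - np.linalg.slogdet(A - eps * D)[1]) / (2 * eps)
fd_quad = (x @ np.linalg.solve(A + eps * D, x)
           - x @ np.linalg.solve(A - eps * D, x)) / (2 * eps)

print(abs(g_logdet - fd_logdet), abs(g_quad - fd_quad))  # both tiny
```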

1 Like

#7

Yes. They’re just autodiffing through the iterative algorithm. The right thing to do is to replace it with a reasonable matrix algorithm using the generalized adjoint-vector product.

0 Likes