Maximum likelihood time complexity as a function of the number of parameters


#1

Hi,

A gedanken experiment:

What would be the time complexity of one optimisation step (using some greedy “downhill” method) if I wanted to do maximum likelihood with Stan on a single-layer neural network (one input, one output) with M neurons, for a fixed number of input/output pairs N?

Would it be O(M^3)?
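
To make the setup concrete, here is a rough sketch of the log likelihood I have in mind, written in JAX rather than Stan just for illustration (the tanh activation and the Gaussian noise are only my assumptions), so roughly 3M + 1 parameters in total:

```python
import jax.numpy as jnp

def log_likelihood(params, x, y, sigma=1.0):
    """Single hidden layer with M neurons, one input and one output per pair."""
    w, b, v, c = params                      # w, b, v: shape (M,); c: scalar
    hidden = jnp.tanh(jnp.outer(x, w) + b)   # (N, M): hidden activations
    y_hat = hidden @ v + c                   # (N,): one prediction per pair
    # Gaussian log likelihood (up to an additive constant)
    return -0.5 * jnp.sum(((y - y_hat) / sigma) ** 2)
```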

Cheers

Jozsef


#2

Check out the Stan math paper for details on this stuff: https://arxiv.org/abs/1509.07164

Stan uses the same sort of algorithms that everyone else does, so the theoretical scalings should be comparable across frameworks. The main split is between forward-mode and reverse-mode autodiff.

Forward mode scales better for problems with few inputs, reverse for problems with few outputs. When sampling a log density in HMC, you’ve just got one output so reverse mode autodiff works nicely. Same argument for optimization problems.

There’s probably room for confusion in how the complexity is written out. One gradient descent step takes one reverse-mode autodiff pass, but the things you do inside that pass could themselves be complicated. It depends on what you’re looking at: whether you care about the calculations inside your model or just how the autodiff is arranged.
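
Just to make the pass-counting concrete, here’s a rough sketch in JAX (not Stan’s autodiff, and the toy log density below is made up): the gradient with respect to all the parameters comes out of a single reverse-mode pass, and that pass costs a small constant multiple of one evaluation of the log density, however many parameters there are.

```python
import jax
import jax.numpy as jnp

def log_density(theta):
    # Stand-in for a model's log density: many inputs, one scalar output.
    return -0.5 * jnp.sum(theta ** 2) + jnp.sum(jnp.sin(theta))

grad_log_density = jax.grad(log_density)   # whole gradient from one reverse-mode pass

theta = jnp.ones(1000)                               # 1000 parameters, still one pass
theta_new = theta + 1e-3 * grad_log_density(theta)   # one gradient step on the log density
```

Whatever work sits inside that one pass (the matrix multiplies, the loop over your N data points) is where the M and N dependence of your model actually shows up.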

edit: Edited to clarify a couple of things, hope that helps!


#3

@bbbales2

Wow, this is very cool!

It sounds like a free monad :)

Oops… http://h2.jaguarpaw.co.uk/posts/symbolic-expressions-can-be-automatically-differentiated/

http://blog.sigfpe.com/2017/06/

So Haskell is pretty good for these things, or is it not?

:) Apparently it is? lol…

Anyway… interesting stuff!

Thanks for pointing me towards this paper !

Hmm… I found this paper, which looked really interesting, but then I got disappointed because there seems to be no working code anywhere: http://mlg.eng.cam.ac.uk/adam/icfp2018.pdf :(

So it is basically a paper describing what people should do, but beyond saying what people should do there does not seem to be much that actually works in practice, so I probably won’t look further in that direction.

Regards,

Jozsef


#4

What?


#5

If you want to compute the Jacobian of a function with M outputs and N inputs, it’s M reverse mode passes or N forward mode passes.
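
A quick JAX sketch of that count (the function and the shapes below are just made up for illustration): both calls build the same (M, N) Jacobian, one from M reverse-mode passes and the other from N forward-mode passes, so you pick whichever needs fewer.

```python
import jax
import jax.numpy as jnp

A = jnp.arange(12.0).reshape(3, 4)   # arbitrary fixed matrix

def f(x):
    return jnp.tanh(A @ x)           # M = 3 outputs, N = 4 inputs

x = jnp.ones(4)
J_rev = jax.jacrev(f)(x)   # Jacobian assembled from M reverse-mode passes
J_fwd = jax.jacfwd(f)(x)   # same Jacobian assembled from N forward-mode passes
```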


#6

What I meant is that you had “few” in there twice, so I couldn’t really parse what you were trying to say. From your latest message it now seems you pick the mode according to the smaller of N and M.