What would be the time complexity of one optimisation step (using some greedy “downhill” method) if I wanted to do maximum likelihood with Stan on a single-layer neural network (one input, one output) with M neurons? Assume a fixed number of input/output pairs, N.
Stan uses the same sort of algorithms that everyone else does, so the theoretical scalings should be comparable across frameworks. The key split is forward-mode vs. reverse-mode autodiff.
Forward mode scales better for problems with few inputs, reverse mode for problems with few outputs. When sampling a log density with HMC, you’ve got just one output, so reverse-mode autodiff works nicely. The same argument applies to optimization problems.
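To spell out the scaling (these are the standard autodiff cost results, not anything specific to Stan’s implementation; the K’s are my notation): for a function f with K_{\text{in}} inputs and K_{\text{out}} outputs, getting the full Jacobian costs roughly

K_{\text{in}} \cdot c \cdot \text{cost}(f) \quad \text{(forward mode)} \qquad \text{vs.} \qquad K_{\text{out}} \cdot c \cdot \text{cost}(f) \quad \text{(reverse mode)}

for some small constant c, so for a scalar log density (K_{\text{out}} = 1) reverse mode delivers the entire gradient in a single sweep.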
There’s probably room for confusion in how the complexity is written out. One gradient descent step takes one reverse-mode autodiff pass, but the work done inside that pass can itself be expensive. It depends on what you’re looking at: whether you care about the calculations inside your model or just how the autodiff is arranged.
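Putting rough numbers on that (my accounting, using the usual cheap-gradient property of reverse mode): if one evaluation of the model’s log density costs C, then

\text{cost(one gradient step)} \;\approx\; \underbrace{C}_{\text{forward sweep}} \;+\; \underbrace{c \cdot C}_{\text{reverse sweep}} \;+\; \underbrace{\mathcal{O}(M)}_{\text{parameter update}} \;=\; \mathcal{O}(C)

for a small constant c, so the per-step complexity is driven by whatever it costs to evaluate the model’s log density once.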
edit: Edited to clarify a couple things, hope that helps!
So Haskell is pretty good for these things, or not?
:) apparently it is? lol…
Anyway… interesting stuff!
Thanks for pointing me towards this paper!
Hmm… I found this paper and it looked really interesting, but then I got disappointed because there seems to be no working code anywhere: http://mlg.eng.cam.ac.uk/adam/icfp2018.pdf :(
So it is basically a paper describing what people should do, but beyond that there does not seem to be much there that actually works in practice, so I probably won’t look into that direction any more.
What I meant is that you had “few” in there twice, so I couldn’t really parse what you were trying to say. From your message now it seems you pick the mode according to the smaller of N and M.
That’s what we’d call a logistic regression. David MacKay, in his information theory book, presents logistic regression as classification with a single neuron.
If there are N observations with M continuous predictors (“neurons”) and a single binary output, then you have an N \times M data matrix x, a coefficient vector \beta of size M, and a binary observation array y of size N, and you need to evaluate the log likelihood function.
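Written out explicitly (my reconstruction from the operations listed next), that log likelihood is

\log p(y \mid x, \beta) \;=\; \sum_{n=1}^{N} \log \text{Bernoulli}\!\left(y_n \,\middle|\, \text{logit}^{-1}\!\big((x \beta)_n\big)\right)

where (x \beta)_n is the n-th entry of the matrix-vector product.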
The matrix-vector multiply requires N \times M multiply-add operations, the inverse logit requires N additions, subtractions, and exponentiations, and the Bernoulli requires N conditionals and N logarithms. It’s not quite done that way in practice, because we have a bernoulli_logit function that composes the operations and provides more stable arithmetic and unfolded derivatives. But at worst, it’s \mathcal{O}(N \times M). The derivatives require the same number of operations.
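For concreteness, here’s a minimal Stan model along those lines (my sketch, not code from this thread, written in the current array syntax), using the bernoulli_logit composition mentioned above:

```stan
data {
  int<lower=0> N;                    // number of observations
  int<lower=0> M;                    // number of predictors ("neurons")
  matrix[N, M] x;                    // data matrix
  array[N] int<lower=0, upper=1> y;  // binary outcomes
}
parameters {
  vector[M] beta;                    // coefficient vector
}
model {
  // bernoulli_logit composes the inverse logit with the Bernoulli log
  // density; evaluating it and its gradient is O(N * M), dominated by
  // the matrix-vector product x * beta.
  y ~ bernoulli_logit(x * beta);
}
```

Each optimization or HMC step on this model then needs one \mathcal{O}(N \times M) gradient evaluation, as described above.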
Forward mode is also faster in theory, but it’s less developed in our implementation, so there’d have to be a constant multiple in there that changes depending on the problem.
There’s also the issue of what you want to compute in the end. We sometimes want things like Hessian-vector products or directional derivatives, which changes how we want to arrange the calculation.
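As a rough illustration of why that matters (standard autodiff facts, not a description of Stan’s internals): a directional derivative needs one forward-mode pass, and a Hessian-vector product can be obtained by differentiating that directional derivative again,

H v \;=\; \nabla_\theta\!\left( \nabla_\theta \log p(\theta)^{\top} v \right)

so both cost a small constant multiple of one log density evaluation, whereas forming the full Hessian would take \mathcal{O}(M) such passes.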