Max likelihood time complexity as function of number of parameters

Jozsef_Hegedus · January 23, 2019, 9:16am

Hi,

A gedanken-experiment :

What would be the time complexity of one optimisation step (using some greedy “down hill” method) if I would like to do max likelihood with STAN on a single layer neural network (one input/one output) with M neurons ? For a fixed amount of input output pairs, N ?

Would it be O(M^3) ?

Cheers

Jozsef

bbbales2 · January 24, 2019, 5:55pm

Check out the math paper for details on this stuff: https://arxiv.org/abs/1509.07164

Stan uses the same sorta algorithms that everyone else does – so the theoretical scalings should be comparable between frameworks. The split is forward mode vs. reverse mode autodiff.

Forward mode scales better for problems with few inputs, reverse for problems with few outputs. When sampling a log density in HMC, you’ve just got one output so reverse mode autodiff works nicely. Same argument for optimization problems.

There’s probably room for confusion on how complexity is written out. One gradient descent step takes one reverse mode autodiff pass, but the things you do inside that pass could be complicated things. Depends on what you’re looking – if you care about the calculations inside your model or just how the autodiff is arrange.

edit: Edited to clarify a couple things, hope that helps!

Jozsef_Hegedus · January 25, 2019, 11:45pm

@bbbales2

Wow, this is very cool !

It sounds like a free monad :)

Oppps… http://h2.jaguarpaw.co.uk/posts/symbolic-expressions-can-be-automatically-differentiated/

http://blog.sigfpe.com/2017/06/

this Haskell is pretty good for these things, or not ?

:) apparently it is ? lol …

Anyway… interesting stuff !

Thanks for pointing me towards this paper !

Hmm… I found this paper, looked really interesting, then I got disappointed because there seems to be no working code anywhere : http://mlg.eng.cam.ac.uk/adam/icfp2018.pdf :(

So that is basically a paper describing what ppl should do but … it seems that other than saying what ppl should do… there is not much stuff there which actually works in practice… so probably I do not look really into that direction any more.

Regards,

Jozsef

maxbiostat · January 26, 2019, 11:35am

What?

bbbales2 · January 28, 2019, 12:49am

If you want to compute the Jacobian of a function with M outputs and N inputs, it’s M reverse mode passes or N forward mode passes.

maxbiostat · January 28, 2019, 1:44am

What I meant is that you had “few” in there twice, so couldn’t really parse what you were trying to say. From your message now it seems you pick the mode according to the smaller of N and M.

Bob_Carpenter · February 22, 2019, 4:34pm

That’s what we’d call a logistic regression. David McKay, in his info theory book, presents logistic regression as classification with a single neuron.

If there are N observations with M continuous predictors (“neurons”) and a single binary output y then you have a data matrix x of size N \times M, and a coefficient vector \beta of size M, and a binary observation array y you need to evaluate the log likelihood function

\mathcal{L}(\theta) = \log \mbox{bernoulli}(y \mid \mbox{logit}^{-1}(x \, \beta))

The matrix-vector multiply requires N \times M multiply-add operations, the inverse logit requires N additions, subtractions, and exponentiations, and the Bernoulli requires N conditionals and N logarithms. It’s not quite done that way because have a bernoulli_logit function that composes the operations and provides more stable arithmetic and unfolded derivatives. But at worst, it’s \mathcal{O}(N \times M). The derivatives require the same number of operations.

Bob_Carpenter · February 22, 2019, 4:37pm

Since the paper that you cite, Gharamani’s been involved in both Turing for the Julia language and Pyro which is built on top of PyTorch.

Bob_Carpenter · February 22, 2019, 4:38pm

Forward mode is also faster in theory, but less developed in our implementation. So there’d have to be a multiple in there that would change depending on the problem.

There’s also the issue of what you want to compute in the end. We sometimes want things like Hessian-vector products or directional derivatives, which changes the way we want to calculate.

Topic		Replies	Views
Optimization for computing likelihood Developers techniques	6	550	July 27, 2022
Decomposing models, fitting a simpler one and using the fitted simple model to fit the mode complex one General	6	518	January 29, 2019
Fine tuning for polynomial Posteriors Modeling performance	1	38	December 3, 2024
Forward algorithm for hidden Markov models General hmm	11	2984	April 13, 2018
Any benefit to supplying gradient myself? Modeling	10	1050	June 12, 2018

Max likelihood time complexity as function of number of parameters

Related topics