Max likelihood time complexity as function of number of parameters


A gedanken experiment:

What would be the time complexity of one optimisation step (using some greedy "downhill" method) if I wanted to do maximum likelihood with Stan on a single-layer neural network (one input, one output) with M neurons, for a fixed number of input/output pairs N?

Would it be O(M^3)?



Check out the math paper for details on this stuff:

Stan uses the same sorta algorithms that everyone else does – so the theoretical scalings should be comparable between frameworks. The split is forward mode vs. reverse mode autodiff.

Forward mode scales better for problems with few inputs, reverse for problems with few outputs. When sampling a log density in HMC, you’ve just got one output so reverse mode autodiff works nicely. Same argument for optimization problems.
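To illustrate why a single output means a single reverse pass: here's a toy reverse-mode autodiff sketch in plain Python (an illustration only, not how Stan's math library is actually implemented). One backward sweep over the expression graph yields the gradient with respect to every input at once, which is exactly the shape of the HMC/optimization problem: many parameters in, one log density out.

```python
class Var:
    """Reverse-mode autodiff node: records its parents and the local
    partial derivative of itself with respect to each parent."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, list(parents), 0.0

    def __add__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.val + o.val, [(self, 1.0), (o, 1.0)])
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Var) else Var(o)
        return Var(self.val * o.val, [(self, o.val), (o, self.val)])
    __rmul__ = __mul__

def backward(out):
    """One reverse sweep fills in d(out)/d(node) for every node,
    no matter how many inputs there are."""
    order, seen = [], set()
    def visit(node):                       # topological order via DFS
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):           # propagate from output to inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

# toy "log density": two parameters in, one scalar out
x, y = Var(2.0), Var(5.0)
lp = x * x + 3.0 * x * y
backward(lp)   # one pass gives both x.grad and y.grad
```

With more parameters the graph grows, but it's still one backward sweep per scalar output, which is why reverse mode is the natural fit here.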

There’s probably room for confusion in how the complexity is written out. One gradient descent step takes one reverse mode autodiff pass, but the things you do inside that pass can themselves be complicated. It depends on what you’re looking at – whether you care about the calculations inside your model or just how the autodiff is arranged.

edit: Edited to clarify a couple things, hope that helps!


Wow, this is very cool!

It sounds like a free monad :)


Haskell is pretty good for these things, isn’t it?

:) apparently it is? lol…

Anyway… interesting stuff!

Thanks for pointing me towards this paper!

Hmm… I found this paper and it looked really interesting, but then I got disappointed because there seems to be no working code anywhere :(

So it is basically a paper describing what people should do, but beyond that there doesn’t seem to be much that actually works in practice. So I probably won’t look into that direction any more.




If you want to compute the Jacobian of a function with M outputs and N inputs, it’s M reverse mode passes or N forward mode passes.
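As a concrete sketch of the "N forward passes" half of that statement (toy code, not any framework's actual implementation): forward mode with dual numbers recovers the Jacobian one column per pass, seeding one input's tangent at a time.

```python
class Dual:
    """Forward-mode number: a value plus one directional derivative
    (the "tangent"). Each arithmetic op propagates both."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def f(x, y):
    # M = 2 outputs, N = 2 inputs
    return [x * y, x + y]

def jacobian_forward(f, point):
    """One forward pass per input: pass i seeds input i with dot = 1
    and reads off column i of the Jacobian from the output tangents."""
    n = len(point)
    cols = []
    for i in range(n):
        args = [Dual(v, 1.0 if j == i else 0.0)
                for j, v in enumerate(point)]
        cols.append([out.dot for out in f(*args)])
    return [list(row) for row in zip(*cols)]   # columns -> rows

J = jacobian_forward(f, [2.0, 3.0])   # Jacobian [[y, x], [1, 1]] at (2, 3)
```

The reverse-mode counterpart would instead do one backward sweep per *output*, seeding one output's adjoint at a time and reading off a row, hence the M-vs-N trade.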

What I meant is that you had “few” in there twice, so I couldn’t really parse what you were trying to say. From your message now it seems you pick the mode according to the smaller of N and M.


That’s what we’d call a logistic regression. David MacKay, in his information theory book, presents logistic regression as classification with a single neuron.

If there are N observations with M continuous predictors (“neurons”) and a single binary output y, then you have a data matrix x of size N \times M, a coefficient vector \beta of size M, and a binary observation array y, and you need to evaluate the log likelihood function

\mathcal{L}(\theta) = \log \mbox{bernoulli}(y \mid \mbox{logit}^{-1}(x \, \beta))

The matrix-vector multiply requires N \times M multiply-add operations, the inverse logit requires N additions, subtractions, and exponentiations, and the Bernoulli requires N conditionals and N logarithms. It’s not quite done that way in practice because we have a bernoulli_logit function that composes the operations and provides more stable arithmetic and unfolded derivatives. But at worst, it’s \mathcal{O}(N \times M). The derivatives require the same number of operations.
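To make the operation count concrete, here is a minimal pure-Python sketch of that log likelihood and its analytic gradient (an illustration only, not Stan’s bernoulli_logit implementation; the inner loops make the N \times M multiply-adds explicit):

```python
import math

def inv_logit(u):
    """Inverse logit link: maps a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def log_likelihood(x, y, beta):
    """Bernoulli-logit log likelihood: O(N * M) for the matrix-vector
    product, plus O(N) for the link and the log terms."""
    total = 0.0
    for xn, yn in zip(x, y):
        eta = sum(xj * bj for xj, bj in zip(xn, beta))  # M multiply-adds
        p = inv_logit(eta)
        total += math.log(p) if yn == 1 else math.log(1.0 - p)
    return total

def grad_log_likelihood(x, y, beta):
    """Analytic gradient x^T (y - p): again O(N * M), matching the
    claim that the derivatives cost the same order as the function."""
    g = [0.0] * len(beta)
    for xn, yn in zip(x, y):
        eta = sum(xj * bj for xj, bj in zip(xn, beta))
        p = inv_logit(eta)
        for j, xj in enumerate(xn):
            g[j] += (yn - p) * xj
    return g

# tiny made-up dataset: N = 2 observations, M = 2 predictors
x_data = [[1.0, 0.0], [0.0, 1.0]]
y_data = [1, 0]
beta = [0.0, 0.0]
ll = log_likelihood(x_data, y_data, beta)
g = grad_log_likelihood(x_data, y_data, beta)
```

At beta = 0 every probability is 0.5, so the log likelihood is 2 log(0.5) and the gradient points toward the observed labels, which is an easy sanity check on the arithmetic.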

Since the paper that you cite, Ghahramani has been involved in both Turing for the Julia language and Pyro, which is built on top of PyTorch.

Forward mode is also faster in theory, but less developed in our implementation. So there’d have to be a multiple in there that would change depending on the problem.

There’s also the issue of what you want to compute in the end. We sometimes want things like Hessian-vector products or directional derivatives, which changes the way we want to calculate.
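For example, a Hessian-vector product can be formed without ever materializing the M \times M Hessian. A crude way to see this (a toy sketch with a made-up function, not how Stan computes it) is a finite difference of the gradient in the direction of the vector:

```python
def grad(theta):
    # analytic gradient of the toy function f(a, b) = a**2 + 3*a*b
    a, b = theta
    return [2.0 * a + 3.0 * b, 3.0 * a]

def hvp(grad, theta, v, eps=1e-6):
    """Hessian-vector product H @ v without forming H: finite
    difference of the gradient in direction v. Costs two gradient
    evaluations instead of M of them."""
    g0 = grad(theta)
    g1 = grad([t + eps * vi for t, vi in zip(theta, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]

# Hessian of f is [[2, 3], [3, 0]], so H @ [1, 0] should be [2, 3]
h = hvp(grad, [2.0, 5.0], [1.0, 0.0])
```

In practice one would use nested autodiff (forward-over-reverse) rather than finite differences, but the point stands: the quantity you actually need changes which passes are worth doing.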