Understand the run time in Gaussian finite mixture models

performance

#1

Dear Stan users: I am conducting a simulation study to compare the same models fitted with 2 different likelihood functions on data matrices of varying size.
The first model is a single Gaussian model, which I fit to 3 data matrices of size 12 \times 336, 101 \times 336 and 201 \times 336, where the rows (12, 101, 201) are the number of observations per block and there are 336 blocks altogether. It is assumed that y_{it} \sim N(\mu_t, \sigma), where i is the observation index and t is the block index.
I then fit this model to each data size (12, 101 and 201) 50 times, each time drawing 1000 posterior samples (running 4 chains with 500 iterations per chain) and recording the run time.
Here are my results:

N     Mean Time   (SE)
12    261.1882    (3.8304)
101   774.0944    (33.4542)
201   778.2316    (29.5691)

From what I understand, to estimate the parameters of a Gaussian distribution Stan would only need the sufficient statistic, which in this case would be the sample mean (the column mean of the data matrix), rather than calculating the log-likelihood at each observation. So with an increasing number of observations, I would expect the mean run time to stay roughly the same. However, increasing N from 12 to 101 gives a significant increase in run time, while a further increase to 201 adds relatively little. If my understanding is correct, can someone please suggest why there is such a large increase from N=12 to N=101, when all that is needed is a column mean?

Thank you so much for your suggestions and advice.


#2

Stan doesn’t do sufficient statistics automatically like that unfortunately. There’s a pull request coming down the line that’ll let folks with certain models take advantage of them (https://github.com/stan-dev/stan/pull/2441), but for now if you want to take advantage of sufficient statistics you’ll need to do it yourself.
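To make "do it yourself" concrete, here is a minimal sketch in plain Python (not Stan code, and not what the linked pull request implements) of the identity that such a rewrite would rely on. The function names are made up for illustration. Note that when \sigma is also unknown, the sample mean alone is not enough: the sum of squared deviations is needed as well.

```python
import math
import random

def normal_loglik_pointwise(y, mu, sigma):
    """Sum the normal log density over every observation: O(n) per evaluation."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2
        for v in y
    )

def sufficient_stats(y):
    """Computed once, before sampling: n, sample mean, sum of squared deviations."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)
    return n, ybar, ss

def normal_loglik_suffstat(n, ybar, ss, mu, sigma):
    """Same log-likelihood from the three summaries: O(1) per evaluation."""
    return (
        -0.5 * n * math.log(2 * math.pi)
        - n * math.log(sigma)
        - (ss + n * (ybar - mu) ** 2) / (2 * sigma ** 2)
    )

# Hypothetical data for one block with 201 observations.
random.seed(1)
y = [random.gauss(2.0, 1.5) for _ in range(201)]
n, ybar, ss = sufficient_stats(y)

# The two forms agree up to floating-point error for any (mu, sigma).
diff = normal_loglik_pointwise(y, 1.7, 1.2) - normal_loglik_suffstat(n, ybar, ss, 1.7, 1.2)
print(abs(diff) < 1e-8)
```

The pointwise version does work proportional to the number of observations at every log-density evaluation, while the summary version does a fixed amount of work per block once the three numbers have been computed.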

The performance of this model will also be heavily influenced by how many parameters you are fitting. So it’s not just the cost of the extra calculations or whatever, the exploration itself will be different. It’s hard to really nail down performance because there’s a lot of stuff interacting.


#3

The chapter on efficiency in the manual (user’s guide from 2.18 on) explains how to do this in some cases.

As @bbbales2 points out, we will be releasing some compound functions that do this internally. It’s a particularly big win for some GLMs.


#4

Having said that, there’s no good way to compute the sufficient statistics for a mixture. For instance, you can run the forward algorithm for HMMs, but just by doing so, you compute all the necessary derivatives, so there’s no point in running backward to collect sufficient statistics. To do that efficiently, we need to get into the guts of the C++ implementation and pull the double values out of the autodiff types and build analytic partials.
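As a rough illustration of why the Gaussian shortcut does not carry over, here is a small plain-Python sketch (not Stan code; the function names and the toy numbers are made up for illustration). The mixture log-likelihood takes a log-sum-exp over components for each observation, so it does not reduce to a function of the sample mean and sum of squares. The two toy data sets below share the same size, mean and sum of squared deviations, yet get different mixture log-likelihoods.

```python
import math

def normal_lpdf(v, mu, sigma):
    """Log density of a normal distribution at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def log_sum_exp(terms):
    """Numerically stable log(sum(exp(terms)))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def mixture_loglik(y, weights, mus, sigmas):
    """Finite Gaussian mixture log-likelihood: one log-sum-exp per observation."""
    total = 0.0
    for v in y:
        total += log_sum_exp([
            math.log(w) + normal_lpdf(v, m, s)
            for w, m, s in zip(weights, mus, sigmas)
        ])
    return total

# Two toy data sets with identical n = 4, mean = 0 and sum of squares = 4.
r = math.sqrt(2.0)
y_a = [-1.0, -1.0, 1.0, 1.0]
y_b = [-r, 0.0, 0.0, r]

# Hypothetical two-component mixture.
weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [0.5, 0.5]
ll_a = mixture_loglik(y_a, weights, mus, sigmas)
ll_b = mixture_loglik(y_b, weights, mus, sigmas)
print(ll_a, ll_b)
```

Under a single Gaussian these two data sets would be indistinguishable, since the likelihood depends on the data only through n, the mean and the sum of squares. Under the mixture they are not, which is why the per-observation terms have to be evaluated.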


#5

Thank you for your reply. I will take a look at the relevant sections in the manual to understand it better.


#6

Sorry for replying again, but I cannot seem to find the 2.18 manual on the website; it is still 2.17. Has it not been released yet? Could you please let me know when it will be released? Thank you.


#7

I think he meant the 2.17 manual has this information (https://github.com/stan-dev/stan/releases/download/v2.17.0/stan-reference-2.17.0.pdf), but when the 2.18 docs come out, it’ll be in a different place.

So depending on when you get around to looking at this you might end up looking in different places.