Using log_mix with lupdf - does that make sense?

I’m trying to understand how I can use the very convenient log_mix function.
The Stan manuals have an example using x_lpdf functions to compute the likelihood of each component.
Does it make sense also to use x_lupdf functions, dropping constants?

I’m not quite sure how log_mix balances the different likelihoods, hence the question.

Thanks!


Hi,
log_mix is actually conceptually reasonably easy, if we work on the probability scale, not on the log scale. So if we want to express a mixture model on the probability/density scale, we have a mixing probability \theta and an indicator binary variable z which is 1 (with probability \theta) or 2 (with probability 1 - \theta) to indicate which of the two components the observation came from, i.e.:

\mathrm{P}(Y = y | \theta) = \theta \, \mathrm{P}(Y = y | z = 1) + (1 - \theta) \, \mathrm{P}(Y = y | z = 2)

Now we need to move to the log scale, where Stan (for good reasons) works. If we say that \lambda_1 = \log \mathrm{P}(Y = y | z = 1) and \lambda_2 = \log \mathrm{P}(Y = y | z = 2), then we get Stan’s log_mix as:

\begin{eqnarray*} \mathrm{log\_mix}(\theta, \lambda_1, \lambda_2) = \log \mathrm{P}(Y = y | \theta) & = & \log \!\left( \theta \exp(\lambda_1) + \left( 1 - \theta \right) \exp(\lambda_2) \right) \\[3pt] & = & \mathrm{log\_sum\_exp}\!\left(\log(\theta) + \lambda_1, \ \log(1 - \theta) + \lambda_2\right). \end{eqnarray*}
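To make the identity concrete, here is a tiny self-contained check you could drop into a generated quantities block (the numbers are made up, purely for illustration); log1m(theta) is just a numerically stable log(1 - theta):

real theta = 0.3;
real lambda1 = normal_lpdf(1.5 | 0, 1);  // log density of y = 1.5 under component 1
real lambda2 = normal_lpdf(1.5 | 3, 1);  // log density of y = 1.5 under component 2
// The two expressions below are mathematically identical:
real a = log_mix(theta, lambda1, lambda2);
real b = log_sum_exp(log(theta) + lambda1, log1m(theta) + lambda2);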

I made some slight abuses of notation above, but I hope the idea is clear enough - feel free to ask for clarifications.

Next, I will assume that what you want to do is directly plug the result of log_mix into the model log density, e.g. target += log_mix(something);. We know that Stan only needs the target up to an additive constant, so the question is: "If I use x_lupdf instead of x_lpdf, will the result of log_mix only change by a constant?"
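For concreteness, here is a minimal sketch of such a model, a two-component normal mixture (all names and priors are made up for illustration):

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;
  ordered[2] mu;        // ordered to help identify the two components
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], sigma),
                      normal_lpdf(y[n] | mu[2], sigma));
}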

So what happens when we use lupdf and thereby modify \lambda_1 and \lambda_2 by constants c_1 and c_2? Using lupdf would be safe if the difference between the two versions is itself a constant:

d = \mathrm{log\_mix}(\theta, \lambda_1, \lambda_2) - \mathrm{log\_mix}(\theta, \lambda_1 + c_1, \lambda_2 + c_2) = \\ \log \!\left( \theta \exp(\lambda_1) + \left( 1 - \theta \right) \exp(\lambda_2) \right) - \log \!\left( \theta \exp(c_1)\exp(\lambda_1) + \left( 1 - \theta \right) \exp(c_2)\exp(\lambda_2) \right)

If we can assume c = c_1 = c_2, then we can factor \exp(c) out of the second logarithm and the expression simplifies to d = -c. The difference is constant, and the posterior distribution of the model parameters will be the same. But if c_1 \neq c_2, then d depends on all of \theta, \lambda_1 and \lambda_2, and the posterior will differ.
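Spelling that simplification out, with M = \theta \exp(\lambda_1) + (1 - \theta) \exp(\lambda_2) for brevity:

d = \log M - \log\!\left( \exp(c) \, M \right) = \log M - c - \log M = -c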

So the TL;DR is: using x_lpdf with log_mix is always safe; swapping in x_lupdf is safe only if the constants that are dropped are identical for both terms. This is a) unlikely to hold for practical models, b) hard to check unless you understand a lot about how the functions are implemented, and c) liable to change when the implementations change with updates to Stan. Using x_lupdf with log_mix should thus be avoided unless you really know what you are doing and really need to squeeze out the last bits of performance - there are usually many safer modifications that will speed up your model much more than using x_lupdf.

Also note that if you want to do model comparison with loo or some other fancy post-processing that uses the samples of posterior log-density, you actually cannot omit the constants anyway.

EDIT: The above statement was misleading, see below.

Best of luck with your modelling work!


I just happened to read this. Are you sure that this is the case? So people using the tilde notation shouldn’t run loo?
The tilde notation is equivalent to lupdf, right?


Sorry, you are right - the way I wrote it was misleading. For loo the critical part is the per-point log likelihood. This is usually computed in generated quantities and does not use the target directly, so it does not matter whether the target includes all the constants. (The per-point log likelihood itself, AFAIK, should include all the constants, or you need to make sure the dropped constants are the same for all models you compare.)
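For example, for the mixture sketch I posted above, the per-observation log likelihood for loo would be computed with the full _lpdf versions like this:

generated quantities {
  vector[N] log_lik;  // per-observation log likelihood, to be passed to loo
  for (n in 1:N)
    log_lik[n] = log_mix(theta,
                         normal_lpdf(y[n] | mu[1], sigma),
                         normal_lpdf(y[n] | mu[2], sigma));
}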

Where you do need all the constants in target is when computing Bayes factors with bridgesampling, and I presume there are other cases where it’s important.

Sorry for any confusion.


Ok, right! The tilde notation is irrelevant for loo; what matters is how one stores the log-lik.
Sorry for my confusing statement as well!

Thank you, that was very informative!