Mixed models with deterministic and random variables

optimization
variational-bayes

#1

Hi,

I am a relatively new Stan user. I was wondering whether Stan supports models with mixed deterministic and random variables, either directly or via a good hack. To be more specific, let’s say my log joint is a function of $\theta$ and $\psi$, and I want to know the MAP over $\theta$ while also knowing the posterior distribution of $\psi$. In the ADVI sense, I want to be able to maximize $ELBO(\theta, \xi_\psi, \sigma_\psi)$, where $\theta$ is left as is and $(\xi_\psi, \sigma_\psi)$ denote the mean and std of the approximate posterior of $\psi$ in the unconstrained space.

Background: I have a model with mixed continuous and discrete latents. The discrete latents appear in a Markov chain so that integrating them out is intractable (i.e. the number of paths is exponential in the chain length). As an approximation, I thought I’d directly specify an ELBO under the assumption that the posterior is factorized over each node of the chain. The ELBO will be a functional of the posterior distribution of each discrete variable, as well as the rest of the continuous variables.

In the spirit of ADVI, one could simply optimize the “hard-coded” ELBO over an unconstrained parametrization of the simplexes, and the unconstrained parameters of the Gaussians that approximate the posterior of the continuous variables. However, if one implements this naively in Stan, the simplexes will not be treated deterministically; rather, they will also be represented by yet more Gaussians in an unconstrained space. Am I approaching this problem in a wrong manner?

ps> in other words, can I somehow mark some variables (let’s say unconstrained) in my model parameters to be treated as hyperparameters? from a variational inference perspective, this seems like a natural thing to do. From the sampling perspective, however, I’m not sure what this means.

pps> perhaps my best shot is to break down the problem into two steps: (1) a forward-backward step on the chain given the posterior of the continuous latents, and (2) optimizing the posterior of the continuous variables given the posterior of the discrete variables. Alternating between these two steps yields the best posterior factorized over continuous and discrete sectors. My original question still holds though: why can’t we simply extend ADVI to treat discrete variables? a discrete distribution defined on a finite set of size $S$ can be represented uniquely with $S-1$ unconstrained real variables and these can be taken as variational parameters.


#2

The answer to your basic question is no, we don’t have ways to do what you’re asking.

Discrete latent Markov chain states aren’t a problem — the forward algorithm marginalizes Markov chain states in linear time in O(N * K^2) where N is the size of the chain and K is the number of states. There’s an implementation in the manual for HMMs. But then you mention forward-backward later, so presumably you know this, so maybe I’m answering the wrong question.

Andrew and Sean are working on a max marginal likelihood / max marginal posterior mode solver, which is indeed related to ADVI, but it’s not available yet.

If you really need to start building custom samplers for components of your model, you might want to look at something like Edward or PyMC3.


#3

Hi Bob,

Many thanks for your reply! I will look into Edward and PyMC3. So far, I have had great success with Stan though! :-)

Regarding marginalizing Markov chain hidden states via the forward algorithm – correct me if I’m wrong, but I believe the marginalization is slightly more involved and requires both forward/backward message passing, as well as a recipe for calculating the entropy of the hidden state posteriors. Let’s say $\theta$ is a bundle of continuous variables, $z$ is the chain of hidden states, and $x$ is the observed data. We have the following exact identity:

$$\log p(\theta | x) = E_{q(z | \theta, x)}[\log p(\theta, z | x)] - H[q(z | \theta, x)]$$

where $q(c | \pi, x)$ is the conditional posterior of the hidden states. The expectation term can be calculated in $O(N * K^2)$ time via forward-backward. The entropy term can also be calculated via a dynamic programming trick with a similar time complexity [Hernando et al., IEEE Transactions on Information Theory, Vol. 51, No. 7, July 2005]. So, the gradient calculation time is indeed $O(N * K^2)$ as you also mentioned.

It would be very tidy if Markov chain marginalizations could be encapsulated in a generic Stan function that would automatically add both contributions to __lp. I am new to user-defined Stan functions, but I suppose this is doable by writing functions that end with “_lp” according to the manual. I wonder: (1) whether Stan performs such calculations on the computational graph (similar to model statements) and performs efficient autodiff? in other words, could one think of “_lp” functions as generic model statements? also, (2) does Stan pass variables by reference to _lp functions?

ps> I would be glad to contribute this generic HMM-marginalizer to Stan if it doesn’t exist yet. It would be quite handy for working with sequences (NLP, genomics, etc) and in hierarchical models dealing with multiple Markov chains.

Best,
Mehrtash


#4

The forward algorithm is all you need to compute the likelihood, which in turn, is all you need to code in Stan. There’s an example in the manual.

In some conjugate cases, you can use the forward-backward algorithm to marginalize and then use the resulting marginals to more efficiently update emission and transition distributions.

I didn’t understand (1). If you’re asking whether functions behave like everything else, then the answer is unequivocally yes. The functions ending in _lp can add to the underlying log density; functions ending in _lpdf can be used in sampling statements for continuous variables and those ending in _lpmf can be used in sampling statements for discrete variables.

As to (2), yes, Stan passes everything by constant reference.

As to the PS, we welcome contributions. We don’t really have a distribution mechanism in place for functions written in Stan itself, but we now have an include mechanism which should make it much more feasible.

The bigger issue is how to encapsulate all this. I’d like to write some HMM functionality into core Stan, but many realistic applications involve things like covariates, which complicate the underlying structure of a simple transition matrix.


#5

HI Bob,

Perhaps I’m thinking about it in the wrong way. What I wrote up there was a general recipe for calculating the log likelihood with $z$ integrated out ($z$ could be an arbitrary latent variable). In the case of Markov chains, perhaps there’s a shortcut for doing that with just the forward algorithm but that eludes me. I need to read about it.

Best,
Mehrtash


#6

It’s not complicated. Assuming N inputs y[1:N] with K states, the last column of forward values, alpha[N, 1:K], has entries

alpha[N, k] = log p(y[1:N], z[N] = k)

The law of total probablity then ensures that the likelihood is just the log sum of exponentials of that final column,

log p(y[1:N]) = log_sum_exp(alpha[N, ])

#7

I should also mention there’s an example in the manual.


#8

Hi Bob,

Uh, you’re absolutely right! I was killing the mosquito with a sledgehammer :-)

Best,
Mehrtash