Traversing up a graph (Hierarchical Hidden Markov Model)

Hi there!

I'm trying to write Stan code for a Hierarchical Hidden Markov Model. Basically, it's a graph where every node except the bottom ones is an HMM in itself: each node has some children, an initial distribution vector, and a transition matrix. The initial distribution vector decides which child is entered next, and the children then transition among themselves until an end node (double-circled) is hit. The end node returns control to its parent.

The generalized forward probability is computed recursively, bottom-up. In a traditional HMM the transition matrix is enough to link all the siblings, but in this case I also need to traverse the nodes upward.

Using OOP, I'd just have an object with references to its children and parent. Otherwise, I could use an adjacency list. But I can't do either in Stan.

Can you suggest a natural way to solve this in Stan?

Would it be possible to represent the transition matrix as a sparse matrix? You could parameterize this as a vector in the parameters block, specify which parameters correspond to which matrix elements in the data block, and then transform the parameters into the sparse matrix by copying them into an empty matrix in the transformed parameters block.
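Something along these lines, as a minimal sketch (all the names here are made up, and in a real model the nonzero entries of each row would also have to form a simplex rather than independent probabilities):

```stan
data {
  int<lower=1> n_states;   // total number of states
  int<lower=1> n_nonzero;  // number of estimated (nonzero) entries
  array[n_nonzero] int<lower=1, upper=n_states> row_idx;  // row of each entry
  array[n_nonzero] int<lower=1, upper=n_states> col_idx;  // column of each entry
}
parameters {
  vector<lower=0, upper=1>[n_nonzero] theta;  // free transition probabilities
}
transformed parameters {
  // start from an all-zero matrix and scatter the parameters into it
  matrix[n_states, n_states] A = rep_matrix(0, n_states, n_states);
  for (m in 1:n_nonzero)
    A[row_idx[m], col_idx[m]] = theta[m];
}
model {
  // ... use A in the likelihood (e.g. a forward algorithm) here ...
}
```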

Hey Aarong!

If I understand correctly, you're talking about a ragged matrix. My model has 26 transition matrices of different sizes (only 11 are shown in the plot). I could use a full database representation to produce ragged matrices by subsetting a long vector according to a vector of "indices", something like the sketch below.
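Just to illustrate the subsetting idiom (all names invented):

```stan
data {
  int<lower=1> n_matrices;                // e.g. my 26 transition matrices
  array[n_matrices] int<lower=1> n_rows;  // size of each (square) matrix
  int<lower=1> n_total;                   // sum of n_rows[j]^2
  vector[n_total] flat;                   // all the entries, concatenated
}
transformed data {
  int pos = 1;
  for (j in 1:n_matrices) {
    // pull out the j-th ragged block and reshape it into a matrix
    matrix[n_rows[j], n_rows[j]] A_j
        = to_matrix(segment(flat, pos, n_rows[j] * n_rows[j]),
                    n_rows[j], n_rows[j]);
    pos += n_rows[j] * n_rows[j];
    // (A_j is local to this loop; a real model would use it on the spot)
  }
}
```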

My problem isn't ragged or sparse data structures; it's all about moving up and down the graph. I need a clever way to solve this recursively, since the probability of a node depends on the probability of its children (bottom-up).

I coded a generative model in R, and look how simple it is: just a few lines, since I can work with objects (or any kind of data structure).

How could I represent such a thing in Stan?

I was actually thinking of a sparse matrix. How many states in total do you have? And what precisely do you mean when you say "the probability of a node depends on the probability of its children"? Does the Markov property hold?

There are ~91 nodes in total, but only 63 can produce an observation (gray). We can assume that observations come in groups of 3, simplifying the problem to only 23 latent states. In theory I could find an equivalent sparse HMM representation for the whole graph, but I'd lose the interpretability given by the hierarchical organization of the states.

Is that what you mean? "Expanding" the HHMM transition matrices into a single large HMM with a sparse matrix that connects all the observation (gray) nodes? Do I just hardcode a zero in the elements of the matrix that correspond to unconnected nodes?

I don't follow the reduction from 91 to 23 states, but let's assume you are working with the original 91-state space and that you can use epsilon non-emissions. You are going to have some dense (or relatively dense) 26x26(?) matrix that represents the transitions between the nested states, each of which has its own matrix. You can then put these all together in a single 91x91, mostly sparse matrix with some block structure.

Say 500 elements of this matrix are being estimated and the rest are zero. As a practical matter, you would declare an array of 500 reals in the parameters block, then instantiate a 91x91 matrix of zeros in the transformed parameters block, assign the parameters into the appropriate locations, and use that matrix to calculate the probability density.
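Roughly like this (a sketch with invented names; it normalizes independent gamma weights row-wise, which makes each row a Dirichlet draw, and assumes every state has at least one outgoing edge):

```stan
data {
  int<lower=1> M;                           // the ~500 estimated entries
  array[M] int<lower=1, upper=91> row_idx;  // where each entry lives
  array[M] int<lower=1, upper=91> col_idx;
}
parameters {
  vector<lower=0>[M] theta;  // unnormalized transition weights
}
transformed parameters {
  matrix[91, 91] A = rep_matrix(0, 91, 91);
  for (m in 1:M)
    A[row_idx[m], col_idx[m]] = theta[m];
  for (k in 1:91)
    A[k] = A[k] / sum(A[k]);  // each row becomes a proper distribution
}
model {
  theta ~ gamma(1, 1);  // row-wise normalization then gives Dirichlet(1) rows
  // ... forward algorithm over A and the observations goes here ...
}
```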

This isn’t going to be the most memory efficient, but I think it will be a lot cleaner than dealing with a lot of smaller matrices and fighting index wars to make it work in Stan.

Aarong,

I think I understand your idea but I’m still unsure if this is exactly what I need.

There are five top nodes. SB has six nodes at the second level: five transition nodes and one end node. Each transition node has three production children plus one end node. In total, SB has 26 descendant nodes. WB, WU and SU have the same structure, although it's not shown in the picture for brevity. The R node has only four nodes at the second level with two children each, a total of ten nodes. Overall, there are 119 nodes: 28 transitions, 63 emissions and 28 end nodes. I simplified the 63 emissions into only 23 nodes because I grouped the observations in triplets.

Now, each internal node is denoted q_i^d, where d in {1, ..., D} is the level (1 being the root and D the leaves) and i in {1, ..., |q^d|} indexes the nodes within the level. Each node has a transition matrix a_{ij} (for horizontal transitions) and an initial probability vector pi (for vertical transitions).

The generalized forward probability is:
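(Roughly, in the notation of Fine et al.; the paper gives the exact recursion.)

$$
\alpha(t, t+k, q_i^d) = P(o_t \ldots o_{t+k},\; q_i^d \text{ finished at } t+k \mid q_i^d \text{ started at } t)
$$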

As you see, I need recursion to solve this. The forward probability of the bottom nodes (production) is trivial, but next I need to (1) sum over all possible horizontal transitions, and then (2) sum over all possible vertical transitions.

As per your suggestion, if I'm not getting it wrong, I'd be working with only one mostly sparse matrix that joins all emission nodes from all levels and hierarchies. The elements corresponding to non-emission nodes would just be zero. The other elements would estimate a sort of "joint" probability that summarises all the possible paths from one emission node to another (diagonal transitions). In that case, I wouldn't need an initial distribution vector anymore.

For example: the element of the matrix from node 1 to node 2 would be a zero, but the element from node 3 to node 4 would be the probability that an observation from node 4 comes after an observation from node 3, summed over all possible paths. Is that correct?

If I needed to compute the vertical or horizontal transition probabilities, wouldn't it suffice to normalize these "diagonal" probabilities for a given subset of nodes?


That’s correct. The initial probability vector pi would become part of this matrix. So the a_{ij} are going to go into the new matrix as is, and the pi vectors will go into the matrix as rows.

For example, if the first node is SB, then you have transition probabilities {p_sb, p_r, p_down}. In the expanded matrix this will be represented as {p_sb, p_r, p_down * pi_down_1, …, p_down * pi_down_k}.

You will still be estimating exactly as many parameters as before, but what you are thinking of as the initial distribution vector is really more of an ‘intermediate’ distribution vector, and becomes part of the matrix for computational convenience.
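In Stan terms, one expanded row might look like this (a toy sketch; the names and sizes are invented):

```stan
parameters {
  simplex[3] p;        // {p_sb, p_r, p_down}: parent-level transition probs
  simplex[4] pi_down;  // initial distribution over the 4 children below
}
transformed parameters {
  // one row of the expanded matrix: the "go down" mass p[3] is spread
  // across the children by pi_down, so the row still sums to 1
  row_vector[6] expanded_row;
  expanded_row[1] = p[1];               // p_sb
  expanded_row[2] = p[2];               // p_r
  expanded_row[3:6] = p[3] * pi_down';  // p_down * pi_down_1, ..., p_down * pi_down_4
}
model {
  // uniform priors on the simplexes by default
}
```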

If that is still not clear, could you post the math formalism for the model (not just the estimation method), and I can try to translate it into those terms.

If you want to get the vertical or horizontal transition probabilities, you can extract them from the matrix after the appropriate renormalization.

This is the paper that introduces the model: The Hierarchical Hidden Markov Model: Analysis and Applications by Fine, Singer and Tishby (1998).

Thanks a lot for taking the time. I tried to keep the description as simple as possible to avoid the trouble.

Nope, this is not the case.

  1. The graph is composed of three types of nodes: (1) internal nodes that transition vertically or horizontally, (2) production nodes that emit one observation at a time, and (3) end nodes that do nothing but return control to the parent node.
  2. Each node has (1) children, (2) an initial distribution vector, and (3) a transition matrix.
  3. Each internal node transitions down if it has children, and transitions horizontally once control is returned to it.
    3.1. Vertical transition down: starting from the root, each internal node picks a child from its children list according to the initial distribution vector. Internal nodes keep going down until a production node is met.
    3.2. Horizontal transition: once a node returns control back to its parent, the parent transitions to a sibling.
    3.3. Vertical transition up: once the end node of the row is hit, control goes up to the previous level.

So, for horizontal transitions there's a transition matrix with rows summing to 1 (of size = number of siblings), and for vertical transitions there's a vector summing to 1 (of size = number of children). The transitions aren't summarised in a single vector that includes all possible moves; the model has logic that decides whether it has to transition horizontally or vertically and then picks the next node at random.

Aarong,

I decided to simplify the problem a bit. I simulated a simpler graph: there are two nodes at the top level, and each one has two production nodes as children. You first pick one of the branches, then you pick one component of the mixture. In a sense, it's no more than two Gaussian mixtures switching. Next, I coded an HMM in Stan that only considers the production nodes, i.e. the size of the transition matrix equals the number of production nodes (K = 4). I could recover all the parameters without much problem, although it'll prove more computationally challenging in the real case, where I'll have to deal with 63 (or 23) production nodes.
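For reference, the flattened model is essentially a standard K = 4 Gaussian HMM. A minimal sketch of what I fit (not the exact code in the repo, priors omitted):

```stan
data {
  int<lower=1> T;  // length of the observed sequence
  vector[T] y;     // observations
}
parameters {
  simplex[4] pi1;            // initial state distribution
  array[4] simplex[4] A;     // transition matrix, one simplex per row
  ordered[4] mu;             // emission means (ordered for identifiability)
  vector<lower=0>[4] sigma;  // emission standard deviations
}
model {
  // forward algorithm in log space
  array[T] vector[4] log_alpha;
  for (k in 1:4)
    log_alpha[1][k] = log(pi1[k]) + normal_lpdf(y[1] | mu[k], sigma[k]);
  for (t in 2:T) {
    for (k in 1:4) {
      vector[4] acc;
      for (j in 1:4)
        acc[j] = log_alpha[t - 1][j] + log(A[j][k]);
      log_alpha[t][k] = log_sum_exp(acc) + normal_lpdf(y[t] | mu[k], sigma[k]);
    }
  }
  target += log_sum_exp(log_alpha[T]);  // marginalize over the final state
}
```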

At first it seems that I lost all the hierarchical information: the transition probabilities are just a "summarized" measure of all the possible paths that connect two production nodes. I don't have direct estimates of the state probabilities of the internal nodes, but I can sort of reconstruct some of the probabilities of interest. In the end, it may just be a partial loss.

All this can be found in this file in the GitHub repo. I'll go on testing the model with more simulated data and doing some further research.

Thanks again for all the feedback, I’ll keep you posted.

Hi Luis,

I'm glad you were able to get it to work on the simpler case. I'd like to make a more general comment that may be helpful as you think about this going forward. You use a lot of terminology that jumps between ideas from the code you are using, such as "returns control to", and ideas from your specific representation of the model, such as "horizontal and vertical transitions" and "the model has a logic".

It might be helpful to separate these distinct but related steps:

  1. The underlying data generation process
  2. The statistical framework used to represent this
  3. The code that you use to generate data from 2
  4. The parameterization of the model (should be very similar to 2)
  5. The Stan code that you use to infer the model

If you use a particular framework to generate data from a model, you do not need to follow that exact framework in your Stan code in order to get the same posterior distribution, and often that won't be the most efficient way to infer the parameters. Your data generation code at the link actually seems quite complicated. As a bit of a benchmark: I have been working with a Gaussian process model for which I have a 30-line R program to generate data, but the corresponding Stan code is over 300 lines (Bob Carpenter tells me this is on the extreme end of things as far as Stan programs go). The point is: don't expect a one-to-one correspondence between the length of the R program that generates the data and the length of the Stan program that infers the parameters.
