Aarong,

I think I understand your idea but I’m still unsure if this is exactly what I need.

There are five top nodes. SB has six notes at the second level, five transition and one end node. Each transition node have three production children plus one end node. In total, SB has 26 children nodes. WB, WU and SU have the same structure, although it’s not shown in the picture for brevity. The R node has only four nodes at the second level with two children each, a total of ten nodes. Overall, there are 119 nodes: 28 transitions, 63 emissions and 28 end-nodes. I simplified the 63 emissions into only 23 nodes because I grouped the observations in triplets.

Now, each internal node `q_i^d`

, where `d = {1, ..., D}`

is the level with 1 being root and D = leaf, and `i = {1, ..., | q^d |}`

is the number of nodes in the level. Each node has a transition matrix `a_{ij}`

(for horizontal transitions) and a initial probability vector `pi`

(for vertical transitions).

The generalized forward probability is:

As you see, I need recursion to solve this. The forward probability of the bottom nodes (production) is trivial, but next I need to (1) sum over all possible horizontal transitions, and then (2) sum over all possible vertical transitions.

As per your suggestion, if I’m not following it wrong, I’d be working with only one mostly sparse matrix that joins all emission nodes from all levels and hierarchies. The elements corresponding to non-emission nodes would just be zero. The other elements will estimate a sort of “joint” probability that summarises all the possible paths from one emission node to the other (diagonal transitions). In that case, I wouldn’t need an initial distribution vector anymore.

For example: the element of the matrix from node 1 to 2 would be a zero, but the elements from node 3 to 4 would be the probability that one observation from node 4 comes after an observation of node 3 for all possible paths. Is that correct?

If I needed the compute the vertical or horizontal transition probabilities, wouldn’t it suffice to normalize these “diagonal” probabilities for a given subset of nodes?