Decision trees in Stan

Dear Staners,

I am wondering how data consisting of, say, 5 Bernoulli random variables could be modelled by fitting a “mixed decision tree” model in Stan.

I am also wondering whether the traditional decision tree method is “just” a maximum likelihood approximation to some simple Stan model, something that can be described in Stan in a “not too complicated” way.

What needs to be described is a distribution over tree structures, so the result is a mixture model of decision trees. It is not clear to me how easy this is to do in practice with Stan, partly because of the usual “mixture model chain convergence problem”, and partly because the random variable that indexes the tree structure is discrete, so Stan cannot sample it directly and it has to be marginalized out. But this is just a vague intuition.
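Concretely, if there are K candidate tree structures T_1, \dots, T_K with mixture weights \pi_k and per-tree parameters \theta_k, then, as far as I understand, the discrete index k has to be summed out of the likelihood:

p(y \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k \, p(y \mid T_k, \theta_k)

which Stan would compute on the log scale with log_sum_exp.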

Any thoughts on this topic?

Cheers,

Jozsef

This may be of interest. If you feel like it, try implementing the log likelihood of a mixture as a Stan function with some generated data, post it here, and I’d be happy to muddle through trying to make it work in Stan with you :-)
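As a starting point, here is a minimal, untested sketch of what such a function might look like: a two-component mixture over the kind of multivariate Bernoulli data described above, with the discrete component indicator marginalized out via log_mix. The function name bernoulli_mix_lpmf and the data layout are illustrative choices, not anything standard:

```stan
functions {
  // log likelihood of one observation (a row of D binary values)
  // under a two-component product-Bernoulli mixture; log_mix
  // marginalizes out the discrete component indicator
  real bernoulli_mix_lpmf(array[] int y, real lambda,
                          vector theta1, vector theta2) {
    return log_mix(lambda,
                   bernoulli_lpmf(y | theta1),
                   bernoulli_lpmf(y | theta2));
  }
}
data {
  int<lower=1> N;                        // observations
  int<lower=1> D;                        // binary variables per observation (e.g. 5)
  array[N, D] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> lambda;         // mixing weight
  vector<lower=0, upper=1>[D] theta1;    // per-variable probabilities, component 1
  vector<lower=0, upper=1>[D] theta2;    // per-variable probabilities, component 2
}
model {
  for (n in 1:N) {
    y[n] ~ bernoulli_mix(lambda, theta1, theta2);
  }
}
```

A mixture over actual tree structures would replace the two component log likelihoods with one term per candidate tree, and label switching between components is exactly the convergence headache mentioned above.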

Yes, I think we had that as a homework assignment in 2010 or so, given by J. Hollmen: https://people.aalto.fi/jaakko.hollmen#publications . At least the EM approximation of it.

I was wondering what it would mean to do “decision tree” modelling on 3 binary random variables: A, B, C.

The “phase space” (the set of states in which the system-to-be-modelled can be) is then:

\begin{array}{c|ccc} & A & B & C \\ \hline P_1 & 0 & 0 & 0 \\ P_2 & 1 & 0 & 0 \\ P_3 & 0 & 1 & 0 \\ P_4 & 1 & 1 & 0 \\ P_5 & 0 & 0 & 1 \\ P_6 & 1 & 0 & 1 \\ P_7 & 0 & 1 & 1 \\ P_8 & 1 & 1 & 1 \end{array}

In a decision tree, the result is “deterministic”, hence the conditional probabilities are either zero or one.

So, if C is to be “predicted” by a decision tree then, for example, P(C=1|A=0,B=0) can be either 0 or 1 but nothing else, because decision trees are “deterministic”, at least the “non-Bayesian” ones.
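One way to relax that determinism into something Stan can fit is a saturated conditional probability table for C given (A, B), where a deterministic depth-2 tree is the corner case in which every leaf probability is exactly 0 or 1. A minimal, untested sketch (the prior choice here is just an illustrative assumption):

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0, upper=1> A;
  array[N] int<lower=0, upper=1> B;
  array[N] int<lower=0, upper=1> C;
}
parameters {
  // one leaf probability per (A, B) cell; a deterministic tree
  // is the special case where each theta is exactly 0 or 1
  array[2, 2] real<lower=0, upper=1> theta;
}
model {
  for (i in 1:2) {
    for (j in 1:2) {
      theta[i, j] ~ beta(0.5, 0.5);  // U-shaped prior, pushes mass toward 0 and 1
    }
  }
  for (n in 1:N) {
    C[n] ~ bernoulli(theta[A[n] + 1, B[n] + 1]);
  }
}
```

The Beta(0.5, 0.5) prior concentrates mass near the corners, so the posterior should sit close to a deterministic tree whenever the data support one.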

So, this is one way to start thinking about this question.

I will keep thinking about this and post here if I come up with something meaningful. This is just a very simple start, a simple example from which it is possible to get some insight into what a “decision tree” might mean from Stan’s perspective.

I think decision trees also often have real covariates, and the algorithm actually determines the cuts that define the branches, e.g. if x > y then go left, otherwise right; then if z > a go left, otherwise right; and so on. So it may be possible to think of a decision tree on real covariates with a binary response as a change point model where each branch is a model. Our model would then need to determine both the branch model parameters and the change points.

I think real-world decision tree algorithms involve some level of dynamic programming, and decision trees don’t have static structures. But if there were a static structure, I think it could be treated as a change point model.
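To sketch what that might look like for a single static split (one change point, two leaf models), here is a minimal, untested Stan model. A hard threshold would make the likelihood discontinuous in the split location s, which HMC handles badly, so this sketch uses a sigmoid gate as a smooth stand-in; all names and priors are illustrative assumptions:

```stan
data {
  int<lower=1> N;
  vector[N] x;                        // real covariate
  array[N] int<lower=0, upper=1> y;   // binary response
}
parameters {
  real s;                             // split location on x
  real<lower=0> k;                    // gate sharpness; large k approaches a hard split
  real<lower=0, upper=1> p_left;      // leaf probability for x well below s
  real<lower=0, upper=1> p_right;     // leaf probability for x well above s
}
model {
  s ~ normal(0, 5);
  k ~ normal(0, 5);                   // half-normal, given the lower bound
  for (n in 1:N) {
    // soft branch assignment: g near 0 left of s, near 1 right of s
    real g = inv_logit(k * (x[n] - s));
    y[n] ~ bernoulli((1 - g) * p_left + g * p_right);
  }
}
```

A deeper tree would nest more gates, with each leaf contributing its own probability.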

Yes, of course, life is indeed more complicated than the simple Bernoulli case I proposed. But if one cannot understand how things work in the simplest Bernoulli case (which I do not yet understand myself), then it is difficult to understand what happens in more complex cases where, for example, the “decision boundary” is itself a parameter of the decision tree method.

If we can understand how to represent a decision tree in Stan for the simplest Bernoulli case, with just 3 random variables (the simplest non-trivial decision tree “in existence”, I think), then that could give some general intuition about what a decision tree “really is” in the first place, and also some guidance on when and why it is “good” to use a decision tree to model some data, and when and why it is not.

In general, I think it is worth thinking about how to transform common ML approaches into Bayesian (Stan) models, to get some meaningful, well-justified insight into why and when those approaches will work for some problems and why and when they will not work for others.