Bayesian random forest


I wonder if there are any attempts to model random forest in stan? There is a thread which recommends stan. Thanks for any leads.


Basically, no. The fundamental idea behind tree-based approaches is to partition the parameter space, whereas the fundamental idea behind the algorithms in Stan is to utilize derivatives to move through the parameter space. The discontinuity of partitioning makes tree-based approaches non-differentiable at points that the leapfrog integrator in Stan has to cross, leading to divergent transitions.

In Stan, it is preferable to use Gaussian processes to estimate “nonparametric” predictive models.


Some nice person coded out a classifier in Stan here:

1 Like

The nice person happens to be @James_Savage, who is also on here.

1 Like

Thanks @maxbiostat ! We use a lot of @James_Savage 's models for our ecology data. The classifier is pretty sweet to play with.

1 Like

Thank you very much. Technically I need regression (and would be awesome if there are some examples) but I guess I can translate classification into regression. I am currently using Bayesian Neural Networks for my ash tree problem so I thought about exploring different techniques.

Well, there is an article Since they did MCMC I wonder if stan could work. Since I have a regression problem and all variables are continuous - maybe in this case there won’t be discontinuities (I don’t know).

There are discontinuities due to partitioning the continuous parameter space.

OK. But in theory Gibbs samplers may work. They may be not so efficient…

bgoodri - do you think that this simple model with 3 level tree won’t work in stan?

data {
	int n;
	int nvar;
	matrix[n, nvar] x;
	vector[n] y;
parameters {
	real split1;	
	real split2[2];
	real split3[2, 2];
	real mu[2, 2, 2];
	real sigma[2, 2, 2];
	simplex[1] variable1;
	simplex[1] variable2[2];
	simplex[1] variable3[2, 2];
	vector[1] alpha;
model {
	variable1 ~ dirichlet(alpha);	
	for (i in 1:2) variable2[i] ~ dirichlet(alpha);	
	for (i in 1:2) for (j in 1:2) variable3[i, j] ~ dirichlet(alpha);
	for (i in 1:n) {
		int b1 = 1;
		int b2 = 1;
		int b3 = 1;
		int var1 = 1;
		int var2 = 1;
		int var3 = 1;

		if (variable1[1] > 0.5) var1 = 2;
		if (x[i, var1] >= split1) b1 = 2; 
		if (variable2[b1][1] > 0.5) var2 = 2;
		if (x[i, var2] >= split2[b1]) b2 = 2; 
		if (variable3[b1, b2][1] > 0.5) var3 = 2;
		if (x[i, var3] >= split3[b1, b2]) b3 = 2; 

		y[i] ~ normal(mu[b1, b2, b3], sigma[b1, b2, b3]);

Correct, due to the discontinuities that create divergences.

I remember @betanalpha and @anon75146577 musing about smooth parametrization of (phylogenetic) trees on " Betancourting disaster" my takeaway was that there are ideas floating around that suggest it might be possible to do, but we are not there yet.

I agree with others that splines/GP provide similar qualities as random forests do while being tractable for Stan.

Also remember that unfortunately Bayesian neural networks are fundamentally non-identified so you cannot treat their posteriors very seriously (while splines/GPs are quite well understood and reliable)


I messed with BNN quite a bit. Yes, they are unidentifiable but funny part that predictions from all (or almost all chains) are very similar. It leads me to suspicion that while individual weights are unidentifiable the predictive posterior is pretty robust. I don’t have enough theory but maybe some properties about predictive posterior could be proved theoretically.

GPs are very computationally expensive but I will try those anyway (just for a fun to compare with BNN). Are there any good ways to combine GPs with MPI?

If you’re worried about the computation for full GPs, check out the posts on splines and approximate GPs here. I think they’re made pretty simple in brms.


I started my academic career doing Bayesian neural networks with HMC (if interested see publication list at You have been just lucky if predictions from all (or almost all chains) are very similar. I stopped using BNNs when I saw cases where with some small but non-negligible probability some chains produced very different predictions leading to quite different conclusions. In addition, there was always the problem that NNs can be very surprising between observations (this is related to adversial examples etc in deep learning). The predictive posterior can be robust in simple examples, but you can’t trust that it is robust in the next case because there is no good diagnostics. There are many successful examples of NNs from more than twenty years, but also more examples of funny results which, e.g., in case of autonomous cars would not be funny at all. I’m co-author even in quite recent NN paper, that is, they can be useful, but I don’t think the posterior is even near robust.

Asymptotic theory says that in infinite time you will get convergence and robust result. Non-asymptotic theory says that there is a combinatorial explosion in identifiability and there is no hope.

I switched from BNNs to GPs because they were computationally superior, that is, I could get accurate robust posterior inference in finite time (you can see some comparison in my old papers).

If you need more scalable GPs, I recommend to check GP specific software. There are also some results for using deep NN as feature generator and then using GP to get better probabilistic guarantees.

These are very useful for additive models with no or only low order interactions.



If you run Stan from different starting points and get poor convergence, you can compute a weighted average of the different chains using stacking. We have hope that this could help with models such as phylogenetic trees. Yuling is writing a paper on this, but as a start you can look here:


Well, I have to make some corrections. Blind use of BNNs may lead to very different results. So as first check I looked at the trace of likelihood. If it is not stable then I don’t consider the chain. The second one is to trace all weights and biases. If those are unstable I disregard the chain. Then I look at the predictions (start:finish sample were selected based on stability of posterior). Lately I found that if predictions are too good (uncertainty is negligible) then I need to disregard the chain. So after this excruciating pain I was able to get 2-3 chains (out of 4) producing similar predictive posterior (I was looking how well observed falls within 95% CR) pretty consistently. Data was canopy thickness of 400 ash trees for 5 years (one observation per year). In BNN I like ARD thing - I remove unimportant predictors and go with stochastic optimization. Any comments? Actually I am thinking about stacking those few good chains. Meanwhile I am trying to run GP model for the same problem and compare the results. Any ARD in GP?

GP did amazing. And it was faster… Now I am concerned that I put too informative priors. I actually copied latent GP model from stan user’s guide and added some specific code. So if you have interest and time please take a look. BTW, data is in 0-1 scale.
GP.stan (4.2 KB)
Also attached is comparison of GP model and BNN model. It confirms the saying that all models are wrong but some are useful…
3cliffMPI_WhiteAsh_Purdue_GP_20_10_6_1_100coverage.pdf (17.0 KB)
3cliffMPI_WhiteAsh_cliffMPIHalfConstrSigmaConstHl2_20_10_6_1_100coverage.pdf (17.5 KB)


Could you please refer those?

1 Like

This approach of selecting “good” chains destroys the validity of MCMC and the draws are not from the posterior distribution.

I don’t have the references ready and I would need to do some google searches for them for which I may have time later this week. Meanwhile you can try, e.g. dnn kernel GP or calibration of dnn with GP or something like that (sorry I don’t remember now the exact keywords or authors)

Not sure which paper(s) Aki is referring to, but here’s at least one If I’m not mistaken, this might already be implemented in gpytorch. If you are more interested in these topics, you might find interesting also many other fairly recent papers by Andrew Gordon Wilson and his collaborators.

1 Like