Bayesian random forest

Hi,

I wonder whether there have been any attempts to fit random forests in Stan. There is a thread https://stats.stackexchange.com/questions/124658/how-can-regression-trees-be-fit-in-winbugs-openbugs-jags that recommends Stan. Thanks for any leads.

3 Likes

Basically, no. The fundamental idea behind tree-based approaches is to recursively partition the predictor space, whereas the fundamental idea behind the algorithms in Stan is to use derivatives to move through the parameter space. Because the split points enter the likelihood as hard thresholds, the log density is discontinuous (and hence non-differentiable) in those parameters at exactly the points the leapfrog integrator in Stan has to cross, leading to divergent transitions.
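
To see the problem concretely, here is a minimal one-split "stump" regression (a sketch for illustration, not a recommended model): the split point c enters the likelihood only through a hard threshold, so the log density is a step function of c and HMC gets no gradient signal to move it.

data {
	int<lower=1> n;
	vector[n] x;
	vector[n] y;
}
parameters {
	real c;                  // split point: the problematic parameter
	real mu_left;
	real mu_right;
	real<lower=0> sigma;
}
model {
	for (i in 1:n)
		// piecewise constant in c: the gradient with respect to c is
		// zero between data points and undefined at them
		y[i] ~ normal(x[i] < c ? mu_left : mu_right, sigma);
}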

In Stan, it is preferable to use Gaussian processes to estimate “nonparametric” predictive models.
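
For example, here is a basic GP regression with a squared-exponential kernel, along the lines of the model in the Stan user's guide (the priors here are placeholders to adapt to the problem at hand):

data {
	int<lower=1> n;
	real x[n];
	vector[n] y;
}
parameters {
	real<lower=0> rho;       // length-scale
	real<lower=0> alpha;     // marginal standard deviation
	real<lower=0> sigma;     // observation noise
}
model {
	matrix[n, n] K = cov_exp_quad(x, alpha, rho)
	                 + diag_matrix(rep_vector(square(sigma), n));
	rho ~ inv_gamma(5, 5);
	alpha ~ normal(0, 1);
	sigma ~ normal(0, 1);
	y ~ multi_normal_cholesky(rep_vector(0, n), cholesky_decompose(K));
}

Unlike the tree, everything here is smooth in the parameters, which is exactly what the sampler needs.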

6 Likes

Some nice person coded up a classifier in Stan here:

1 Like

The nice person happens to be @James_Savage, who is also on here.

1 Like

Thanks @maxbiostat! We use a lot of @James_Savage's models for our ecology data. The classifier is pretty sweet to play with.

1 Like

Thank you very much. Technically I need regression (and it would be awesome if there were some examples), but I guess I can adapt the classification example to regression. I am currently using Bayesian neural networks for my ash-tree problem, so I thought about exploring different techniques.

Well, there is an article: http://proceedings.mlr.press/v37/matthew15.pdf. Since they used MCMC, I wonder whether Stan could work. My problem is regression and all my variables are continuous, so maybe in this case there won't be discontinuities (I don't know).

There are discontinuities because the splits partition a continuous space: even with continuous predictors, the likelihood jumps whenever a split point crosses a data point.

OK. But in theory Gibbs samplers may work. They may just not be so efficient…

@bgoodri - do you think that this simple model with a 3-level tree won't work in Stan?

data {
	int n;
	int nvar;
	matrix[n, nvar] x;
	vector[n] y;
}
parameters {
	real split1;
	real split2[2];
	real split3[2, 2];
	real mu[2, 2, 2];
	real<lower=0> sigma[2, 2, 2];      // scales must be constrained positive
	simplex[2] variable1;              // simplex[1] is identically 1, so use simplex[2]
	simplex[2] variable2[2];
	simplex[2] variable3[2, 2];
	vector<lower=0>[2] alpha;          // Dirichlet concentrations must be positive
}
model {
	alpha ~ exponential(1);            // proper prior on the concentration
	variable1 ~ dirichlet(alpha);
	for (i in 1:2) variable2[i] ~ dirichlet(alpha);
	for (i in 1:2) for (j in 1:2) variable3[i, j] ~ dirichlet(alpha);
	for (i in 1:n) {
		int b1 = 1;
		int b2 = 1;
		int b3 = 1;
		int var1 = 1;
		int var2 = 1;
		int var3 = 1;

		// hard, non-differentiable branching on parameters:
		// this is exactly what creates divergences under HMC
		if (variable1[1] > 0.5) var1 = 2;
		if (x[i, var1] >= split1) b1 = 2;
		if (variable2[b1][1] > 0.5) var2 = 2;
		if (x[i, var2] >= split2[b1]) b2 = 2;
		if (variable3[b1, b2][1] > 0.5) var3 = 2;
		if (x[i, var3] >= split3[b1, b2]) b3 = 2;

		y[i] ~ normal(mu[b1, b2, b3], sigma[b1, b2, b3]);
	}
}

Correct, due to the discontinuities that create divergences.

I remember @betanalpha and @anon75146577 musing about smooth parametrizations of (phylogenetic) trees on "Betancourting disaster"; my takeaway was that there are ideas floating around suggesting it might be possible to do, but we are not there yet.
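
For intuition, one version of that idea (a sketch under my own assumptions, not anyone's published parametrization) is to soften the hard indicator in a single split with a logistic gate, so the split point becomes a differentiable parameter:

data {
	int<lower=1> n;
	vector[n] x;
	vector[n] y;
	real<lower=0> tau;       // gate steepness; smaller values approach a hard split
}
parameters {
	real c;                  // split point, now with a usable gradient
	real mu_left;
	real mu_right;
	real<lower=0> sigma;
}
model {
	// smooth gate replacing the hard indicator x >= c
	vector[n] p = inv_logit((x - c) / tau);
	y ~ normal(mu_left + p .* (mu_right - mu_left), sigma);
}

This recovers the hard tree as tau goes to zero, but the posterior typically becomes badly multimodal as the gate sharpens, which is part of why we are not there yet.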

I agree with others that splines/GPs provide similar qualities to random forests while being tractable in Stan.

Also remember that, unfortunately, Bayesian neural networks are fundamentally non-identified (for example, permuting hidden units, or flipping the signs of a tanh unit's incoming and outgoing weights, leaves the network function unchanged), so you cannot take their posteriors very seriously (while splines/GPs are quite well understood and reliable).

3 Likes

I have messed with BNNs quite a bit. Yes, they are non-identified, but the funny part is that the predictions from all (or almost all) chains are very similar. This leads me to suspect that while the individual weights are non-identified, the posterior predictive is pretty robust. I don't have enough theory, but maybe some properties of the posterior predictive could be proved theoretically.

GPs are very computationally expensive, but I will try them anyway (just for fun, to compare with BNNs). Are there any good ways to combine GPs with MPI?

If you’re worried about the computation for full GPs, check out the posts on splines and approximate GPs here. I think they’re made pretty simple in brms.

2 Likes

I started my academic career doing Bayesian neural networks with HMC (if interested, see the publication list at Aki Vehtari - Publications). You have just been lucky if predictions from all (or almost all) chains are very similar. I stopped using BNNs when I saw cases where, with some small but non-negligible probability, some chains produced very different predictions leading to quite different conclusions. In addition, there was always the problem that NNs can be very surprising between observations (this is related to adversarial examples etc. in deep learning). The predictive posterior can be robust in simple examples, but you can't trust that it is robust in the next case because there are no good diagnostics. There are many successful examples of NNs from more than twenty years, but also more examples of funny results which, e.g., in the case of autonomous cars, would not be funny at all. I'm a co-author even on a quite recent NN paper, that is, they can be useful, but I don't think the posterior is even near robust.

Asymptotic theory says that in infinite time you will get convergence and a robust result. Non-asymptotic theory says that there is a combinatorial explosion of equivalent modes, so there is no hope.

I switched from BNNs to GPs because they were computationally superior, that is, I could get accurate and robust posterior inference in finite time (you can see some comparisons in my old papers).

If you need more scalable GPs, I recommend checking GP-specific software. There are also some results on using a deep NN as a feature generator and then using a GP on top to get better probabilistic guarantees.

These are very useful for additive models with no interactions or only low-order interactions.

7 Likes

Martin:

If you run Stan from different starting points and get poor convergence, you can compute a weighted average of the different chains using stacking. We have hope that this could help with models such as phylogenetic trees. Yuling is writing a paper on this, but as a start you can look here: http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf
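
In symbols, the idea (a sketch of it, not the paper's exact estimator) is to form the combined predictive distribution

$$\hat{p}(\tilde{y} \mid y) = \sum_{k=1}^{K} w_k \, p(\tilde{y} \mid y, \text{chain } k), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,$$

with the weights chosen to maximize an estimate of the leave-one-out predictive density of the combination, rather than weighting the chains equally.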

4 Likes

Well, I have to make some corrections. Blind use of BNNs may lead to very different results. So as a first check I look at the trace of the likelihood; if it is not stable, I don't consider the chain. The second check is to trace all weights and biases; if those are unstable, I disregard the chain. Then I look at the predictions (the start:finish samples were selected based on the stability of the posterior). Lately I have found that if the predictions are too good (uncertainty is negligible), then I need to disregard the chain as well. So after this excruciating pain I was able to get 2-3 chains (out of 4) producing a similar posterior predictive (I was looking at how well the observed values fall within the 95% CR) pretty consistently. The data are canopy thickness of 400 ash trees over 5 years (one observation per year). In BNNs I like the ARD thing - I remove unimportant predictors and go with stochastic optimization. Any comments? Actually, I am thinking about stacking those few good chains. Meanwhile, I am trying to run a GP model on the same problem and compare the results. Is there any ARD in GPs?
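
For reference, ARD has a direct GP analogue: give each input dimension its own length-scale in the kernel, so an irrelevant input gets a large length-scale and is effectively pruned. A minimal Stan sketch (the helper name ard_exp_quad and the priors are made up for this example):

functions {
	// squared-exponential kernel with one length-scale per input
	// dimension (automatic relevance determination)
	matrix ard_exp_quad(vector[] x, real alpha, vector rho) {
		int n = size(x);
		matrix[n, n] K;
		for (i in 1:n) {
			K[i, i] = square(alpha) + 1e-9;   // jitter for stability
			for (j in (i + 1):n) {
				K[i, j] = square(alpha)
				          * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
				K[j, i] = K[i, j];
			}
		}
		return K;
	}
}
data {
	int<lower=1> n;
	int<lower=1> d;
	vector[d] x[n];
	vector[n] y;
}
parameters {
	real<lower=0> alpha;
	vector<lower=0>[d] rho;   // large rho[k] means input k is effectively ignored
	real<lower=0> sigma;
}
model {
	matrix[n, n] K = ard_exp_quad(x, alpha, rho)
	                 + diag_matrix(rep_vector(square(sigma), n));
	alpha ~ normal(0, 1);
	rho ~ inv_gamma(5, 5);
	sigma ~ normal(0, 1);
	y ~ multi_normal_cholesky(rep_vector(0, n), cholesky_decompose(K));
}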

The GP did amazingly well. And it was faster… Now I am concerned that I used priors that are too informative. I actually copied the latent-variable GP model from the Stan user's guide and added some problem-specific code. So if you have the interest and time, please take a look. BTW, the data are on a 0-1 scale.
GP.stan (4.2 KB)
Also attached is a comparison of the GP model and the BNN model. It confirms the saying that all models are wrong but some are useful…
3cliffMPI_WhiteAsh_Purdue_GP_20_10_6_1_100coverage.pdf (17.0 KB)
3cliffMPI_WhiteAsh_cliffMPIHalfConstrSigmaConstHl2_20_10_6_1_100coverage.pdf (17.5 KB)

2 Likes

Could you please give references for those?

1 Like

This approach of selecting “good” chains destroys the validity of MCMC and the draws are not from the posterior distribution.

I don't have the references at hand and would need to do some Google searches, for which I may have time later this week. Meanwhile you can try searching for, e.g., "DNN kernel GP" or "calibration of DNN with GP" or something like that (sorry, I don't remember the exact keywords or authors right now).

Not sure which paper(s) Aki is referring to, but here's at least one: Deep Kernel Learning. If I'm not mistaken, this might already be implemented in GPyTorch. If you are more interested in these topics, you might also find interesting many of the other fairly recent papers by Andrew Gordon Wilson and his collaborators.

1 Like