Bayesian nonparametric modeling


#1

Hi,
I just have a general question about Stan's capability for nonparametric Bayesian modeling. So far the user guide is pretty much parametric. Is there any way to handle, for example, the CRP/IBP in Stan? Or are there any plans to move in that direction?
Thanks!

jicun


#2

AFAIK the two major problems with doing non-parametric stuff in Stan are that 1) it’s not possible to sample discrete parameters, and 2) Stan cannot sample a “varying” number of parameters.

That being said, I read somewhere that people are overcoming these issues, for example by marginalizing out the discrete parameters (1), or by setting a very high number of clusters (2), which in essence captures “going to infinity”, and then regularizing with priors (so that the arbitrary upper bound doesn’t actually matter).
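For (1), the standard trick is to sum the discrete component indicator out of the likelihood using log-sum-exp. Here is a minimal Python sketch of what that marginalization means for a K-component normal mixture (the function names are my own, and this is an illustration rather than Stan code):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-terms."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) \
        - 0.5 * ((y - mu) / sigma) ** 2

def mixture_loglik(y, weights, mus, sigmas):
    """Log-likelihood of data y under a K-component normal mixture with
    the discrete indicator z marginalized out:
    log p(y_i) = log sum_k w_k * Normal(y_i | mu_k, sigma_k)."""
    total = 0.0
    for yi in y:
        total += log_sum_exp([
            math.log(w) + normal_logpdf(yi, mu, s)
            for w, mu, s in zip(weights, mus, sigmas)
        ])
    return total
```

Because the indicator never appears as a sampled quantity, the whole expression is a smooth function of the continuous parameters, which is exactly what HMC needs.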


Bayesian melding
#3

Thanks for the info. Very reasonable.

jicun


#4

Nope, no plans to go in that direction. We are trying to limit our focus to models that are tractable, and none of these are, because of the combinatorial multimodality in the posterior (not just index switching, which is manageable). In models that are tractable, you can almost always marginalize out any discrete parameters for much greater efficiency than you’d get from sampling them.

We do support Gaussian processes, which are essentially non-parametric curve fitters that work sort of like a smooth version of K-nearest neighbors.
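The “smooth version of K-nearest neighbors” intuition can be sketched as a kernel smoother; this is my own illustration of the analogy, not how a GP posterior is actually computed:

```python
import math

def rbf_weight(x, xi, ls=1.0):
    """Smooth weight that decays with distance, replacing KNN's hard
    in-or-out neighbor cutoff."""
    return math.exp(-((x - xi) ** 2) / (2 * ls ** 2))

def kernel_smoother_predict(x, xs, ys, ls=1.0):
    """Predict at x as a kernel-weighted average of all training targets.
    Every training point votes, with nearby points weighted most heavily.
    The GP posterior mean is likewise a weighted combination of the
    targets, but with weights derived from the inverse kernel matrix."""
    ws = [rbf_weight(x, xi, ls) for xi in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)
```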

I think what @Max_Mantei was getting at in the second paragraph is that you can approximate a Dirichlet process with a large Dirichlet. But you won’t be able to get good convergence in multiple chains because of multimodality. You see the same thing with a simple Latent Dirichlet allocation model (i.e., multinomial mixed membership mixture model).


#5

Hi Bob, thanks for the comments and info. Can you give a more precise definition of a tractable model?


#6

I mean tractable in the computer science sense of a problem that can be fit in polynomial time.

But usually we really mean low-order polynomial, as in linear or at most quadratic. Sometimes marginalization is exponential, as with variable-selection indicators; other times it’s linear, as in a mixture model. The linear marginalization is tractable and the exponential one isn’t.
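To make the contrast concrete, here is a toy accounting of how many terms each marginalization sums over (the function names are my own illustration):

```python
from itertools import product

def mixture_marginalization_terms(n, k):
    """Marginalizing a per-observation component indicator sums k terms
    per data point: O(n * k), linear in n."""
    return n * k

def variable_selection_terms(p):
    """Marginalizing p binary inclusion indicators means summing the
    likelihood over every 0/1 inclusion vector: 2**p configurations."""
    return sum(1 for _ in product([0, 1], repeat=p))
```

With n = 1000 observations and k = 5 components, the mixture costs 5000 terms; just p = 30 inclusion indicators already require over a billion.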


#7

Thanks again!


#8

Gaussian processes, which are essentially non-parametric curve fitters that work sort of like a smooth version of K-nearest neighbors.

I had this same epiphany that GPs are just smooth K-nearest neighbors a few months ago! At first I thought it was profound, but then it just made me like GPs less, because I realized they’ll suffer from the same problems as K-nearest neighbors. In particular, in a high-dimensional feature space all points are far from each other, so you don’t really have any close neighbors to base your prediction on.

Now I’m more keen to just add higher order polynomial terms with informative sparse priors if I think there might be non-linear effects in a high-dimensional feature space.

I digress though. My apologies for the thread detour!


#9

You can also make predictions based on far points. In high dimensions, RBF-type covariances like the exponentiated quadratic start to resemble a dot product, implying a linear model, which is different from what happens with KNN.
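Here is a small Python sketch (my own illustration, not from the thread) of that dot-product behavior: for unit-norm points in high dimension, with a lengthscale that grows with dimension, the exponentiated quadratic kernel is numerically indistinguishable from an affine function of the dot product.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_exp_kernel(x, y, ls):
    """Exponentiated quadratic: k(x, y) = exp(-||x - y||^2 / (2 * ls^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * ls ** 2))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

random.seed(0)
p = 200                  # high-dimensional feature space
ls = math.sqrt(p)        # lengthscale grown with dimension
xs = [unit([random.gauss(0, 1) for _ in range(p)]) for _ in range(5)]

# For unit-norm points, ||x - y||^2 = 2 - 2 x.y, so
# k(x, y) = exp(-1/ls^2) * exp(x.y / ls^2) ~ const * (1 + x.y / ls^2):
# affine in the dot product, i.e. behaving like a linear model.
for x in xs:
    for y in xs:
        linearized = math.exp(-1 / ls ** 2) * (1 + dot(x, y) / ls ** 2)
        assert abs(sq_exp_kernel(x, y, ls) - linearized) < 1e-4
```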

Higher order polynomials are in most cases a bad choice even in 1D.


#10

This is only true if you choose those points at random. As you may remember from Susan Holmes’s talk at StanCon (nice to meet you there in person), the actual data can form a much lower-dimensional submanifold of the higher-dimensional space. Then the trick is making those points close together in some feature space.

So the fundamental problem with GPs is that they just reduce the problem to one of defining what “near” means, which is in and of itself a very challenging problem.


#11

In high dimensions, RBF-type covariances like the exponentiated quadratic start to resemble a dot product, implying a linear model, which is different from what happens with KNN.

Ah, I didn’t realize that, but Ben pointed out the Taylor expansion of the kernel and that makes sense.

Higher order polynomials are in most cases a bad choice even in 1D.

Do you usually use GPs for regression when the feature space is high-dimensional?


#12

This is only true if you choose those points at random.

Ah right, good point. Nice meeting you too!


#13

Most of the applications I’ve seen have been in only a few dimensions, usually for time and space. But they can also be used for general regression. Radford Neal showed back in the early 90s that a neural network converges to a GP in the limit as the number of hidden units grows. So it’s a very general non-linear curve-fitting model.

The right way to think about data dimensionality is in topological terms. For example, a sphere is a 2-dimensional surface embedded in 3 dimensions. On the sphere, you have only two degrees of freedom (latitude and longitude). The natural distance might be something like great-circle distance (more generally, a geodesic).
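For example, great-circle distance can be computed with the standard haversine formula (the function name here is my own):

```python
import math

def great_circle_distance(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle (geodesic) distance on a sphere via the haversine
    formula. This is the natural notion of 'near' for points living on a
    2-D spherical surface embedded in 3-D space. The radius defaults to
    Earth's mean radius in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 \
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))
```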

In general, the trick’s finding the right embeddings for GPs. To tackle the birthday problem (cover of BDA 3), Aki modified distances so that all Mondays are near each other, all days in a given season are near each other, etc. It doesn’t just have to be Euclidean distance.
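One standard way to encode “all Mondays are near each other” is an exponentiated-sine-squared (periodic) kernel. This is a generic sketch of that idea, not Aki’s actual birthday model:

```python
import math

def periodic_kernel(t1, t2, period=7.0, ls=1.0):
    """Exponentiated sine-squared kernel: the correlation depends only on
    where the two times fall within the period, so days exactly one
    period apart (e.g. two Mondays with period=7) are treated as
    maximally 'near', no matter how far apart they are on the calendar."""
    return math.exp(-2.0 * math.sin(math.pi * (t1 - t2) / period) ** 2
                    / ls ** 2)
```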


#14

Radford Neal showed back in the early 90s that a neural network converges to a GP in the limit as the number of hidden units grows. So it’s a very general non-linear curve-fitting model.

Yeah! I was actually reading his thesis a couple months ago. That result was pretty cool. So far I haven’t been able to find an explanation of why using more layers, rather than a wider single layer, has been so successful in ML applications.

In general, the trick’s finding the right embeddings for GPs. To tackle the birthday problem (cover of BDA 3), Aki modified distances so that all Mondays are near each other, all days in a given season are near each other, etc. It doesn’t just have to be Euclidean distance.

Ah I haven’t had a chance to read the birthday example yet, but that’s pretty damn cool! I know a kernel is just a distance function, but I haven’t really thought to get that creative with them in my own models.


#15

The number of dimensions is not a problem as long as you have n >> p, or you use a prior which shuts down some dimensions, or dimensionality reduction. Having n even somewhat bigger than p can help in going beyond linearity.

I thought Radford Neal explains that, too. Also google papers on deep GPs.


#16

I thought Radford Neal explains that, too. Also google papers on deep GPs.

I’ve seen some of the stuff on how going deep is equivalent to learning a GP with a non-stationary kernel. Is that the work you were referring to?


#17

There are so many recent deep GP papers, but this one, https://arxiv.org/abs/1711.11280, is a good starting point, as they review four ways to produce “deepness”: (1) deep composition, (2) rich kernel functions, (3) deep kernel operators, and (4) convolutions.