Hi,
I just have a general question about Stan's capability for nonparametric Bayesian modeling. So far the user guide is pretty much parametric. Is there any way to handle, for example, CRP/IBP models in Stan? Or are there any plans to move in that direction?
Thanks!

AFAIK the two major problems with doing non-parametric stuff in Stan are that 1) it's not possible to sample discrete parameters, and 2) Stan cannot sample a "varying" number of parameters.

That being said, I've read that people are overcoming these issues, for example by marginalizing out discrete parameters (1), or by setting a very high number of clusters (2), which in essence captures "going to infinity", and then regularizing with priors (so that the arbitrary upper bound doesn't actually matter).
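As a quick illustration of (1), here's a minimal Python sketch of marginalizing the discrete component indicator out of a two-component normal mixture with the usual log-sum-exp trick (the weights and component parameters are made-up illustration values, not from any real model):

```python
import math

def normal_lpdf(y, mu, sigma):
    """Log density of y under Normal(mu, sigma)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(xs)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mixture_lpdf(y, weights, mus, sigma):
    """Marginal log density of y: the discrete component indicator
    is summed (marginalized) out rather than sampled."""
    return log_sum_exp([math.log(w) + normal_lpdf(y, mu, sigma)
                        for w, mu in zip(weights, mus)])

# Two-component example with made-up parameter values.
lp = mixture_lpdf(0.5, weights=[0.3, 0.7], mus=[-1.0, 2.0], sigma=1.0)
```

This is the same construction a Stan program would express with `log_sum_exp` in the model block; the component indicator never appears as a parameter.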

Nope, no plans to go in that direction. We are trying to limit our focus to models that are tractable, and none of these are, because of the combinatorial multimodality in the posterior (not just index switching, which is manageable). In models that are tractable, you can almost always marginalize out any discrete parameters for much greater efficiency than you'd get from sampling them.

We do support Gaussian processes, which are essentially non-parametric curve fitters that work sort of like a smooth version of K-nearest neighbors.
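For intuition on the "smooth KNN" analogy, here's a small Python/NumPy sketch (with made-up 1-D training data): the GP posterior mean at a new point is a weighted combination of the training targets, with weights that fall off smoothly with distance.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Exponentiated-quadratic kernel matrix between 1-D input vectors."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Made-up 1-D training data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 0.1])
noise = 0.1  # observation noise variance

K = rbf(x, x) + noise * np.eye(len(x))
x_star = np.array([1.5])

# Posterior mean at x_star is weights @ y: a weighted combination of
# training targets, with weights decaying smoothly with distance.
weights = np.linalg.solve(K, rbf(x, x_star).ravel())
y_pred = weights @ y
```

Unlike KNN's hard cutoff, every training point gets some weight; nearby points just get much more of it.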

I think what @Max_Mantei was getting at in the second paragraph is that you can approximate a Dirichlet process with a large Dirichlet. But you won't be able to get good convergence in multiple chains because of multimodality. You see the same thing with a simple latent Dirichlet allocation model (i.e., a multinomial mixed-membership mixture model).

But by "tractable" we usually mean low-order polynomial cost, as in linear or at most quadratic. Sometimes marginalization is exponential, as with variable-selection indicators. Other times it's linear, as in a mixture model. The linear marginalization is tractable and the exponential marginalization isn't.
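A back-of-the-envelope sketch in Python of the two cases (the function names are just for illustration):

```python
# Terms needed per observation to marginalize discrete structure
# out of the likelihood:

def mixture_terms(K):
    """A K-component mixture sums over K terms: linear in K."""
    return K

def selection_terms(p):
    """Variable-selection indicators over p predictors sum over every
    subset of predictors: 2**p terms, exponential in p."""
    return 2 ** p

# Ten clusters cost ten terms; thirty inclusion indicators cost
# over a billion terms.
```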

Gaussian processes, which are essentially non-parametric curve fitters that work sort of like a smooth version of K-nearest neighbors.

I had this same epiphany that GPs are just smooth K-nearest neighbors a few months ago! At first I thought it was profound, but then it just made me like GPs less, because I realized they'll suffer from the same problems as K-nearest neighbors. In particular, in a high-dimensional feature space all points are far from each other, so you don't really have any close neighbors to base your prediction on.
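A quick Python sketch of that distance-concentration effect, using random points in a unit cube (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d, n=200):
    """Ratio of farthest to nearest pairwise distance among n random
    points in the d-dimensional unit cube."""
    x = rng.random((n, d))
    sq = (x ** 2).sum(axis=1)
    # Pairwise squared distances via the dot-product identity.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * x @ x.T, 0.0)
    dist = np.sqrt(d2[~np.eye(n, dtype=bool)])  # drop the diagonal zeros
    return dist.max() / dist.min()

low_d = distance_contrast(2)      # plenty of contrast: real neighbors exist
high_d = distance_contrast(1000)  # contrast collapses: no one is "near"
```

In 2-D the farthest pair is orders of magnitude farther than the nearest; in 1000-D the ratio is close to 1, so "nearest" neighbors stop being informative.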

Now I'm more keen to just add higher order polynomial terms with informative sparse priors if I think there might be non-linear effects in a high-dimensional feature space.

I digress though. My apologies for the thread detour!

You can also make predictions based on far-away points. In high dimensions, RBF-type covariances like the exponentiated quadratic start to resemble a dot-product covariance, implying a linear model, which is different from what happens with KNN.

Higher order polynomials are in most cases a bad choice even in 1D.

This is only true if you choose those points at random. As you may remember from Susan Holmes's talk at StanCon (nice to meet you there in person), the actual data can form a much lower-dimensional submanifold of the higher-dimensional space. Then the trick's making those points close together in some feature space.

So the fundamental problem with GPs is that it just reduces the problem to one of defining what "near" means, which is in and of itself a very challenging problem.

RBF-type covariances like the exponentiated quadratic start to resemble a dot-product covariance, implying a linear model, which is different from what happens with KNN.

Ah, I didn't realize that, but Ben pointed out the Taylor expansion of the kernel and that makes sense.
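For anyone curious, here's a rough numerical check in Python of that Taylor-expansion point: on unit-norm inputs, with a lengthscale long relative to the point separation, the exponentiated-quadratic kernel is almost perfectly correlated with a constant-plus-dot-product covariance, i.e. a linear model (the sizes and lengthscale here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, ell = 100, 50, 10.0  # lengthscale long relative to point separation

# Random high-dimensional inputs, normalized to unit norm.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

dots = x @ x.T
# Exponentiated quadratic on unit-norm inputs:
# exp(-|x - y|^2 / (2 l^2)) = exp(-(1 - x.y) / l^2)
rbf = np.exp(-(1.0 - dots) / ell ** 2)

# First-order Taylor expansion exp(t) ~ 1 + t gives a constant plus a
# scaled dot product: a linear-model covariance.
linear = 1.0 + dots / ell ** 2

iu = np.triu_indices(n, k=1)
corr = np.corrcoef(rbf[iu], linear[iu])[0, 1]
```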

Higher order polynomials are in most cases a bad choice even in 1D.

Do you usually use GPs for regression when the feature space is high-dimensional?

Most of the applications I've seen have been only a few dimensions, usually for time and space. But they can also be used for general regression. Radford Neal showed back in the early 90s that the limit of a neural network as the number of hidden units grows is a GP. So it's a very general non-linear curve-fitting model.

The right way to think about data dimensionality is in topological terms. For example, a sphere is a 2-dimensional surface embedded in 3 dimensions. While on a sphere, you only have two degrees of freedom (latitude and longitude). The natural distance might be something like great circle distance (more generally, a geodesic).
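A small Python sketch of that sphere example, comparing the geodesic (great-circle) distance to the straight-line chord through the embedding space:

```python
import math

def to_xyz(lat, lon):
    """Embed latitude/longitude (radians) on the unit sphere in 3-D."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def great_circle(p, q):
    """Geodesic (arc-length) distance along the sphere's surface."""
    dot = sum(a * b for a, b in zip(to_xyz(*p), to_xyz(*q)))
    return math.acos(max(-1.0, min(1.0, dot)))

def chord(p, q):
    """Straight-line distance through the 3-D embedding space."""
    return math.dist(to_xyz(*p), to_xyz(*q))

# The poles: pi apart along the surface, but only 2 through space.
north = (math.pi / 2, 0.0)
south = (-math.pi / 2, 0.0)
```

The two notions of "near" disagree most for distant points, which is exactly where the choice of distance matters for a kernel.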

In general, the trick's finding the right embeddings for GPs. To tackle the birthday problem (cover of BDA 3), Aki modified distances so that all Mondays are near each other, all days in a given season are near each other, etc. It doesn't just have to be Euclidean distance.
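I don't know exactly what Aki's distance modification looked like, but the standard periodic kernel gives the flavor: with a 7-day period, any two days that fall on the same weekday sit at effective distance zero, no matter how many weeks apart they are. A Python sketch:

```python
import math

def weekly_kernel(t1, t2, lengthscale=1.0):
    """Standard periodic covariance with a 7-day period: days falling
    on the same weekday are maximally similar, however many weeks
    apart they are."""
    return math.exp(-2.0 * math.sin(math.pi * (t1 - t2) / 7.0) ** 2
                    / lengthscale ** 2)

same_weekday = weekly_kernel(0, 7)  # two Mondays, one week apart
mid_week = weekly_kernel(0, 3)      # a Monday and a Thursday
```

Summing or multiplying kernels like this one with smooth long-term and seasonal components is the usual way to build up that kind of structure.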

Radford Neal showed back in the early 90s that the limit of a neural network as the number of hidden units grows is a GP. So it's a very general non-linear curve-fitting model.

Yeah! I was actually reading his thesis a couple of months ago. That result was pretty cool. So far I haven't been able to find an explanation of why using more layers, rather than a wider single layer, has been so successful in ML applications.

In general, the trick's finding the right embeddings for GPs. To tackle the birthday problem (cover of BDA 3), Aki modified distances so that all Mondays are near each other, all days in a given season are near each other, etc. It doesn't just have to be Euclidean distance.

Ah, I haven't had a chance to read the birthday example yet, but that's pretty damn cool! I know a kernel is just a distance function, but I haven't really thought to get that creative with them in my own models.

The number of dimensions is not a problem as long as you have n >> p, or you use a prior which shuts down some dimensions, or dimensionality reduction. Having n even somewhat bigger than p can help you go beyond linearity.

I thought Radford Neal explains that, too. Also, Google for papers on deep GPs.

There are so many recent deep GP papers, but this https://arxiv.org/abs/1711.11280 is a good starting point, as they review four ways to produce "deepness": (1) deep composition, (2) rich kernel functions, (3) deep kernel operators, (4) convolutions.