 # Classification with gaussian process

#1

Hello!
I am new to stan and i wanted to know if there is a way to implemet a stan code for (multi) Classification using Gaussian Process.
Thanks

#2

Other than the usual sources like the documentation and case studies another good place to look is the GitHub repo for example models: https://github.com/stan-dev/example-models

#3

Thanks…I checked the corresponding links and they were quite useful…
However i didnt find something regarding how to deal with multi class problems…
For example, in the manual there is an implementation for a binary classification that uses the bernouli_logit mapping. In case of 3 or more classes the likelihood should be categorical and the mapping should be done with the softmax function.My problem is, if we have a d-dimensional vector of inputs x_1,…,x_d and some of them belong to a different class each time, (so we have vectors with different dimensions belonging to a different class)how we should implement the case for the likelihood?
because the mapping is not 1-1 as it is in this case:

data {
int<lower=1> N1;
vector[N1] x1;
int<lower=0,upper=1> z1[N1];
int<lower=1> N2;
vector[N2] x2;
}
transformed data {
… define mu as zero, compute Sigma from x1, x2 …
}
parameters {
vector[N1] y1;
vector[N2] y2;
}
model {
vector[N] y;
for (n in 1:N1) y[n] = y1[n];
for (n in 1:N2) y[N1 + n] = y2[n];
y ~ multi_normal(mu, Sigma);
for (n in 1:N1)
z1[n] ~ bernoulli_logit(y1[n]);
}

#4

Yeah, I see why you’re asking the question above. I can paste a proper implementation of a Bernoulli likelihood tomorrow (when I’m at a computer). But I believe for multiclass we can use the multinomial distribution (I think probit can be used for many categories but I haven’t fact checked myself).

I can post some code for a multi-class classification tomorrow, it’s on my TODO. I have a half written case study then explains this. We’ll get this running.

#5

hi…this could be very useful for me too. Thanks

#6

Hey all -
@fabio @Kon7

So here’s a “binary classification with GP”, or a “Gaussian prior, with a logistic likelihood”.

We can write this in math as follows, the full posterior being:

\int p(y | X, f) p(f | x) p(x |\theta)

With a non-gaussian likelihood functions, we can’t integrate/marginalize out the latent f, intsead, we use it as the parameter for the logit likelihood function:

p(y |X, f) = log(\frac{f}{1-f})

I believe this is correct, correct me if not. p(f |x) \sim GP(x, x'), p(x|\theta) will be the specified covariance function (sometimes an RBF).

The stan code is below. I’m using x_pred for out of sample data (really could be in between two datapoints, which happens in spatial or temporal data, whatever). NOTE: Here I’m not using the GP’s covariance, I’m only using the latent mean function, f in the likelihood function.

functions {
vector gp_pred_rng(vector[] x_pred,
vector y1, vector[] x,
real magnitude, real[] length_scale) {
int N = rows(y1);
int N_pred = size(x_pred);
vector[N_pred] f2;
{
matrix[N, N] K = gp_matern52_cov(x, magnitude, length_scale);
matrix[N, N] L_K = cholesky_decompose(K);
vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
matrix[N, N_pred] k_x_x_pred = gp_matern52_cov(x, x_pred, magnitude, length_scale);
f2 = (k_x_x_pred' * K_div_y1);
}
return f2;
}
}
data {
int<lower=1> N;
int<lower=1> D;
vector[D] x[N];
int<lower=0,upper=1> y[N];

int<lower=1> N_pred;
vector[D] x_pred[N_pred];
}
parameters {
real<lower=0> magnitude;
real<lower=0> length_scale; // Next release, we can use length_scale[D] for multiple length scales, (ARD)
vector[N] eta;
}
transformed parameters {
vector[N] f;
{
matrix[N, N] K;
matrix[N, N] L_K;
K = gp_matern52_cov(x, magnitude, length_scale);
K = add_diag(K, 1e-12);
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
magnitude ~ // what should the magnitude prior be?
length_scale ~ // what should the length scale prior be?

eta ~ normal(0, 1);

y ~ bernoulli_logit(f);
}
generated quantities {
vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale, sig);
int y_pred[N_pred];
int y_pred_in[N];

for (n in 1:N) y_pred_in[n] = bernoulli_logit_rng(f[n]); // in sample prediction
for (n in 1:N_pred) y_pred[n] = bernoulli_logit_rng(f_pred[n]); // out of sample predictions
}


So for the trinary classification case, give me some time. It will require a bit of thought.

#7

it throws me an error
"
No matches for:

gp_matern52_cov(vector[], real, real[])

"
for the multi class we use the multinomial likelihood with the softmax mapping right?
also how can we implement the ARD efficiently?
Thank you

#8

Thanks for sharing!
I am working on a categorical_logit likelihood, but it is not straightforward for me to figure out an implementation.

#9

Have a look at this post. The error is similar, the solution is the same!

#10

Thank you!
I saw the post but i think i need the specific kernel gp_matern52_cov because it takes as an input a D-dimensional vector regarding the lenghtscale (i need it for ARD) and the other kernels (cov_exp_quad) cannot…
i found the implementation here https://github.com/standev/math/tree/develop/stan/math/prim/mat/fun
but i dont know how to load the library since i am working with R and the code is in C++…

#11

i tried to implement the multiclass case and i ended up in something like this…

functions {
vector gp_pred_rng(vector[] x_pred,
vector y1, vector[] x,
real magnitude, real length_scale) {
int N = rows(y1);
int N_pred = size(x_pred);
vector[N_pred] f2;
{
matrix[N, N] K = 10 + cov_exp_quad(x, magnitude, length_scale) +
diag_matrix(rep_vector(1e-10, N));
matrix[N, N] L_K = cholesky_decompose(K);
vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1’, L_K)’;
matrix[N, N_pred] k_x_x_pred = cov_exp_quad(x, x_pred, magnitude, length_scale);
f2 = (k_x_x_pred’ * K_div_y1);
}
return f2;
}
}
data {
int<lower=1> N;
int<lower=1> D;
vector[D] x[N];
int<lower=0,upper=2> y[N];

int<lower=1> N_pred;
vector[D] x_pred[N_pred];
}
parameters {
real<lower=0> magnitude;
real<lower=0> length_scale;
vector[N] eta;
}
transformed parameters {
vector[N] f;
{
matrix[N, N] K;
matrix[N, N] L_K;
K = 10 +cov_exp_quad(x, magnitude, length_scale) + diag_matrix(rep_vector(10, N));
//K = add_diag(K, 1e-12);
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
magnitude ~ cauchy(0,5);
length_scale ~ cauchy(0,5);

eta ~ normal(0, 1);

y ~ multinomial(softmax(f));
}
generated quantities {
vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale);
int y_pred[N_pred];
int y_pred_in[N];

y_pred_in = multinomial_rng(softmax(f),N); // in sample predictin
y_pred = multinomial_rng(softmax(f_pred),N_pred); // out of sample predictions
}


When i used an amount of jitter=10 i had better accuracy especially in the training set but in the test is still low…(60%)…
Also i would appreciate if you know i way so as i can do ARD at the same time when my parameters are sampled…

#12

Yes, just change it to cov_exp_quad . I was using a matern52 on a local library. This will hopefully be available on next release.

cov_exp_quad is the only covariance function in Stan’s current release.

#13

thanks…is there another way that i can implement the ARD inside the code for the multi-class case?

#14

There’s an implementation of ARD as a user defined function in the Stan 2.17 manual. ARD has nothing to do with the likelihood function, only how one constructs the covariance matrix. Doesn’t matter, you can use ARD and not ARD for logistic regression.

#15

Thank you!

#16
matrix L_cov_exp_quad_ARD(vector[] x,
real alpha,
vector rho,
real delta) {
int N = size(x);
matrix[N, N] K;
real sq_alpha = square(alpha);
for (i in 1:(N-1)) {
K[i, i] = sq_alpha + delta;
for (j in (i + 1):N) {
K[i, j] = sq_alpha
* exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
K[j, i] = K[i, j];
}
}
K[N, N] = sq_alpha + delta;
return cholesky_decompose(K);
}


Sorry for bothering again but here if we have 2 input vectors vector x ,vector y for the covariance matrix what should we do instead of dot_self? i tried with the function pow but i got an error…

#17

not a big deal.

Good catch, I don’t think we had posterior predictive in mind when writing this. It would be something like:

sq_alpha * exp(-.5 * squared_distance(x[i], y[i])./rho^2)

I haven’t checked for accuracy. Just make sure you’re dividing each column by each length scale, respectively. The matrix will be non-square and non symmetric. If you write the code I can check to make sure you’ve implemented it accurately. It’s good practice.

If you want, I have a “beta” version of ARD on my branch, so you don’t have to implement this stuff yourself.

#18

i think i am stuck from getting the prior mean

functions{
// radial basis, or square exponential, function
matrix rbf(vector[] x,,vector[]y,real eta,vector rho,real delta) {
int Nx = size(x);
int Ny = size(y);
matrix[Nx, Ny] K;
real sq_eta = square(eta);
for (i in 1:(Nx-1)) {
K[i, i] = sq_eta + delta;
for (j in (i + 1):Ny) {
K[i, j] = sq_eta
* exp(-0.5 *  exp(-.5 * squared_distance(x[i], y[i])./rho^2);
K[j, i] = K[i, j];
}
}
K[Nx, Ny] = sq_eta + delta;
return K
}

// all of the posterior calculations are now done within this function
vector post_pred_rng(real eta,vector rho, real delta, int No, vector [] xo, int Np, vector[] xp, vector []yobs){
matrix[No,No] Ko;
matrix[Np,Np] Kp;
matrix[No,Np] Kop;
matrix[Np,No] Ko_inv_t;
vector[Np] mu_p;
matrix[Np,Np] Tau;
matrix[Np,Np] L2;
vector[Np] Yp;
// note the use of matrix multiplication for the sn noise, to remove any loops
Ko = rbf(xo,xo, eta, rho,delta) + diag_matrix(rep_vector(1, No))*sn ;
Kp = rbf(xp,xp, eta, rho,delta) + diag_matrix(rep_vector(1, Np))*sn ;
Kop = rbf(xo,xp, eta, rho,delta) ;
Ko_inv_t = Kop' / Ko;
mu_p = Ko_inv_t * yobs;
Tau = Kp - Ko_inv_t * Kop;
L2 = cholesky_decompose(Tau);
Yp = mu_p + L2*rep_vector(normal_rng(0,1), Np);
return Yp;
}
}


because here “mu_p = Ko_inv_t * yobs;” i cant do the multiplication … it is matrix * vector[]
it is somehow correct???

#19

When i use exp(-.5 * squared_distance(x[i], y[i])./rho^2) i get this error…
SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Base type mismatch in assignment; variable name = K, type = real; right-hand side type=vector

#20

I didn’t run it through the language parser. I haven’t implemented this in Stan language. I think with some massaging datatypes one can figure it out