Hello!

I am new to Stan and I wanted to know if there is a way to implement Stan code for (multi-class) classification using a Gaussian process.

Thanks

# Classification with Gaussian process

**Kon7**#1

Other than the usual sources like the documentation and case studies, another good place to look is the GitHub repo for example models: https://github.com/stan-dev/example-models

**Kon7**#3

Thanks! I checked the corresponding links and they were quite useful.

However, I didn't find anything about how to deal with multi-class problems.

For example, in the manual there is an implementation of binary classification that uses the bernoulli_logit mapping. In the case of 3 or more classes the likelihood should be categorical and the mapping should be done with the softmax function. My problem is: if we have a d-dimensional vector of inputs x_1, ..., x_d and some of them belong to a different class each time (so we have vectors of different dimensions belonging to different classes), how should we implement the likelihood?

Because the mapping is not 1-1 as it is in this case:

```
data {
  int<lower=1> N1;
  vector[N1] x1;
  int<lower=0,upper=1> z1[N1];
  int<lower=1> N2;
  vector[N2] x2;
}
transformed data {
  // ... define mu as zero, compute Sigma from x1, x2 ...
}
parameters {
  vector[N1] y1;
  vector[N2] y2;
}
model {
  vector[N] y;
  for (n in 1:N1) y[n] = y1[n];
  for (n in 1:N2) y[N1 + n] = y2[n];
  y ~ multi_normal(mu, Sigma);
  for (n in 1:N1)
    z1[n] ~ bernoulli_logit(y1[n]);
}
```
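For intuition, the categorical-with-softmax mapping mentioned above can be sketched numerically. This is a Python sketch, not Stan, and the latent values are made up: one latent GP value per class is pushed through a softmax to give class probabilities.

```python
import numpy as np

def softmax(f):
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(f - np.max(f))
    return z / z.sum()

# Hypothetical latent GP values for one observation and K = 3 classes.
f_n = np.array([0.2, 1.5, -0.3])
probs = softmax(f_n)           # class probabilities; they sum to 1
label = int(np.argmax(probs))  # a point prediction for the class
```

The same softmax is applied independently for each observation, so every observation gets its own probability vector over the K classes.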

**drezap**#4

Yeah, I see why you're asking the question above. I can paste a proper implementation of a Bernoulli likelihood tomorrow (when I'm at a computer). But I believe for multi-class we can use the multinomial distribution (I think probit can be used for many categories, but I haven't fact-checked myself).

I can post some code for multi-class classification tomorrow; it's on my TODO. I have a half-written case study that explains this. We'll get this running.

**drezap**#6

So here's a "binary classification with GP", or a "Gaussian prior with a logistic likelihood".

We can write this in math as follows, the full posterior being:

\int p(y | X, f) p(f | x) p(x |\theta)

With a non-Gaussian likelihood function, we can't integrate/marginalize out the latent f; instead, we use it as the parameter for the logit likelihood function:

p(y |X, f) = log(\frac{f}{1-f})

I believe this is correct; correct me if not. p(f | x) \sim GP(x, x'), and p(x | \theta) will be the specified covariance function (sometimes an RBF).
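To make the logit relationship concrete, here is a small Python sketch (the latent values are assumed, not from the thread): the latent f is mapped through the inverse logit (sigmoid) to a Bernoulli success probability, which is what `bernoulli_logit` does internally.

```python
import math

def inv_logit(f):
    # Maps a real-valued latent f to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-f))

p = inv_logit(0.0)  # a latent value of 0 maps to probability 0.5
```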

The Stan code is below. I'm using `x_pred` for out-of-sample data (it could really be in between two datapoints, which happens with spatial or temporal data, whatever). **NOTE**: here I'm not using the GP's covariance; I'm *only using the latent mean function f* in the likelihood function.

```
functions {
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real length_scale) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = gp_matern52_cov(x, magnitude, length_scale);
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_matern52_cov(x, x_pred, magnitude, length_scale);
      f2 = (k_x_x_pred' * K_div_y1);
    }
    return f2;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  int<lower=0,upper=1> y[N];
  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale; // next release, we can use length_scale[D] for multiple length scales (ARD)
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K;
    matrix[N, N] L_K;
    K = gp_matern52_cov(x, magnitude, length_scale);
    K = add_diag(K, 1e-12);
    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  // what should the magnitude prior be?
  // what should the length scale prior be?
  eta ~ normal(0, 1);
  y ~ bernoulli_logit(f);
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale);
  int y_pred[N_pred];
  int y_pred_in[N];
  for (n in 1:N) y_pred_in[n] = bernoulli_logit_rng(f[n]); // in-sample predictions
  for (n in 1:N_pred) y_pred[n] = bernoulli_logit_rng(f_pred[n]); // out-of-sample predictions
}
```
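The two triangular solves in `gp_pred_rng` compute K^{-1} y without ever forming the inverse, and the predictive mean is the cross-covariance times that vector. A Python sketch of the same algebra, using a squared-exponential kernel as a stand-in (all data here is made up):

```python
import numpy as np

def rbf(xa, xb, magnitude, length_scale):
    # Squared-exponential kernel between two sets of 1-D points.
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return magnitude**2 * np.exp(-0.5 * d2 / length_scale**2)

x = np.linspace(0, 1, 5)            # observed inputs (made up)
y = np.sin(2 * np.pi * x)           # observed latent values (made up)
x_pred = np.array([0.25, 0.75])     # out-of-sample inputs

K = rbf(x, x, 1.0, 0.3) + 1e-8 * np.eye(len(x))  # jitter for stability
L = np.linalg.cholesky(K)
# Two triangular solves give K^{-1} y, as in the Stan function.
K_div_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
f_pred = rbf(x, x_pred, 1.0, 0.3).T @ K_div_y    # predictive mean
```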

So for the trinary classification case, give me some time. It will require a bit of thought.

**Kon7**#7

It throws me an error:

```
No matches for:
  gp_matern52_cov(vector[], real, real[])
Function gp_matern52_cov not found.
```

For the multi-class case we use the multinomial likelihood with the softmax mapping, right?

Also, how can we implement ARD efficiently?

Thank you

**fabio**#8

Thanks for sharing!

I am working on a `categorical_logit` likelihood, but it is not straightforward for me to figure out an implementation.

**Kon7**#10

Thank you!

I saw the post, but I think I need the specific kernel gp_matern52_cov because it takes a D-dimensional length-scale vector as input (which I need for ARD), and the other kernels (cov_exp_quad) cannot.

I found the implementation here: https://github.com/stan-dev/math/tree/develop/stan/math/prim/mat/fun

but I don't know how to load the library, since I am working with R and the code is in C++.

**Kon7**#11

I tried to implement the multi-class case and ended up with something like this:

```
functions {
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real length_scale) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = 10 + cov_exp_quad(x, magnitude, length_scale) +
                       diag_matrix(rep_vector(1e-10, N));
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = cov_exp_quad(x, x_pred, magnitude, length_scale);
      f2 = (k_x_x_pred' * K_div_y1);
    }
    return f2;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  int<lower=0,upper=2> y[N];
  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K;
    matrix[N, N] L_K;
    K = 10 + cov_exp_quad(x, magnitude, length_scale) + diag_matrix(rep_vector(10, N));
    //K = add_diag(K, 1e-12);
    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  magnitude ~ cauchy(0, 5);
  length_scale ~ cauchy(0, 5);
  eta ~ normal(0, 1);
  y ~ multinomial(softmax(f));
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale);
  int y_pred[N_pred];
  int y_pred_in[N];
  y_pred_in = multinomial_rng(softmax(f), N); // in-sample prediction
  y_pred = multinomial_rng(softmax(f_pred), N_pred); // out-of-sample predictions
}
```
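One thing worth flagging about the likelihood above: a multinomial models a single vector of category *counts* over N trials, whereas per-observation class labels are usually modeled with a categorical likelihood, one softmax per observation (the `categorical_logit` approach fabio mentions). A Python sketch of the difference, with made-up probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.2, 0.5, 0.3])  # softmax output, 3 classes (made up)
N = 1000

# Multinomial: ONE draw of counts over the 3 classes for N trials.
counts = rng.multinomial(N, probs)       # length-3 array summing to N

# Categorical: one label per observation, N draws total.
labels = rng.choice(3, size=N, p=probs)  # length-N array of 0/1/2
```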

When I used a jitter of 10 I got better accuracy, especially on the training set, but test accuracy is still low (60%).

Also, I would appreciate it if you know a way to do ARD at the same time my parameters are sampled.

**drezap**#12

Yes, just change it to `cov_exp_quad`. I was using a matern52 from a local library; it will hopefully be available in the next release. `cov_exp_quad` is the only covariance function in Stan's current release.

**Kon7**#13

Thanks. Is there another way to implement ARD inside the code for the multi-class case?

**drezap**#14

There's an implementation of ARD as a user-defined function in the Stan 2.17 manual. ARD has nothing to do with the likelihood function, only with how one constructs the covariance matrix. So it doesn't matter: you can use ARD or not ARD for logistic regression.
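The point that ARD only changes the covariance construction can be illustrated with a small Python sketch of a squared-exponential kernel with one length scale per input dimension (all values made up):

```python
import numpy as np

def rbf_ard(X, alpha, rho):
    # X: (N, D) inputs; rho: (D,) length scales, one per dimension.
    # Each dimension is scaled by its own length scale before the
    # squared distance is taken -- that is all ARD does.
    Z = X / rho
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return alpha**2 * np.exp(-0.5 * d2)

X = np.array([[0.0, 1.0], [0.5, 0.2], [1.0, 0.9]])
K = rbf_ard(X, alpha=1.0, rho=np.array([0.3, 2.0]))
```

The resulting matrix is symmetric with alpha^2 on the diagonal, exactly as in the Stan function below, regardless of which likelihood it is later paired with.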

**Kon7**#16

```
matrix L_cov_exp_quad_ARD(vector[] x,
                          real alpha,
                          vector rho,
                          real delta) {
  int N = size(x);
  matrix[N, N] K;
  real sq_alpha = square(alpha);
  for (i in 1:(N-1)) {
    K[i, i] = sq_alpha + delta;
    for (j in (i + 1):N) {
      K[i, j] = sq_alpha
                * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
      K[j, i] = K[i, j];
    }
  }
  K[N, N] = sq_alpha + delta;
  return cholesky_decompose(K);
}
```

Sorry to bother you again, but if we have two input vectors `x` and `y` for the covariance matrix, what should we use instead of dot_self? I tried the pow function but got an error.

**drezap**#17

Not a big deal.

Good catch, I don't think we had posterior prediction in mind when writing this. It would be something like:

`sq_alpha * exp(-.5 * squared_distance(x[i], y[i])./rho^2)`

I haven't checked that for accuracy. Just make sure you're dividing each column by each length scale, respectively. The matrix will be non-square and non-symmetric. If you write the code I can check that you've implemented it accurately. It's good practice.

If you want, I have a "beta" version of ARD on my branch, so you don't have to implement this stuff yourself.
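A Python sketch of that cross-covariance: the matrix between observed and prediction inputs, with each dimension divided by its own length scale. It is non-square in general, as noted above (shapes and values here are made up):

```python
import numpy as np

def rbf_ard_cross(X, Y, alpha, rho):
    # X: (Nx, D), Y: (Ny, D), rho: (D,) length scales.
    # The result is Nx x Ny: in general neither square nor symmetric.
    Zx, Zy = X / rho, Y / rho
    d2 = ((Zx[:, None, :] - Zy[None, :, :]) ** 2).sum(axis=-1)
    return alpha**2 * np.exp(-0.5 * d2)

X = np.random.default_rng(2).normal(size=(4, 2))  # observed inputs
Y = np.random.default_rng(3).normal(size=(6, 2))  # prediction inputs
K_cross = rbf_ard_cross(X, Y, alpha=1.0, rho=np.array([0.5, 1.5]))
```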

**Kon7**#18

I think I am stuck on computing the prior mean:

```
functions {
  // radial basis, or squared exponential, function
  matrix rbf(vector[] x, vector[] y, real eta, vector rho, real delta) {
    int Nx = size(x);
    int Ny = size(y);
    matrix[Nx, Ny] K;
    real sq_eta = square(eta);
    for (i in 1:(Nx-1)) {
      K[i, i] = sq_eta + delta;
      for (j in (i + 1):Ny) {
        K[i, j] = sq_eta * exp(-0.5 * squared_distance(x[i], y[j]) ./ rho^2);
        K[j, i] = K[i, j];
      }
    }
    K[Nx, Ny] = sq_eta + delta;
    return K;
  }
  // all of the posterior calculations are now done within this function
  vector post_pred_rng(real eta, vector rho, real delta, real sn,
                       int No, vector[] xo, int Np, vector[] xp, vector[] yobs) {
    matrix[No, No] Ko;
    matrix[Np, Np] Kp;
    matrix[No, Np] Kop;
    matrix[Np, No] Ko_inv_t;
    vector[Np] mu_p;
    matrix[Np, Np] Tau;
    matrix[Np, Np] L2;
    vector[Np] Yp;
    // note the use of matrix multiplication for the sn noise, to remove any loops
    Ko = rbf(xo, xo, eta, rho, delta) + diag_matrix(rep_vector(1, No)) * sn;
    Kp = rbf(xp, xp, eta, rho, delta) + diag_matrix(rep_vector(1, Np)) * sn;
    Kop = rbf(xo, xp, eta, rho, delta);
    Ko_inv_t = Kop' / Ko;
    mu_p = Ko_inv_t * yobs;
    Tau = Kp - Ko_inv_t * Kop;
    L2 = cholesky_decompose(Tau);
    Yp = mu_p + L2 * rep_vector(normal_rng(0, 1), Np);
    return Yp;
  }
}
```

because here `mu_p = Ko_inv_t * yobs;` I can't do the multiplication: it is matrix * vector[].

Is it somehow correct?

**Kon7**#19

When I use `exp(-.5 * squared_distance(x[i], y[i])./rho^2)` I get this error:

```
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Base type mismatch in assignment; variable name = K, type = real; right-hand side type = vector
```

**drezap**#20

I didn't run it through the language parser; I haven't implemented this in the Stan language. I think with some massaging of datatypes one can figure it out.