Classification with gaussian process

Kon7 · February 21, 2019, 3:53pm

Hello!
I am new to stan and i wanted to know if there is a way to implemet a stan code for (multi) Classification using Gaussian Process.
Thanks

increasechief · February 21, 2019, 4:11pm

Other than the usual sources like the documentation and case studies another good place to look is the GitHub repo for example models: https://github.com/stan-dev/example-models

Kon7 · February 25, 2019, 1:50pm

Thanks…I checked the corresponding links and they were quite useful…
However i didnt find something regarding how to deal with multi class problems…
For example, in the manual there is an implementation for a binary classification that uses the bernouli_logit mapping. In case of 3 or more classes the likelihood should be categorical and the mapping should be done with the softmax function.My problem is, if we have a d-dimensional vector of inputs x_1,…,x_d and some of them belong to a different class each time, (so we have vectors with different dimensions belonging to a different class)how we should implement the case for the likelihood?
because the mapping is not 1-1 as it is in this case:

data {
int<lower=1> N1;
vector[N1] x1;
int<lower=0,upper=1> z1[N1];
int<lower=1> N2;
vector[N2] x2;
}
transformed data {
… define mu as zero, compute Sigma from x1, x2 …
}
parameters {
vector[N1] y1;
vector[N2] y2;
}
model {
vector[N] y;
for (n in 1:N1) y[n] = y1[n];
for (n in 1:N2) y[N1 + n] = y2[n];
y ~ multi_normal(mu, Sigma);
for (n in 1:N1)
z1[n] ~ bernoulli_logit(y1[n]);
}

anon79882417 · February 26, 2019, 2:09am

Yeah, I see why you’re asking the question above. I can paste a proper implementation of a Bernoulli likelihood tomorrow (when I’m at a computer). But I believe for multiclass we can use the multinomial distribution (I think probit can be used for many categories but I haven’t fact checked myself).

I can post some code for a multi-class classification tomorrow, it’s on my TODO. I have a half written case study then explains this. We’ll get this running.

fabio · February 26, 2019, 12:41pm

hi…this could be very useful for me too. Thanks

anon79882417 · February 27, 2019, 12:26am

Hey all -
@fabio @Kon7

So here’s a “binary classification with GP”, or a “Gaussian prior, with a logistic likelihood”.

We can write this in math as follows, the full posterior being:

\int p(y | X, f) p(f | x) p(x |\theta)

With a non-gaussian likelihood functions, we can’t integrate/marginalize out the latent f, intsead, we use it as the parameter for the logit likelihood function:

p(y |X, f) = log(\frac{f}{1-f})

I believe this is correct, correct me if not. p(f |x) \sim GP(x, x'), p(x|\theta) will be the specified covariance function (sometimes an RBF).

The stan code is below. I’m using x_pred for out of sample data (really could be in between two datapoints, which happens in spatial or temporal data, whatever). NOTE: Here I’m not using the GP’s covariance, I’m only using the latent mean function, f in the likelihood function.

functions {
  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real[] length_scale) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = gp_matern52_cov(x, magnitude, length_scale);
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_matern52_cov(x, x_pred, magnitude, length_scale);
      f2 = (k_x_x_pred' * K_div_y1);
    }
    return f2;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  int<lower=0,upper=1> y[N];

  int<lower=1> N_pred;
  vector[D] x_pred[N_pred];
}
parameters {
  real<lower=0> magnitude;
  real<lower=0> length_scale; // Next release, we can use length_scale[D] for multiple length scales, (ARD)
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] K;
    matrix[N, N] L_K;   
    K = gp_matern52_cov(x, magnitude, length_scale);
    K = add_diag(K, 1e-12);
    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  magnitude ~ // what should the magnitude prior be?
  length_scale ~ // what should the length scale prior be?
  
  eta ~ normal(0, 1);

  y ~ bernoulli_logit(f);
}
generated quantities {
  vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale, sig);
  int y_pred[N_pred];
  int y_pred_in[N];
  
  for (n in 1:N) y_pred_in[n] = bernoulli_logit_rng(f[n]); // in sample prediction
  for (n in 1:N_pred) y_pred[n] = bernoulli_logit_rng(f_pred[n]); // out of sample predictions
}

So for the trinary classification case, give me some time. It will require a bit of thought.

Kon7 · February 27, 2019, 7:11am

it throws me an error
"
No matches for:

gp_matern52_cov(vector[], real, real[])

Function gp_matern52_cov not found.
"
for the multi class we use the multinomial likelihood with the softmax mapping right?
also how can we implement the ARD efficiently?
Thank you

fabio · February 27, 2019, 9:06am

Thanks for sharing!
I am working on a categorical_logit likelihood, but it is not straightforward for me to figure out an implementation.

fabio · February 27, 2019, 1:49pm

Have a look at this post. The error is similar, the solution is the same!

Kon7 · February 27, 2019, 1:57pm

Thank you!
I saw the post but i think i need the specific kernel gp_matern52_cov because it takes as an input a D-dimensional vector regarding the lenghtscale (i need it for ARD) and the other kernels (cov_exp_quad) cannot…
i found the implementation here https://github.com/standev/math/tree/develop/stan/math/prim/mat/fun
but i dont know how to load the library since i am working with R and the code is in C++…

Kon7 · February 27, 2019, 3:06pm

i tried to implement the multiclass case and i ended up in something like this…

functions {
vector gp_pred_rng(vector[] x_pred,
vector y1, vector[] x,
real magnitude, real length_scale) {
int N = rows(y1);
int N_pred = size(x_pred);
vector[N_pred] f2;
{
matrix[N, N] K = 10 + cov_exp_quad(x, magnitude, length_scale) +
diag_matrix(rep_vector(1e-10, N));
matrix[N, N] L_K = cholesky_decompose(K);
vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1’, L_K)’;
matrix[N, N_pred] k_x_x_pred = cov_exp_quad(x, x_pred, magnitude, length_scale);
f2 = (k_x_x_pred’ * K_div_y1);
}
return f2;
}
}
data {
int<lower=1> N;
int<lower=1> D;
vector[D] x[N];
int<lower=0,upper=2> y[N];

int<lower=1> N_pred;
vector[D] x_pred[N_pred];
}
parameters {
real<lower=0> magnitude;
real<lower=0> length_scale;
vector[N] eta;
}
transformed parameters {
vector[N] f;
{
matrix[N, N] K;
matrix[N, N] L_K;
K = 10 +cov_exp_quad(x, magnitude, length_scale) + diag_matrix(rep_vector(10, N));
//K = add_diag(K, 1e-12);
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
magnitude ~ cauchy(0,5);
length_scale ~ cauchy(0,5);

eta ~ normal(0, 1);

y ~ multinomial(softmax(f));
}
generated quantities {
vector[N_pred] f_pred = gp_pred_rng(x_pred, f, x, magnitude, length_scale);
int y_pred[N_pred];
int y_pred_in[N];

y_pred_in = multinomial_rng(softmax(f),N); // in sample predictin
y_pred = multinomial_rng(softmax(f_pred),N_pred); // out of sample predictions
}

When i used an amount of jitter=10 i had better accuracy especially in the training set but in the test is still low…(60%)…
Also i would appreciate if you know i way so as i can do ARD at the same time when my parameters are sampled…

anon79882417 · February 27, 2019, 8:27pm

@Kon7
@fabio

Yes, just change it to cov_exp_quad . I was using a matern52 on a local library. This will hopefully be available on next release.

cov_exp_quad is the only covariance function in Stan’s current release.

Kon7 · February 27, 2019, 8:31pm

thanks…is there another way that i can implement the ARD inside the code for the multi-class case?

anon79882417 · February 27, 2019, 8:33pm

There’s an implementation of ARD as a user defined function in the Stan 2.17 manual. ARD has nothing to do with the likelihood function, only how one constructs the covariance matrix. Doesn’t matter, you can use ARD and not ARD for logistic regression.

Kon7 · February 27, 2019, 8:40pm

Thank you!

Kon7 · February 27, 2019, 9:37pm

matrix L_cov_exp_quad_ARD(vector[] x,
real alpha,
vector rho,
real delta) {
int N = size(x);
matrix[N, N] K;
real sq_alpha = square(alpha);
for (i in 1:(N-1)) {
K[i, i] = sq_alpha + delta;
for (j in (i + 1):N) {
K[i, j] = sq_alpha
* exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
K[j, i] = K[i, j];
}
}
K[N, N] = sq_alpha + delta;
return cholesky_decompose(K);
}

Sorry for bothering again but here if we have 2 input vectors vector x ,vector y for the covariance matrix what should we do instead of dot_self? i tried with the function pow but i got an error…

anon79882417 · February 27, 2019, 9:52pm

not a big deal.

Good catch, I don’t think we had posterior predictive in mind when writing this. It would be something like:

sq_alpha * exp(-.5 * squared_distance(x[i], y[i])./rho^2)

I haven’t checked for accuracy. Just make sure you’re dividing each column by each length scale, respectively. The matrix will be non-square and non symmetric. If you write the code I can check to make sure you’ve implemented it accurately. It’s good practice.

If you want, I have a “beta” version of ARD on my branch, so you don’t have to implement this stuff yourself.

Kon7 · February 27, 2019, 10:09pm

i think i am stuck from getting the prior mean

functions{
	// radial basis, or square exponential, function
	matrix rbf(vector[] x,,vector[]y,real eta,vector rho,real delta) {
          int Nx = size(x);
          int Ny = size(y);
          matrix[Nx, Ny] K;
          real sq_eta = square(eta);
          for (i in 1:(Nx-1)) {
          K[i, i] = sq_eta + delta;
          for (j in (i + 1):Ny) {
          K[i, j] = sq_eta
          * exp(-0.5 *  exp(-.5 * squared_distance(x[i], y[i])./rho^2);
          K[j, i] = K[i, j];
          }
        }
          K[Nx, Ny] = sq_eta + delta;
 return K
  }


        // all of the posterior calculations are now done within this function
	vector post_pred_rng(real eta,vector rho, real delta, int No, vector [] xo, int Np, vector[] xp, vector []yobs){
			matrix[No,No] Ko;
			matrix[Np,Np] Kp;
			matrix[No,Np] Kop;
			matrix[Np,No] Ko_inv_t;
			vector[Np] mu_p;
			matrix[Np,Np] Tau;
			matrix[Np,Np] L2;
			vector[Np] Yp;
    // note the use of matrix multiplication for the sn noise, to remove any loops
			Ko = rbf(xo,xo, eta, rho,delta) + diag_matrix(rep_vector(1, No))*sn ;
			Kp = rbf(xp,xp, eta, rho,delta) + diag_matrix(rep_vector(1, Np))*sn ;
			Kop = rbf(xo,xp, eta, rho,delta) ;
			Ko_inv_t = Kop' / Ko;
			mu_p = Ko_inv_t * yobs;
			Tau = Kp - Ko_inv_t * Kop;
			L2 = cholesky_decompose(Tau);
			Yp = mu_p + L2*rep_vector(normal_rng(0,1), Np);
			return Yp;
	}
}

because here “mu_p = Ko_inv_t * yobs;” i cant do the multiplication … it is matrix * vector[]
it is somehow correct???

Kon7 · February 27, 2019, 11:09pm

When i use exp(-.5 * squared_distance(x[i], y[i])./rho^2) i get this error…
SYNTAX ERROR, MESSAGE(S) FROM PARSER:

Base type mismatch in assignment; variable name = K, type = real; right-hand side type=vector

anon79882417 · February 28, 2019, 11:43pm

I didn’t run it through the language parser. I haven’t implemented this in Stan language. I think with some massaging datatypes one can figure it out

Topic		Replies	Views
Bayesian Neural Network with Categorical Likelihood: efficient implementation Modeling	5	712	March 16, 2021
User-defined models for gaussian processes with non-gaussian likelihoods (and other generic models) Modeling specification , gaussian-process	19	1261	May 9, 2018
How to model mixed variables Gaussian, Ranking binary and categorical in stan all together to generate a joint distribution of them? Modeling	2	447	October 14, 2023
Trying to implement a latent Gaussian model of categorical data (categorical PCA) Modeling	3	1022	November 25, 2019
Gaussian Process Latent Variable Models in Stan? Modeling gaussian-process	3	890	February 25, 2021

Classification with gaussian process

Related topics