Gaussian process negative binomial

Hi all,

I’m new to Stan and Gaussian processes, but I have a problem at hand where they could potentially be very useful.

I want to build a non-parametric Gaussian process model with an exponentiated quadratic kernel to model count data with a negative binomial distribution. My y is a vector of counts and my x is a vector of natural numbers.

I followed Michael’s excellent blog post (Robust Gaussian Processes in Stan), although I have a few doubts about how to modify the existing example to use a neg_binomial_2 distribution.

Could you point me to any example that achieves this, similar to the one below? (I found this thread: Redirecting to Google Groups, but I couldn’t understand/find the final model.)

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

model {
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_cov = cholesky_decompose(cov);

  y ~ multi_normal_cholesky(rep_vector(0, N), L_cov);
}

I also would like to fit this model to two different y’s and find where the two GPs differ significantly (I’ll also need to include a specific offset for each y). I presume I can do this by calculating |GP1(x) - GP2(x)| and checking whether the difference is larger than the expected variation of GP1 and GP2.

I apologise in advance for the basic questions, and thank you for any opinions/suggestions you might have.

Thanks,

Could you point me to any example that could achieve this

So neg_binomial_2 is different from poisson_log in that it has two parameters instead of one, and they are constrained (so they both need to be greater than zero).

For the first, you just get to decide whether you want one or two GPs. Maybe just start with a GP on the mean, and a single parameter for sigma? But that’s a function of your data I guess.

The second thing is that, by default, the GP latent variables are on an unconstrained space. If your GP latent variables are f, the simplest thing to do is exp(f) to constrain them. That’s what makes y ~ poisson_log(f) work anyway – it’s more or less the same thing as y ~ poisson(exp(f)).

So just try something like y ~ neg_binomial_2(exp(f), sigma), if that makes sense.
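As a rough sketch (the data/parameter declarations, the eta naming, and the 1e-9 jitter here are my own assumptions, not anything from the blog post), the latent-GP version could look like:

```stan
data {
  int<lower=1> N;
  real x[N];
  int<lower=0> y[N];  // counts must be integers for neg_binomial_2
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> phi;  // negative binomial overdispersion
  vector[N] eta;      // standard normal draws for the non-centered GP
}
model {
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // jitter for stability
  vector[N] f = cholesky_decompose(cov) * eta;            // latent GP values

  eta ~ normal(0, 1);
  y ~ neg_binomial_2(exp(f), phi);
}
```

neg_binomial_2_log(f, phi) does the exp internally and is numerically more stable, so that form is usually preferable.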

I also would like to fit this model to two different y’s and find where these two GPs differ significantly (I’ll also need to contain a specific offset to each y)

This sounds like a generated quantities thing. Fit two GPs to data, and then in generated quantities draw from them and compare the numbers you get out? Is this a model comparison thing? Or is it figuring out the differences in two separate but comparable processes in your application? Model comparison is always trickier than it seems, which is why I’m asking (so I don’t point you in the wrong direction).
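For the comparison itself, something like this might work (a sketch assuming both GPs are fit in one program, with their latents f1 and f2 declared in transformed parameters):

```stan
generated quantities {
  vector[N] diff;
  for (n in 1:N)
    diff[n] = exp(f1[n]) - exp(f2[n]);  // difference of the two GP means at x[n]
}
```

Then you can look at the posterior intervals of diff and see where they exclude zero.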


Please see the materials for “Modeling the Rate of Public Mass Shootings with Gaussian Processes” by Nathan Sanders and Victor Lei from https://github.com/stan-dev/stancon_talks.


Thank you both for the quick replies and very helpful resources.

I tried to amend the model I had:

data {
    int<lower=1> N;
    real x[N];
    vector[N] y;
}

parameters {
    real<lower=0> rho;
    real<lower=0> alpha;
    real<lower=0> sigma;
    real<lower=0> phi;
    vector[N] y_tilde;
}

model {
    matrix[N, N] cov = cov_exp_quad(x, alpha, rho) + diag_matrix(rep_vector(sigma, N));
    matrix[N, N] L_cov = cholesky_decompose(cov);

    vector[N] f = L_cov * y_tilde;

    y ~ neg_binomial_2_log(f, phi);
}

As suggested by @bbbales2, I’ll start with a GP on the mean and estimate sigma, which I think is all I need. The problem is that the model is not yet correct (my limited experience with Stan doesn’t help).

I also would like to fit this model to two different y’s and find where these two GPs differ significantly (I’ll also need to contain a specific offset to each y)

This sounds like a generated quantities thing. Fit two GPs to data, and then in generated quantities draw from them and compare the numbers you get out? Is this a model comparison thing? Or is it figuring out the differences in two separate but comparable processes in your application? Model comparison is always trickier than it seems, which is why I’m asking (so I don’t point you in the wrong direction).

A description of my bioinformatics problem might help to clarify. Briefly, I have a phenotypic response (y) to multiple individual mutations located across the chromosome (x). This was acquired for multiple cell lines and a control condition. My idea is to fit a GP along the chromosome for each condition and compare each to the GP fitted to the control. Basically, I would like to find the x’s for which the difference between a condition and the control is greatest.

I presume I don’t necessarily need a full model comparison; rather, I would draw from the GPs and compare.

Remember to include:

y_tilde ~ normal(0, 1)

You need that for the non-centered parameterization of the GP to make sense. Also put priors on alpha, rho, and phi.

rho (the lengthscale) is the tricky one. You’ll probably want a zero-avoiding prior that doesn’t put much mass below the smallest length scales in your system. Variants of gamma(4, 4) are what appear in the manual these days.

I don’t think you necessarily want the sigma added onto the diagonal here. That’s really for when your outputs are normal and sigma is how much noise you have. phi plays the same role here. The GP itself is really just the cov_exp_quad bit.
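Putting those three changes into your model gives something like this (a sketch: the priors are placeholders you should adapt, and y needs to be declared as an integer array in data for the negative binomial to compile):

```stan
model {
  matrix[N, N] cov = cov_exp_quad(x, alpha, rho)
                     + diag_matrix(rep_vector(1e-9, N));  // small jitter only, no sigma
  matrix[N, N] L_cov = cholesky_decompose(cov);
  vector[N] f = L_cov * y_tilde;

  rho ~ gamma(4, 4);       // zero-avoiding lengthscale prior
  alpha ~ normal(0, 1);    // half-normal via the <lower=0> constraint
  phi ~ normal(0, 5);      // half-normal via the <lower=0> constraint
  y_tilde ~ normal(0, 1);  // required for the non-centered parameterization

  y ~ neg_binomial_2_log(f, phi);
}
```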