Gaussian Process regression: done but could be improved?

performance

#1

Hi folks, I’ve coded up a model for Gaussian Process regression in which a numeric variable X influences Y through a GP (with the possibility of multiple observations of Y per X), while additional variables Q, R, S, etc. are modelled as influencing that GP. The code is below and seems to work well, but I suspect that the last part (starting at the comment “# loop over observations”) could be made more efficient. Any suggestions?
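To make the setup concrete, here is a minimal numpy sketch of the kind of data this model is for; the sizes, seed, and sinusoid stand-ins for the latent GP draws are illustrative assumptions, not part of the Stan code below:

```python
import numpy as np

rng = np.random.default_rng(0)

n_x, n_w, reps = 8, 3, 5          # unique x values, predictor columns, observations per x
x = np.linspace(0.0, 1.0, n_x)    # x scaled to [0, 1], as the model expects

# one smooth latent function per predictor column
# (sinusoids stand in for GP draws in this sketch)
f = np.stack([np.sin((wi + 1) * np.pi * x) for wi in range(n_w)])  # shape (n_w, n_x)

x_index = np.repeat(np.arange(n_x), reps)      # which unique x each y belongs to
w = rng.normal(size=(n_x * reps, n_w))         # predictor matrix, one row per observation

# each observation's mean: sum over predictors of w[i, wi] * f_wi(x_i)
mu = (w * f[:, x_index].T).sum(axis=1)
y = mu + rng.normal(scale=0.1, size=mu.shape)  # gaussian measurement noise
```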

functions{
    # GP: computes noiseless Gaussian Process
    vector GP(real volatility, real amplitude, vector normal01, int n_x, real[] x ) {
        matrix[n_x,n_x] cov_mat ;
        real amplitude_sq_plus_jitter ;
        amplitude_sq_plus_jitter = amplitude^2 + 0.001 ;
        cov_mat = cov_exp_quad(x, amplitude, 1/volatility) ;
        for(i in 1:n_x){
            cov_mat[i,i] = amplitude_sq_plus_jitter ;
        }
        return(cholesky_decompose(cov_mat) * normal01 ) ;
    }
}

data {

    # n_y: number of observations in y
    int<lower=1> n_y ;

    # y: vector of observations for y
    #     should be scaled to mean=0,sd=1
    vector[n_y] y ;

    # n_x: number of unique x values
    int<lower=1> n_x ;

    # x: unique values of x
    #     should be scaled to min=0,max=1
    real x[n_x] ;

    # x_index: vector indicating which x is associated with each y
    int x_index[n_y] ;

    # n_w: number of columns in predictor matrix w
    int n_w ;

    # w: predictor matrix (each column gets its own GP)
    matrix[n_y,n_w] w ;

}

parameters {

    # noise: measurement noise
    real<lower=0> noise ;

    # volatility_helper: helper for cauchy-distributed volatility (see transformed parameters)
    vector<lower=0,upper=pi()/2>[n_w] volatility_helper ;

    # amplitude: amplitude of GPs
    vector<lower=0>[n_w] amplitude ;

    # f_normal01: helper variable for GPs (see transformed parameters)
    vector[n_x] f_normal01[n_w] ;

}

transformed parameters{

    # volatility: volatility of GPs (a.k.a. inverse-lengthscale)
    vector[n_w] volatility ;

    # f: GPs
    vector[n_x] f[n_w] ;

    #next line implies volatility ~ cauchy(0,10)
    volatility = 10*tan(volatility_helper) ;

    # loop over predictors, computing GPs for each predictor
    for(wi in 1:n_w){
        f[wi] = GP(
            volatility[wi]
            , amplitude[wi]
            , f_normal01[wi]
            , n_x , x
        ) ;
    }

}

model {

    # noise prior
    noise ~ weibull(2,1) ; #peaked at .8ish

    # amplitude prior
    amplitude ~ weibull(2,1) ; #peaked at .8ish

    # normal(0,1) priors on GP helpers
    for(wi in 1:n_w){
        f_normal01[wi] ~ normal(0,1);
    }

    # loop over observations
    for(yi in 1:n_y){
        real temp ;
        temp = 0 ;
        # loop over predictors to gather sum for this observation's mean
        for(wi in 1:n_w){
            temp = temp + w[yi,wi] * f[wi][x_index[yi]] ;
        }
        # y_scaled as temp with gaussian measurement noise
        y[yi] ~ normal( temp , noise ) ;
    }

}
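For readers less familiar with the construction inside GP(): multiplying the Cholesky factor of the covariance matrix by standard-normal draws yields a draw with the intended GP distribution (the non-centered parameterization). A numpy sketch of the same computation; the function name and defaults mirror the Stan code but this is an illustration, not the model itself:

```python
import numpy as np

def gp_draw(volatility, amplitude, z, x, jitter=1e-3):
    """Non-centered GP draw: returns L @ z, where L L' is the covariance.

    Mirrors the Stan GP() function: squared-exponential kernel with
    lengthscale 1/volatility, variance amplitude^2, and a small diagonal
    jitter for numerical stability.
    """
    x = np.asarray(x, dtype=float)
    sq_dist = (x[:, None] - x[None, :]) ** 2
    lengthscale = 1.0 / volatility
    K = amplitude**2 * np.exp(-sq_dist / (2.0 * lengthscale**2))
    np.fill_diagonal(K, amplitude**2 + jitter)   # cov_mat[i,i] = amplitude^2 + 0.001
    return np.linalg.cholesky(K) @ z

# draw one latent function on a grid of x values
x = np.linspace(0.0, 1.0, 10)
z = np.random.default_rng(1).standard_normal(10)
f = gp_draw(volatility=5.0, amplitude=1.0, z=z, x=x)
```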

#2

Use vectorization. For example, this

for (wi in 1:n_w)
  f_normal01[wi] ~ normal(0, 1);

can just be

to_vector(f_normal01) ~ normal(0, 1);

if you declare it as a matrix. But then that probably causes problems with a multi-normal somewhere. Sometimes the transformation cost isn’t worth a minor gain like this.
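The equivalence is easy to check outside Stan; a small numpy sketch (the hand-rolled log-density and array sizes are purely illustrative):

```python
import numpy as np

def std_normal_lpdf(z):
    # standard normal log-density, elementwise
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 6))   # stands in for f_normal01 declared as a matrix

# looped version: one row at a time, as in the original model block
looped = sum(std_normal_lpdf(f[wi]).sum() for wi in range(f.shape[0]))

# vectorized version: flatten once, evaluate once (the to_vector() analogue)
flat = std_normal_lpdf(f.ravel()).sum()
```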

For the big loop, you can write

for(yi in 1:n_y){
        real temp ;
        temp = 0 ;
        # loop over predictors to gather sum for this observation's mean
        for(wi in 1:n_w){
            temp = temp + w[yi,wi] * f[wi][x_index[yi]] ;
        }
        # y_scaled as temp with gaussian measurement noise
        y[yi] ~ normal( temp , noise ) ;
    }

as

for (i in 1:n_y)
  y[i] ~ normal(sum(w[i] .* f[ , x_index[i]]), noise);

though I’d suggest using a variable name like sigma or sigma_y instead of noise (what you call noise is actually the scale parameter of the noise).
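To see that the two forms agree, here is a quick numpy check outside Stan; the array names mirror the model’s, with f taken as a matrix (one row per predictor), and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_y, n_w, n_x = 12, 3, 5
w = rng.normal(size=(n_y, n_w))
f = rng.normal(size=(n_w, n_x))            # f as a matrix, one row per predictor
x_index = rng.integers(0, n_x, size=n_y)

# double loop, as in the original model block
mu_loop = np.array([sum(w[yi, wi] * f[wi, x_index[yi]] for wi in range(n_w))
                    for yi in range(n_y)])

# vectorized: elementwise product of w's row with f's column, then sum
mu_vec = (w * f[:, x_index].T).sum(axis=1)
```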

P.S. May I ask what inspires creative formatting like this:

        f[wi] = GP(
            volatility[wi]
            , amplitude[wi]
            , f_normal01[wi]
            , n_x , x
        ) ;

It makes it very hard for those of us used to conventional formatting to follow.


#3

Thanks Bob! And sorry for the atypical whitespace; when I post for help I usually strip the reasoned-but-admittedly-idiosyncratic features of my coding style lest, as you say, their atypicality cause confusion, but I forgot to in this case.

I had f_normal01 and f as an array of vectors as a hold-over from a previous attempt at a more complicated model (hierarchical GPs), but now realize that making them matrices instead makes things easier (indeed, it also makes the hierarchical version easier, but I’ll post on that later).

The sum(w[i] .* f[,x_index[i]]) trick to avoid that awkward loop also works great. I also found a much more substantial speed-up, inspired by this thread: in the typical scenario I’ve been imagining there should be lots of redundancy in the matrix w, so I can compute on only the unique rows of w and index into the result when updating the log probability. This yields nearly a 10x speedup and much higher effective sample sizes for the key variables.
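The unique-rows idea can be sketched in numpy (names and sizes here are illustrative; the real implementation is in the gist): build the table of candidate means once from the unique rows of w, then index into it per observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_w, n_unique, reps = 6, 3, 4, 50
f = rng.normal(size=(n_w, n_x))            # latent functions, one row per predictor

# many observations sharing only a few distinct predictor rows
w = np.repeat(rng.normal(size=(n_unique, n_w)), reps, axis=0)
n_y = w.shape[0]
x_index = rng.integers(0, n_x, size=n_y)

# naive: one dot product per observation
mu_naive = np.array([w[yi] @ f[:, x_index[yi]] for yi in range(n_y)])

# deduplicated: one matrix product over the unique rows, then index into it
w_unique, row_id = np.unique(w, axis=0, return_inverse=True)
row_id = np.asarray(row_id).ravel()
mu_table = w_unique @ f                    # shape (n_unique, n_x)
mu_dedup = mu_table[row_id, x_index]
```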

Here’s a gist with a full example of what this code is intended to achieve, including fitting some fake data using both this more efficient approach as well as the less-efficient approach. Note that where I’ve been using “w” to refer to the predictor matrix above, in that example I switched to “z”.


#4

Thanks much for reporting back.

When there’s structure to the data, by all means exploit it for efficiency. It’s always worth finding things you don’t have to compute, or places where simple sufficient-statistic calculations over duplicated data items can reduce calls to expensive functions.
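As a concrete instance of the sufficient-statistic point (a standard identity, sketched in numpy rather than Stan): for replicated gaussian observations sharing a mean, the log-likelihood needs only the count, the sum, and the sum of squares.

```python
import numpy as np

def loglik_full(y, mu, sigma):
    # elementwise normal log-density, summed over all replicates
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2))

def loglik_suffstats(n, s, s2, mu, sigma):
    # same quantity from (count, sum, sum of squares) alone:
    # sum (y - mu)^2 = s2 - 2*mu*s + n*mu^2
    ss = s2 - 2 * mu * s + n * mu**2
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - ss / (2 * sigma**2)

y = np.array([0.3, -0.1, 0.7, 0.2])
full = loglik_full(y, mu=0.25, sigma=0.5)
reduced = loglik_suffstats(len(y), y.sum(), (y**2).sum(), mu=0.25, sigma=0.5)
```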

P.S. The biggest reason for using standard formatting conventions is that one of the most important aspects of code quality is readability.