Non-Centered Parameterization in a GP Model

Hello,

I have been reading through previous posts about non-centered parameterization for hierarchical parameters. Following Section 2.2 of the RStan manual, I have tried to apply the non-centered parameterization to the two random-effect standard deviation parameters (slope and intercept), but I am getting very different results and no speed-up in computation time. I am sure there are cases where the non-centering does not help, but I am surprised by the different results. Is there something obvious I am missing in the non-centering? Any insight would be much appreciated.

Thank you,
Colin

data{
  int<lower=1> L; // number of locations per year
  int<lower=1> T; // number of years, assume independence across years
  matrix [L,T] Y; // response vector
  matrix [L,T] X; // model matrix
  matrix [L,L] d; //distance matrix
}

parameters {
  vector[2] betafix; // fixed coefficients (b0, b1)
  vector[T] alpha_raw; // random intercept
  vector[T] beta_raw; // random slope
  real<lower=0> sigma; // spatial process sd
  real<lower=0> tau; // error sd 
  real<lower=0> phi; // range/decay parameter
  real<lower=0> sigma_alpha;  // sd for intercept random effect
  real<lower=0> sigma_beta;  // sd for slope random effect
}


transformed parameters{
  vector[T] alpha;
  vector[T] beta;
  real<lower=0> sigma2;
  real<lower=0> tau2;
  sigma2 = sigma^2;
  tau2 = tau^2;
  alpha = sigma_alpha*alpha_raw;
  beta = sigma_beta*beta_raw;
}

model {
  matrix[L,L] H; 
  
  for (i in 1:(L-1)) {
    for (j in (i+1):L) {
      H[i,j] = sigma2 * exp(-d[i,j]/phi);
      H[j,i] = H[i,j]; // do not calculate twice, utilize symmetry
    }
  }
  for (k in 1:L) {
    H[k, k] = sigma2 + tau2;
  }
  
  //likelihood
  for (i in 1:T){
    Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
  }
  
  //priors
  sigma ~ cauchy(0,1);
  tau ~ cauchy(0,1);
  sigma_alpha ~ cauchy(0,1);
  sigma_beta ~ cauchy(0,1);
  beta_raw ~ normal(0,1);
  alpha_raw ~ normal(0,1);
  phi ~ lognormal(150, 100);
}


Any chance you can post the 2nd model you’re testing here? Maybe the difference will be more obvious with both of them. Also if you wrap your code in 3 backticks (https://help.github.com/articles/creating-and-highlighting-code-blocks/) like this:

model {
  y ~ normal(0, 1);
}

That formats the code and makes it easier to read.

Hi bbbales2,

I didn’t realize the new forum does markdown. Very cool. Yes, here is the original code without the non-centered parameterization. It seems to run fine but was quite slow, hence the transformation trick. There aren’t a ton of parameters, so HMC might be like hitting the problem with a sledgehammer, but I enjoy using Stan, so I want to stick with it if possible.

data{
  int<lower=1> L; // number of locations per year
  int<lower=1> T; // number of years, assume independence across years
  matrix [L,T] Y; // response vector
  matrix [L,T] X; // covariates
  matrix [L,L] d; //distance matrix, same across all years
}

parameters {
  vector[2] betafix; // fixed coefficients (b0, b1)
  vector[T] alpha;
  vector[T] beta;
  real<lower=0> sigma; // spatial process sd
  real<lower=0> tau; // error sd 
  real<lower=0> phi; // range/decay parameter
  real<lower=0> sigma_alpha;  // sd for intercept random effect
  real<lower=0> sigma_beta;  // sd for slope random effect
}


transformed parameters{
  real<lower=0> sigma2;
  real<lower=0> tau2;
  sigma2 = sigma^2;
  tau2 = tau^2;
}

model {
  matrix[L,L] H; // same spatial parameters for each year
  
  for (i in 1:(L-1)) {
    for (j in (i+1):L) {
      H[i,j] = sigma2 * exp(-d[i,j]/phi);
      H[j,i] = H[i,j]; // do not calculate twice, utilize symmetry
    }
  }
  for (k in 1:L) {
    H[k, k] = sigma2 + tau2; 
  }
  
  //likelihood
  for (i in 1:T){
    Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
  }
  
  //year random effects for slope and intercept 
  for (i in 1:T){
    alpha[i] ~ normal(0, sigma_alpha);
    beta[i] ~ normal(0, sigma_beta);
  }
  
  //priors
  sigma ~ cauchy(0,1);
  tau ~ cauchy(0,1);
  sigma_alpha ~ cauchy(0,1);
  sigma_beta ~ cauchy(0,1);
  phi ~ lognormal(150, 100);
}

Cool beans. If it’s performance you’re worried about, it’s probably best to mess around with your options on the multi_normal first. Every time you have a multi_normal sampling statement, you end up needing to do a Cholesky factorization. Given that the Sigma isn’t changing in your loop here:

for (i in 1:T){
    Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
  }

You could probably make that faster either by switching to the multi_normal_cholesky functions like so:

matrix[L, L] C = cholesky_decompose(H);

for (i in 1:T){
    Y[,i] ~ multi_normal_cholesky(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], C);
  }

Or if you rework your matrix Y as a list of vectors and your means the same way, then you can vectorize the multi_normal and Stan will automatically share the work of the Cholesky (something like):

vector[L] Y[T];
vector[L] mus[T];
matrix[L, L] H;
  
Y ~ multi_normal(mus, H);
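
Fleshing that out a little (a rough sketch only; it assumes Y stays a matrix in the data block and gets copied into an array of vectors in transformed data, and y_arr is just an illustrative name):

transformed data {
  vector[L] y_arr[T];
  for (t in 1:T)
    y_arr[t] = Y[, t];   // one vector of observations per year
}
// parameters and transformed parameters blocks unchanged
model {
  vector[L] mus[T];
  matrix[L, L] H;
  // ... build H exactly as in the original model ...
  for (t in 1:T)
    mus[t] = betafix[1] + betafix[2] * X[, t] + alpha[t] + X[, t] * beta[t];
  y_arr ~ multi_normal(mus, H);   // one Cholesky of H shared across all T years
  // ... priors as before ...
}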

Also, is there any chance this is a mistake?

phi ~ lognormal(150, 100);

Without even plotting it, mu = 150 and sigma = 100 on the log scale seem like really big values. It might be worth playing around with tighter priors on the other parameters too if this model isn’t behaving nicely.
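
For example (just a guess at the intent here), if 150 was meant as a typical range in kilometers, a prior on the original scale would look more like:

phi ~ lognormal(log(150), 1);  // median ~150 km, central 95% roughly 20 to 1100 km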

Rob/Michael/Aki have a paper on priors for the length scale where you can look at the effect of different choices: https://drive.google.com/file/d/0B3WHb3BabixAYlptTVBWUGdyVEE/view .

Thanks for all the great suggestions. I will give those a shot. Yes, you are right about the prior, that shouldn’t be on the log scale. Thanks again,
Colin

You should use the built-in cov_exp_quad() function instead of the first for-loop in the model section; it does the same job but should be faster. You’ll still need the second for-loop to add tau to the diagonal.

Oh, and using cov_exp_quad will require that you not pre-compute the distance matrix. Counterintuitive as that is, it’ll still be faster.
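
Roughly, a sketch of what that could look like, with the coordinates passed in as data instead of the distance matrix (coords is an illustrative name; note that cov_exp_quad() implements the exponentiated-quadratic kernel sigma^2 * exp(-|x - x'|^2 / (2 * phi^2)), which is a different covariance function from the exponential exp(-d/phi) in the original loop):

data {
  int<lower=1> L;
  vector[2] coords[L];  // projected locations, e.g. in km, replacing the distance matrix d
  // ... other data as before ...
}
model {
  matrix[L, L] H = cov_exp_quad(coords, sigma, phi);
  for (k in 1:L)
    H[k, k] = H[k, k] + tau2;  // add the nugget so the diagonal is sigma2 + tau2
  // ... likelihood and priors as before ...
}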

Thanks, Mike. I was making a distance matrix in km using the latitude and longitude of the points. It looks like cov_exp_quad() can take either a matrix or two vectors of points. Maybe I can take the projected coordinates and pass each one as a vector to cov_exp_quad(). I’d like to keep the scale in km rather than lat/long for interpretation purposes.

Hmm, careful there. Back on the Google group mailing list you’ll find a query I made on doing GPs on the surface of a sphere, and I suggested doing exactly this (pre-computing the great-circle distances) and was told that this wasn’t sufficient to do it properly, but I never understood why.

Hi Mike,

Yes, you are right about issues with projection. There is a paper by Banerjee that shows that if the spatial domain is large this can introduce bias due to the spherical shape of the earth. However, for small spatial ranges I don’t think the scale of the points should make a difference.

It reminds me of Heisenberg’s uncertainty principle. In “real-world” problems, can the length-scale really be considered independent of the signal/noise ratio? In their paper they only vary the length-scale, which they can do because they used predefined, e.g. simulated, data.
Why not use a truncated multivariate normal distribution, as Ben Goodrich published already, and adjust the length-scale parameter within it, using the gamma distribution to make it t-distributed?
At the least, a few plots of the dependencies (correlations) among the parameters would be a must.

Andre

Why not use a truncated multivariate normal distribution, as Ben Goodrich published already

Got a link for this? I’m curious.

https://groups.google.com/d/msg/stan-users/GuWUJogum1o/LvxjlUBnBwAJ

Andre


This lognormal(150, 100); can be problematic because it’s not unit scaled (it’s on a scale of 150 on the unconstrained parameterization).

Does anyone ever use hierarchical models for the lengthscale phi? Surprisingly, it can be a lot faster when Stan can find the right priors than when it has to fight against inconsistent priors. Also, 100 is a huge scale in log space: when you exp() transform the normal(150, 100) back to the positive-constrained scale, it implies roughly exp(-46) to exp(346) as a 95% interval, which is ridiculously large. You may have meant log(100) there, but even that’s a rather fat prior.

For speed you want to vectorize,

Y ~ multi_normal(betafix[1] + betafix[2] * X + alpha + X * beta, H);

That invokes the solver for H only once.

You also want to vectorize the alpha and beta sampling statements, but the multi_normal will probably dominate.
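
That is, the loop over years in the random-effects block collapses to:

alpha ~ normal(0, sigma_alpha);
beta ~ normal(0, sigma_beta);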

I like to think of HMC more as a precision Japanese carpentry tool than a sledgehammer; your analogies may vary :-)

One more thing: it’d be more efficient to use X * (beta + betafix[2]) than betafix[2] * X + X * beta, since it only multiplies by X once instead of twice.
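
With the mean written per year (mus and the index t as in the sketch earlier), that would be something like:

mus[t] = betafix[1] + alpha[t] + X[, t] * (betafix[2] + beta[t]);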

Hi Bob,

Thanks a lot for your suggestions. The code is significantly faster now, and selecting more reasonable priors based on some of the literature mentioned above is producing much more stable results. One more quick question: The manual has a good example about how to sample from the joint distribution (y_new, y_obs) for new locations, x’s. However, is there a simple way to simulate new values of the observed y (y_obs) using the posterior draws from the model? Perhaps in the generated quantities block rather than augmenting the dependent variable vector Y with unknown values?

And yes, the carpentry tool is probably a more accurate analogy!


I think y_new is just the notation Andrew uses for generating new y. Here’s a simple example:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, 1);
  mu ~ normal(0, 10);
}

To generate new observations, you’d just add N_new to the data block and a generated quantities block:

data {
  int<lower=0> N;
  vector[N] y;
  int<lower=0> N_new;
}
parameters {
  real mu;
}
model {
  y ~ normal(mu, 1);
  mu ~ normal(0, 10);
}
generated quantities {
  vector[N_new] y_new;
  for (n in 1:N_new) y_new[n] = normal_rng(mu, 1);
}

Thanks again, Bob. Below is the code, if anyone else is interested.

generated quantities {
  vector[L] y_new[T];
  matrix[L, L] H_samp;
  vector[L] mus_samp[T];

  // mean for each year from this posterior draw
  // (X[i] here presumes X was reworked into an array of vectors, vector[L] X[T];
  //  note that, unlike the model-block mean earlier in the thread, this omits a
  //  random-slope term X[i] * beta[i])
  for (i in 1:T) {
    mus_samp[i] = (betafix[1] + alpha[i]) + X[i] * betafix[2];
  }

  // covariance from the same draw; rho here is a decay rate
  // (the earlier models wrote this as exp(-d[i,j] / phi))
  for (i in 1:(L - 1)) {
    for (j in (i + 1):L) {
      H_samp[i, j] = sigma2 * exp(-d[i, j] * rho);
      H_samp[j, i] = H_samp[i, j]; // symmetry
    }
  }
  for (k in 1:L) {
    H_samp[k, k] = sigma2 + tau2;
  }

  // posterior predictive draws for the observed locations
  for (i in 1:T) {
    y_new[i] = multi_normal_rng(mus_samp[i], H_samp);
  }
}