I have been reading through previous posts about non-centered parameterization for hierarchical parameters. Following Section 2.2 of the Rstan manual, I have tried the non-centered parameterization for two random-effect standard deviation parameters (slope and intercept), but I am getting very different results and no speed-up in computation time. I am sure there are cases where non-centering does not help, but I am surprised by the different results. Is there something obvious I am missing in the non-centering? Any insight would be much appreciated.
Thank you,
Colin
data{
int<lower=1> L; // number of locations per year
int<lower=1> T; // number of years, assume independence across years
matrix [L,T] Y; // response matrix (one column per year)
matrix [L,T] X; // model matrix
matrix [L,L] d; //distance matrix
}
parameters {
vector[2] betafix; // fixed coefficients (b0, b1)
vector[T] alpha_raw; // random intercept
vector[T] beta_raw; // random slope
real<lower=0> sigma; // spatial process sd
real<lower=0> tau; // error sd
real<lower=0> phi; // range/decay parameter
real<lower=0> sigma_alpha; // sd for intercept random effect
real<lower=0> sigma_beta; // sd for slope random effect
}
transformed parameters{
vector[T] alpha;
vector[T] beta;
real<lower=0> sigma2;
real<lower=0> tau2;
sigma2 = sigma^2;
tau2 = tau^2;
alpha = sigma_alpha*alpha_raw;
beta = sigma_beta*beta_raw;
}
model {
matrix[L,L] H;
for (i in 1:(L-1)) {
for (j in (i+1):L) {
H[i,j] = sigma2 * exp(-d[i,j]/phi);
H[j,i] = H[i,j]; // do not calculate twice, utilize symmetry
}
}
for (k in 1:L) {
H[k, k] = sigma2 + tau2;
}
//likelihood
for (i in 1:T){
Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
}
//priors
sigma ~ cauchy(0,1);
tau ~ cauchy(0,1);
sigma_alpha ~ cauchy(0,1);
sigma_beta ~ cauchy(0,1);
beta_raw ~ normal(0,1);
alpha_raw ~ normal(0,1);
phi ~ lognormal(150, 100);
}
I didn't realize the new forum does markdown. Very cool. Yes, here is the original code without the non-centered parameterization. It seems to run fine but was quite slow, hence the transformation trick. There are not a ton of parameters, so HMC might be like hitting the problem with a sledgehammer, but I enjoy using Stan, so I want to stick with it if possible.
data{
int<lower=1> L; // number of locations per year
int<lower=1> T; // number of years, assume independence across years
matrix [L,T] Y; // response matrix (one column per year)
matrix [L,T] X; // covariates
matrix [L,L] d; //distance matrix, same across all years
}
parameters {
vector[2] betafix; // fixed coefficients (b0, b1)
vector[T] alpha;
vector[T] beta;
real<lower=0> sigma; // spatial process sd
real<lower=0> tau; // error sd
real<lower=0> phi; // range/decay parameter
real<lower=0> sigma_alpha; // sd for intercept random effect
real<lower=0> sigma_beta; // sd for slope random effect
}
transformed parameters{
real<lower=0> sigma2;
real<lower=0> tau2;
sigma2 = sigma^2;
tau2 = tau^2;
}
model {
matrix[L,L] H; // same spatial parameters for each year
for (i in 1:(L-1)) {
for (j in (i+1):L) {
H[i,j] = sigma2 * exp(-d[i,j]/phi);
H[j,i] = H[i,j]; // do not calculate twice, utilize symmetry
}
}
for (k in 1:L) {
H[k, k] = sigma2 + tau2;
}
//likelihood
for (i in 1:T){
Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
}
//year random effects for slope and intercept
for (i in 1:T){
alpha[i] ~ normal(0, sigma_alpha);
beta[i] ~ normal(0, sigma_beta);
}
//priors
sigma ~ cauchy(0,1);
tau ~ cauchy(0,1);
sigma_alpha ~ cauchy(0,1);
sigma_beta ~ cauchy(0,1);
phi ~ lognormal(150, 100);
}
Cool beans. If it's performance you're worried about, it's probably best to mess around with your options on the multi_normal first. Every multi_normal sampling statement ends up needing a Cholesky factorization, and Sigma isn't changing in your loop here:
for (i in 1:T){
Y[,i] ~ multi_normal(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], H);
}
You could probably make that faster either by switching to the multi_normal_cholesky functions like so:
matrix[L, L] C = cholesky_decompose(H);
for (i in 1:T){
Y[,i] ~ multi_normal_cholesky(betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i], C);
}
Or, if you rework your matrix Y as an array of vectors and your means the same way, you can vectorize the multi_normal and Stan will automatically share the work of the Cholesky (something like this):
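An untested sketch, assuming Y is redeclared in the data block as an array of vectors, vector[L] Y[T], and the means are collected into a local array mu:
vector[L] mu[T]; // declared with the other locals in the model block
for (i in 1:T)
mu[i] = betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i];
Y ~ multi_normal(mu, H); // one vectorized statement, so the Cholesky of H is shared across years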
Without plotting it, mu = 150 and sigma = 100 on the log scale seem like really, really big values. It might also be worth playing around with tighter priors on the other parameters if this model isn't behaving nicely.
Thanks for all the great suggestions. I will give those a shot. Yes, you are right about the prior; those values shouldn't be on the log scale. Thanks again,
Colin
You should use the built-in cov_exp_quad() function instead of the first for-loop in the model section; it does the same thing but should be faster. You'll still need the second for-loop adding the error variance (tau2) to the diagonal.
Thanks, Mike. I was making a distance matrix in km using the latitude and longitude of the points. It looks like cov_exp_quad() takes the points themselves (as one or two arrays) rather than a precomputed distance matrix, so maybe I can pass the projected coordinates in as an array of vectors. I'd like to keep the scale in km rather than lat/long for interpretation purposes.
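Something like this, I guess (untested), assuming the projected coordinates in km come in as data declared as vector[2] coords[L], which isn't in my current data block, and keeping in mind that cov_exp_quad() uses the squared-exponential kernel rather than the exp(-d/phi) form I had:
matrix[L,L] H;
H = cov_exp_quad(coords, sigma, phi); // sigma enters as the marginal sd, phi as the length-scale
for (k in 1:L)
H[k, k] = H[k, k] + tau2; // still need the error variance on the diagonal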
Hmm, careful there. Back on the Google group mailing list you’ll find a query I made on doing GPs on the surface of a sphere, and I suggested doing exactly this (pre-computing the great-circle distances) and was told that this wasn’t sufficient to do it properly, but I never understood why.
Yes, you are right about issues with projection. There is a paper by Banerjee that shows that if the spatial domain is large this can introduce bias due to the spherical shape of the earth. However, for small spatial ranges I don’t think the scale of the points should make a difference.
It reminds me of Heisenberg's uncertainty principle: in "real-world" problems, can the length-scale really be treated as independent of the signal-to-noise ratio? In their paper they only vary the length-scale, which they can do because they used predefined, e.g. simulated, data.
Why not use a truncated multivariate normal distribution, as Ben Goodrich has already published, and adjust the length-scale parameter within it, using a gamma distribution to make it t-distributed? At the very least, a few plots of the dependencies (correlations) among the parameters are a must.
This can be problematic: lognormal(150, 100) is not unit scaled (it has location 150 and scale 100 on the unconstrained parameterization).
Does anyone ever use hierarchical models for the lengthscales phi? Surprisingly, sampling can be a lot faster when Stan is given sensible priors than when it has to fight against inconsistent ones. Also, 100 is a huge scale in log space: when you exp() transform the normal(150, 100) back to the positive-constrained scale, it implies roughly exp(-50) to exp(350) as a 95% interval, which is ridiculously large. You may have meant log(100) there, but even that's a rather fat prior.
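Just as an illustration (the numbers here are made up; pick the location from what you actually expect for the range in km), a unit-scaled prior on the log scale would look something like
phi ~ lognormal(log(100), 1); // roughly 95% of the prior mass between about 14 km and 710 km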
For speed you want to vectorize. With Y and the means stored as arrays of vectors (as in the sketch above),
Y ~ multi_normal(mu, H);
invokes the solver for H only once.
You also want to vectorize the alpha and beta sampling statements, but the multi_normal will probably dominate.
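In the centered version, that just means replacing the year loop with the vectorized statements:
alpha ~ normal(0, sigma_alpha);
beta ~ normal(0, sigma_beta);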
I like to think of HMC more as a precision Japanese carpentry tool than a sledgehammer; your analogies may vary :-)
Thanks a lot for your suggestions. The code is significantly faster now, and selecting more reasonable priors based on some of the literature mentioned above is producing much more stable results. One more quick question: the manual has a good example of how to sample from the joint distribution (y_new, y_obs) at new locations x. However, is there a simple way to simulate new values of the observed y (y_obs) using the posterior draws from the model? Perhaps in the generated quantities block, rather than augmenting the dependent variable vector Y with unknown values?
And yes, the carpentry tool is probably a more accurate analogy!
Thanks again, Bob. Below is the code if anyone else is interested.
generated quantities {
vector [L] y_new [T];
matrix[L,L] H_samp;
vector[L] mus_samp[T];
for (i in 1:T){
mus_samp[i] = betafix[1] + betafix[2]*X[,i] + alpha[i] + X[,i]*beta[i]; // same mean as in the likelihood
}
for (i in 1:(L-1)) {
for (j in (i+1):L) {
H_samp[i,j] = sigma2 * exp(-d[i,j]/phi); // same covariance as in the model block
H_samp[j,i] = H_samp[i,j];
}
}
for (k in 1:L) {
H_samp[k, k] = sigma2 + tau2;
}
for (i in 1:T){
y_new[i] = multi_normal_rng(mus_samp[i], H_samp);
}
}
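One more note, following the earlier multi_normal_cholesky suggestion: since H_samp is the same for every year within a draw, I suppose you could also declare matrix[L,L] C_samp; alongside the other declarations and replace the last loop with something like
C_samp = cholesky_decompose(H_samp);
for (i in 1:T)
y_new[i] = multi_normal_cholesky_rng(mus_samp[i], C_samp);
so the factorization happens once per draw rather than once per year.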