Disclaimer: I’m new to Stan and have a low level of math training. I followed the examples here to experiment with an application that would be useful for my research and to learn by applying.
I would like to create a general model to estimate multiple \theta_k in a Gaussian process regression context with an anisotropic, squared exponential kernel of the form
K_{\theta}(x, x') = \alpha^2 \exp\{-\sum^p_{k=1}(x_k - x'_k)^2/\theta_k^2\}
I basically want to know the ideal shape of the kernel in multi-dimensional space.
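To make the kernel concrete, here is a minimal NumPy sketch of the same covariance function (the name `aniso_se_kernel` is made up for illustration; the parameterization follows the Stan code below, where the per-dimension difference is divided by \theta_k before squaring):

```python
import numpy as np

def aniso_se_kernel(X, alpha, theta):
    """Anisotropic squared-exponential kernel.

    X: (N, p) array of inputs; theta: (p,) per-dimension length-scales.
    K[i, j] = alpha^2 * exp(-sum_k ((X[i, k] - X[j, k]) / theta[k])^2)
    """
    # scale each dimension by its own length-scale, then take
    # squared Euclidean distances between all pairs of rows
    Z = X / theta
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return alpha**2 * np.exp(-sq_dists)

# small example: 3 points in 2 dimensions
X = np.array([[1.0, 1.0], [1.0, 2.0], [3.0, 1.0]])
K = aniso_se_kernel(X, alpha=1.0, theta=np.array([1.0, 2.0]))
```

A large \theta_k means the covariance decays slowly along dimension k (the function is smooth in that direction); a small \theta_k means it decays quickly.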
To do so, I modified the ARD example in the Stan User's Guide and got the following implementation:
functions {
  matrix cov(vector[] x,
             real alpha,
             vector theta) {
    // prepare variables
    int N = size(x);
    matrix[N, N] K;
    matrix[N, N] K_cholesky;
    real sq_alpha = square(alpha);
    // fill the upper triangle and mirror it; the diagonal is alpha^2
    for (i in 1:(N - 1)) {
      K[i, i] = sq_alpha;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha * exp(-dot_self((x[i] - x[j]) ./ theta));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha;
    // apply Cholesky decomposition to the covariance matrix
    K_cholesky = cholesky_decompose(K);
    return K_cholesky;
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];
  vector[N] y;
}
parameters {
  real<lower=0> alpha;
  vector<lower=0>[D] theta;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K = cov(x, alpha, theta);
    f = L_K * eta;
  }
  theta ~ uniform(0, 3);
  sigma ~ std_normal();
  eta ~ std_normal();
  y ~ normal(f, sigma);
}
When I run the following test code with RStan, I get results that generally make sense to me.
ind <- expand.grid(
  x1 = 1:3,
  x2 = 1:3,
  x3 = 1:3
)
dep <- c(rep(-100, 9), rep(0, 9), rep(100, 9))
fit <- rstan::stan(
  file = "code/bayesian_inference/gpr.stan",
  data = list(
    D = 3,
    x = ind,
    N = length(dep),
    y = dep
  ),
  chains = 1,
  cores = 1
)
\theta_1 and \theta_2 are consistently larger than \theta_3, which is in my opinion expected with this test data, because the differences along the x3-axis are much bigger than those along x1 and x2.
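To see why, note that expand.grid varies x1 fastest and x3 slowest, so dep is constant within each x3-slice and all of the variation in y happens along x3. A quick Python translation of the R snippet makes this explicit (this reproduces only the test data, not the RStan fit):

```python
import numpy as np

# reproduce the R grid: expand.grid varies x1 fastest, x3 slowest
ind = np.array([(x1, x2, x3) for x3 in (1, 2, 3)
                             for x2 in (1, 2, 3)
                             for x1 in (1, 2, 3)])
dep = np.repeat([-100.0, 0.0, 100.0], 9)

# y is constant within each x3-slice: the only structure is along x3
for x3 in (1, 2, 3):
    slice_y = dep[ind[:, 2] == x3]
    assert slice_y.min() == slice_y.max()
```

A function that changes sharply along one axis needs a short length-scale there, so a smaller \theta_3 relative to \theta_1 and \theta_2 is the expected outcome.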
I now have a series of very basic questions and would appreciate your advice:
- Is this the right approach to estimate the parameters I'm interested in?
- I get a very high proportion of divergent transitions. Could a badly specified model be the reason?
- I adopted \alpha from the ARD example and kept it to avoid an error in the Cholesky decomposition, where the matrix tends not to be positive definite. I would actually prefer to drop \alpha to keep the model simpler. Is there a way to achieve this?