I implemented a minimal version of ADVI in TensorFlow (I’m aware of Edward but I needed something a bit more low-level):
http://www.jmlr.org/papers/volume18/16-107/16-107.pdf
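To frame the question, the objective I'm maximizing is (I believe, assuming the minimal version is the mean-field one from that paper) the ELBO over a diagonal Gaussian in the unconstrained space,

$$
\mathcal{L}(\mu, \sigma) \;=\; \mathbb{E}_{\zeta \sim \mathcal{N}(\mu,\, \mathrm{diag}(\sigma^2))}\!\left[\log p\big(y, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big|\right] \;+\; \sum_k \log \sigma_k \;+\; \text{const},
$$

where T maps the constrained parameters to the reals, estimated with reparameterized Monte Carlo gradients.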
The implementation passes basic tests, but I wanted to check it on a more involved model by comparing my inference results against Stan's ADVI and NUTS results.
I considered the infinite factor model (IFM) of Bhattacharya and Dunson:
http://people.ee.duke.edu/~lcarin/BhattacharyaDunson.pdf
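For reference, my reading of the generative model in that paper, in the notation of the code below (theta_h plays the role of the paper's tau_h, with nu = 3), is:

$$
\begin{aligned}
y_n &\sim \mathcal{N}_P\big(0,\; \Lambda\Lambda^\top + \Sigma\big), \qquad \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_P^2),\\
\lambda_{ph} \mid \phi_{ph}, \theta_h &\sim \mathcal{N}\big(0,\; \phi_{ph}^{-1}\theta_h^{-1}\big) \quad \text{(so } \phi_{ph}\theta_h \text{ is a precision)},\\
\phi_{ph} &\sim \mathrm{Gamma}(\nu/2,\; \nu/2), \qquad \theta_h = \prod_{l=1}^{h} \delta_l,\\
\delta_1 &\sim \mathrm{Gamma}(a_1, 1), \qquad \delta_h \sim \mathrm{Gamma}(a_2, 1) \;\; (h \ge 2), \qquad \sigma_p^{-2} \sim \mathrm{Gamma}(a_\sigma, b_\sigma).
\end{aligned}
$$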
The results from Stan are off: for instance, the covariance matrix omega is not recovered in a 10-dimensional case (I generate the data myself). Since I am a complete beginner in Stan programming, I blame this entirely on some implementation mistake on my part. Further evidence for this is that Stan's ADVI and NUTS results show some level of agreement with each other.
Here is my implementation:
data {
  int<lower=0> N;               // number of observations
  int<lower=0> P;               // observed dimension
  int<lower=0> H;               // truncation level (number of factors)
  vector[P] y[N];
}
parameters {
  real<lower=0> a1;             // hyperparameters of the multiplicative gamma process
  real<lower=0> a2;
  vector<lower=0>[H] delta;     // gamma increments
  matrix<lower=0>[P, H] phi;    // local precisions of the loadings
  matrix[P, H] lam;             // factor loadings
  vector<lower=0>[P] s2i;       // idiosyncratic precisions
}
transformed parameters {
  vector[H] theta;              // cumulative products of delta (the paper's tau)
  vector[P] s2;                 // idiosyncratic variances
  matrix[P, P] omega;
  theta[1] = delta[1];
  for (h in 2:H)
    theta[h] = theta[h-1] * delta[h];
  s2 = 1.0 ./ s2i;
  omega = diag_matrix(s2) + lam * lam';   // marginal covariance of y
}
model {
  vector[P] zero;               // local variable, filled by hand below
  for (p in 1:P)
    zero[p] = 0.0;
  a1 ~ gamma(2.0, 1.0);
  a2 ~ gamma(2.0, 1.0);
  delta[1] ~ gamma(a1, 1.0);
  for (h in 2:H)
    delta[h] ~ gamma(a2, 1.0);
  for (p in 1:P)
    for (h in 1:H)
      phi[p, h] ~ gamma(3.0 / 2.0, 3.0 / 2.0);   // nu/2 with nu = 3
  for (p in 1:P)
    for (h in 1:H)
      lam[p, h] ~ normal(0.0, 1.0 / phi[p, h] / theta[h]);   // shrinkage prior on the loadings
  for (p in 1:P)
    s2i[p] ~ gamma(1.0, 0.3);
  for (n in 1:N)
    y[n] ~ multi_normal(zero, omega);
}
Are there any obvious issues, aside from the missed vectorization opportunities in the distribution statements?
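For reference, the vectorized version I have in mind is roughly the following (untested sketch; same distributions and scales, just grouped):

model {
  a1 ~ gamma(2.0, 1.0);
  a2 ~ gamma(2.0, 1.0);
  delta[1] ~ gamma(a1, 1.0);
  delta[2:H] ~ gamma(a2, 1.0);          // vectorized over h = 2..H
  to_vector(phi) ~ gamma(1.5, 1.5);     // vectorized over all P*H entries
  for (h in 1:H)
    col(lam, h) ~ normal(0.0, inv(col(phi, h) * theta[h]));  // one column at a time
  s2i ~ gamma(1.0, 0.3);
  y ~ multi_normal(rep_vector(0.0, P), omega);  // also avoids the hand-built zero vector
}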
I'm also aware that the two-level hierarchical structure might not help inference, but I consistently recover omega with my own ADVI implementation, and the two algorithms should be very close, so I must be doing something silly here…
As an aside, is it expected for zero to be initialized to nan?