This is an attempt to improve my model after suggestions from @Bob_Carpenter, but I hope this post is self-contained.
I’ve made a series of simple models and I’m generating synthetic data in Stan.
Multilogit
I took the example from the multilogit documentation and wrote a generator to match. Based on the results, the prior seems sufficient to constrain the estimates, so I don't need to use the K-1 trick for identifiability.
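The identifiability issue the K-1 trick addresses comes from softmax being invariant to adding the same constant to every logit. A minimal numpy sketch (the softmax helper is my own, not part of the Stan code):

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector of logits.
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
v = rng.normal(size=4)

# Shifting every logit by the same constant leaves the probabilities
# unchanged, so a full D x K coefficient matrix is only identified up to
# a per-row shift; pinning one column to zero removes that freedom.
assert np.allclose(softmax(v), softmax(v + 3.7))
```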
Generator:
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
}
generated quantities {
  matrix[N, D] x;
  for (n in 1:N) {
    for (d in 1:D) {
      x[n, d] = normal_rng(0, 1);
    }
  }
  matrix[D, K] beta;
  for (d in 1:D) {
    for (k in 1:K) {
      beta[d, k] = normal_rng(0, 5);
    }
  }
  matrix[N, K] x_beta = x * beta;
  array[N] int<lower=1, upper=K> y;
  for (n in 1:N) {
    y[n] = categorical_rng(softmax(x_beta[n]'));
  }
}
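As a sanity check outside Stan, the generator can be mirrored in numpy. This is a sketch: generate and softmax are my own helper names, the seed handling is an assumption, and Stan's categorical_rng is 1-based, hence the + 1.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def generate(K, N, D, seed=0):
    """Mirror of the Stan generator: standard-normal predictors,
    normal(0, 5) coefficients, categorical outcomes."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, size=(N, D))
    beta = rng.normal(0, 5, size=(D, K))
    x_beta = x @ beta
    # Draw each outcome from the softmax probabilities; add 1 to match
    # Stan's 1-based categorical_rng.
    y = np.array([rng.choice(K, p=softmax(row)) + 1 for row in x_beta])
    return x, beta, y

x, beta, y = generate(K=3, N=100, D=2)
assert y.min() >= 1 and y.max() <= 3
```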
Inference:
// https://mc-stan.org/docs/2_28/stan-users-guide/multi-logit.html
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  matrix[N, D] x;
}
parameters {
  matrix[D, K] beta;
}
model {
  matrix[N, K] x_beta = x * beta;
  to_vector(beta) ~ normal(0, 5);
  for (n in 1:N) {
    y[n] ~ categorical(softmax(x_beta[n]'));
  }
}
So it seems the prior works well in this case.
Adding another variable
Next I add an unobserved variable, which I expect to make the problem much more difficult.
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
}
generated quantities {
  matrix[N, D] x;
  matrix[N, D] u;
  for (n in 1:N) {
    for (d in 1:D) {
      x[n, d] = normal_rng(0, 1);
      u[n, d] = normal_rng(0, 1);
    }
  }
  matrix[D, K] beta;
  matrix[D, K] gamma;
  for (d in 1:D) {
    for (k in 1:K) {
      beta[d, k] = normal_rng(0, 5);
      gamma[d, k] = normal_rng(0, 5);
    }
  }
  matrix[N, K] x_beta = x * beta;
  matrix[N, K] u_gamma = u * gamma;
  array[N] int<lower=1, upper=K> y;
  for (n in 1:N) {
    y[n] = categorical_rng(softmax(x_beta[n]' + u_gamma[n]'));
  }
}
This produces 2×3 matrices for beta and gamma; beta => gamma pairs might look like:

 3.06798 =>  2.66563    3.33377 => -1.79673    1.62864 => -1.48669
 3.41788 => 0.824499   -8.223   =>  3.01853   -4.39284 =>  2.05907
I tried sampling with and without the K-1 trick described in the multi-logit identifiability section. In the graphs below, the colors identify the chains; I added jitter to make the draws more visible. The black dot is the true value of each parameter as generated by Stan.
With K-1 trick:
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  matrix[N, D] x;
}
transformed data {
  vector[D] zeros = rep_vector(0, D);
}
parameters {
  matrix[D, K - 1] beta_raw;
  matrix[D, K - 1] gamma_raw;
  matrix[N, D] u;
}
transformed parameters {
  matrix[D, K] beta = append_col(beta_raw, zeros);
  matrix[D, K] gamma = append_col(gamma_raw, zeros);
}
model {
  matrix[N, K] x_beta = x * beta;
  matrix[N, K] u_gamma = u * gamma;
  to_vector(u) ~ normal(0, 1);
  to_vector(beta) ~ normal(0, 5);
  to_vector(gamma) ~ normal(0, 5);
  for (n in 1:N) {
    y[n] ~ categorical(softmax(x_beta[n]' + u_gamma[n]'));
  }
}
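The append_col construction is equivalent to subtracting the last column of a full coefficient matrix from every column, which changes no probability. A numpy sketch of that equivalence (softmax_rows is my own helper):

```python
import numpy as np

def softmax_rows(m):
    # Row-wise stable softmax for an N x K matrix of logits.
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
D, K, N = 2, 3, 4
beta = rng.normal(0, 5, size=(D, K))
x = rng.normal(size=(N, D))

# Subtracting the last column from every column pins it to zero without
# changing any category probability -- the reparameterization behind the
# K-1 trick.
beta_pinned = beta - beta[:, [K - 1]]
assert np.allclose(beta_pinned[:, -1], 0)
assert np.allclose(softmax_rows(x @ beta), softmax_rows(x @ beta_pinned))
```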
Without K-1 trick:
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  matrix[N, D] x;
}
parameters {
  matrix[D, K] beta;
  matrix[D, K] gamma;
  matrix[N, D] u;
}
model {
  matrix[N, K] x_beta = x * beta;
  matrix[N, K] u_gamma = u * gamma;
  to_vector(u) ~ normal(0, 1);
  to_vector(beta) ~ normal(0, 5);
  to_vector(gamma) ~ normal(0, 5);
  for (n in 1:N) {
    y[n] ~ categorical(softmax(x_beta[n]' + u_gamma[n]'));
  }
}
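One plausible source of trouble in this model: the likelihood sees u and gamma only through the product u * gamma, which is unchanged if both factors flip sign (or, more generally, if u is multiplied by any invertible D×D matrix and gamma by its inverse). The posterior therefore has symmetric modes that different chains can settle into. A small numpy demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 5, 2, 3
u = rng.normal(size=(N, D))
gamma = rng.normal(0, 5, size=(D, K))

# Flipping the sign of both factors leaves the product -- and hence every
# category probability -- unchanged.
assert np.allclose(u @ gamma, (-u) @ (-gamma))

# More generally, any invertible D x D reparameterization Q does the same
# (a random normal matrix is invertible with probability one).
Q = rng.normal(size=(D, D))
assert np.allclose(u @ gamma, (u @ Q) @ (np.linalg.inv(Q) @ gamma))
```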
In the sans-trick model, the posterior distributions look pretty good, but it is visually apparent that one of the chains (#1) differs from the others, and the per-chain means confirm the difference.
2×3×4 Array{Float64, 3}:
[:, :, 1] =
-2.59055 1.61997 0.706378
3.61502 1.82376 -5.2643
[:, :, 2] =
5.16026 0.0104177 -5.63252
-0.674936 -3.08028 3.58979
[:, :, 3] =
5.16026 0.0104177 -5.63252
-0.674936 -3.08028 3.58979
[:, :, 4] =
5.16026 0.0104177 -5.63252
-0.674936 -3.08028 3.58979
From what I remember, it's bad for the chains to be so different, and the R_hat diagnostic is intended to report such issues. At a glance, stansummary looks clean for lp__ and the beta parameters (R_hat ≈ 1.0), though the gamma rows do show low N_Eff and R_hat as high as 1.5:
% stansummary /tmp/jl_A5VVFT/train*.csv
Inference for Stan model: train_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
Warmup took (72, 72, 72, 72) seconds, 4.8 minutes total
Sampling took (74, 74, 74, 74) seconds, 4.9 minutes total
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -1.2e+03 5.0 64 -1310 -1.2e+03 -1108 162 5.5e-01 1.0
accept_stat__ 0.90 3.8e-03 1.1e-01 0.68 0.93 1.00 8.2e+02 2.8e+00 1.0e+00
stepsize__ 0.027 1.7e-03 2.4e-03 0.026 0.026 0.032 2.0e+00 6.8e-03 2.3e+13
treedepth__ 7.0 nan 1.2e-13 7.0 7.0 7.0 nan nan nan
n_leapfrog__ 127 nan 1.8e-12 127 127 127 nan nan nan
divergent__ 0.00 nan 0.0e+00 0.00 0.00 0.00 nan nan nan
energy__ 2199 5.3e+00 7.1e+01 2100 2192 2323 1.8e+02 6.0e-01 1.0e+00
beta[1,1] 3.9e+00 0.084 3.1 -1.1 3.8e+00 9.0 1365 4.6e+00 1.0
beta[1,2] 2.7e+00 0.077 3.0 -2.1 2.7e+00 7.8 1563 5.3e+00 1.0
beta[1,3] -6.5e+00 0.11 3.4 -12 -6.5e+00 -0.88 920 3.1e+00 1.0
beta[2,1] 2.7e+00 0.091 3.1 -2.4 2.5e+00 7.8 1170 4.0e+00 1.0
beta[2,2] -2.3e+00 0.092 3.0 -7.4 -2.3e+00 2.6 1059 3.6e+00 1.00
beta[2,3] -1.1e+00 0.088 3.0 -6.1 -1.1e+00 3.8 1174 4.0e+00 1.00
gamma[1,1] 3.2e+00 2.5 4.9 -5.6 3.8e+00 10 3.8 1.3e-02 1.5
gamma[1,2] 4.1e-01 0.36 3.4 -5.3 4.5e-01 6.0 88 3.0e-01 1.0
gamma[1,3] -4.0e+00 2.0 4.7 -11 -4.4e+00 4.8 5.3 1.8e-02 1.4
gamma[2,1] 4.0e-01 1.4 4.4 -6.6 1.8e-01 8.0 10 3.5e-02 1.2
gamma[2,2] -1.9e+00 1.6 3.8 -7.9 -1.9e+00 4.6 5.7 1.9e-02 1.2
gamma[2,3] 1.4e+00 2.9 5.8 -8.9 2.0e+00 10 4.1 1.4e-02 1.4
// u[...] not shown
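R_hat is computed per parameter from the between- and within-chain variances, so a chain sitting at a different mode inflates it only for the affected parameters, which matches the gamma rows above. A minimal split-R̂ sketch (not CmdStan's exact implementation; newer versions also rank-normalize the draws first):

```python
import numpy as np

def split_rhat(draws):
    """Split-R-hat for one parameter; draws has shape (n_chains, n_draws).
    Each chain is split in half so within-chain drift also inflates R-hat."""
    n_chains, n = draws.shape
    halves = draws.reshape(n_chains * 2, n // 2)
    m = halves.shape[1]
    chain_means = halves.mean(axis=1)
    chain_vars = halves.var(axis=1, ddof=1)
    W = chain_vars.mean()               # within-chain variance
    B = m * chain_means.var(ddof=1)     # between-chain variance
    var_plus = (m - 1) / m * W + B / m  # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 1000))        # all chains agree
bad = good + np.array([[5.0], [0], [0], [0]])  # one chain stuck elsewhere
print(split_rhat(good), split_rhat(bad))       # ~1.0 vs. well above 1
```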
Questions
- Is there any benefit to the K-1 technique if I'm already using a prior? Based on these results, it doesn't seem to help here.
- Is it bad for the chains to differ visually and in their means? Why are the chains so different, and how can I fix it?