Hi,
I’m looking for advice on how to specify an ANOVA model, where I can estimate the finite population variances of batches of coefficients. Suppose I have two factors, X and Z, each with four levels. My general strategy is to make three design matrices X_d (the main effects of X), Z_d (the main effects of Z), and I_d (the interactions between X and Z). I generally use ‘effects’ or ‘dummy’ coding to generate these matrices and keep the model identifiable (although it appears not to matter which I use for estimates of finite population variances for coefficient batches). I then fit the following model:
anova_model = """
data{
int<lower=1> N;
int<lower=1> K1;
int<lower=1> K2;
int<lower=1> K3;
matrix[N, K1] X_d;
matrix[N, K2] Z_d;
matrix[N, K3] I_d;
vector[N] y;
}
parameters{
vector[K1] B_x;
vector[K2] B_z;
vector[K3] B_i;
real<lower=0> sd_y;
real<lower=0> sd_x;
real<lower=0> sd_z;
real<lower=0> sd_i;
real mu;
}
transformed parameters{
vector[N] yhat;
yhat = mu + X_d*B_x + Z_d*B_z + I_d*B_i ;
}
model{
y ~ normal(yhat, sd_y) ;
B_x ~ normal(0,sd_x);
B_z ~ normal(0,sd_z);
B_i ~ normal(0,sd_i);
}
generated quantities{
vector[N] e_y;
real s_e;
real s_x;
real s_z;
real s_i;
e_y = y - yhat;
s_e = sd(e_y);
s_x = sd(B_x);
s_z = sd(B_z);
s_i = sd(B_i);
}
"""
No problems there, everything typically works as expected. The model converges, the results make sense, etc.
The issue I run into is when one of the factors only has 2 levels, 1 df. For example, if Z only has two levels, using effects or dummy coding results in an N x 1 design matrix that has no finite population variance. I’ve been reading Gelman (2005), Hill and Gelman, and other references. Both sources can do this kind of ANOVA when a factor only has one level (i.e. the latin square example), and both sources also use unconstrained coefficients. I’ve tried making three new design matrices: X_d2, Z_d2, I_d2, which have no intercept and are fully specified (i.e. X_d2 has four columns, one per level, Z_d2 has two columns, one per level, and I_d2 has 8 columns for the interactions). But the model performs poorly. So I modified the code to constrain the effects:
anova_model = """
data{
int<lower=1> N;
int<lower=1> K1;
int<lower=1> K2;
int<lower=1> K3;
matrix[N, K1] X_d2;
matrix[N, K2] Z_d2;
matrix[N, K3] I_d2;
vector[N] y;
}
parameters{
vector[K1] B_x;
vector[K2] B_z;
vector[K3] B_i;
real<lower=0> sd_y;
real<lower=0> sd_x;
real<lower=0> sd_z;
real<lower=0> sd_i;
real mu;
}
transformed parameters{
vector[N] yhat;
vector[K1] B_x_adj;
vector[K2] B_z_adj;
vector[K3] B_i_adj;
real B0;
B_x_adj = B_x - mean(B_x) ;
B_z_adj = B_z - mean(B_z);
B_i_adj = B_i - mean(B_i);
B0 = mu + mean(B_x) + mean(B_z) + mean(B_i);
yhat = B0 + X_d2*B_x_adj + Z_d2*B_z_adj + I_d2*B_i_adj ;
}
model{
y ~ normal(yhat, sd_y) ;
B_z ~ normal(0,sd_z);
B_x ~ normal(0,sd_x);
B_i ~ normal(0,sd_i);
mu ~ normal(0,1); # the response has been standardized, so this prior makes sense
}
generated quantities{
vector[N] e_y;
real s_e;
real s_x;
real s_z;
real s_i;
e_y = y - yhat;
s_e = sd(e_y);
s_x = sd(B_x_adj);
s_z = sd(B_z_adj);
s_i = sd(B_i_adj);
}
"""
This model converges much more rapidly and gives results that make sense. I guess I’m wondering if this is appropriate. Should I be using the constrained parameters both for estimation (i.e. yhat) and for finite population variances (s_x, s_z, s_i)? The finite population variances are the same either way, because s_x = sd(B_x_adj) = sd(B_x - mean(B_x)). But I’m not clear if this is the same model referenced in Gelman (2005), and how that model was able to use unconstrained coefficients. I hope I’m on the right track.