Dear community,
I am working on a model of failure rates for a manufacturing process, and I was hoping for input regarding the model choice and the Stan implementation.
I want to model the (binary) failure probability of production orders. Each production order belongs to a category, and each category belongs to a supercategory, so the data has a nice two-level hierarchical structure (i.e. each production order can be assigned a node on a tree of depth 2, where the top nodes are the “supercategories” and the intermediate nodes are the “categories”).
Additionally, I have a large number of binary labels that can be assigned to every production order (e.g. “machine X was used”), and I want to investigate whether these are important drivers of the risk, while controlling for the risk variation across the above-mentioned categories.
The risk seems to vary considerably across the categories, and the number of production orders per category also varies widely (from a handful to thousands). It therefore seems reasonable to fit a hierarchical logistic model, to induce some partial pooling across categories.
I consider the following model:
\text{logit}(p_i) = \alpha_{j[i]} + \sum_l X_{il} \beta_l
where i runs over production orders and j[i] labels the category of order i. X_{il} is a sparse binary matrix indicating whether binary label l applies to row i, and \beta_l is the corresponding effect parameter.
I further impose a hierarchical structure on \alpha, such that
\alpha_{j} \sim \text{normal}(\mu_1^{k[j]},\sigma_1^{k[j]})\\ \mu_1^{k} \sim \text{normal}(\mu_0, \sigma_0)\\
where k[j] labels the supercategory. I use the following priors:
\mu_0 \sim \text{normal}(0, 10)\\ \beta \sim \text{normal}(0, 10)
All other parameters take flat improper priors.
My questions are as follows:

Are my model assumptions sound in light of the above description? I am a bit confused about the asymmetry between \alpha and \beta w.r.t. the hierarchical grouping. Would it make more sense to also group \beta à la Section 9.9 of the Stan manual? It is reasonable to assume that the covariates have a different impact depending on the category, and I would like to avoid bias in the estimation of \beta due to the large asymmetry between the categories in the data. Ultimately, however, I want one effect parameter per covariate. Imposing the grouping on \beta would greatly expand the number of parameters (see next question).
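For concreteness, a minimal sketch of what grouped slopes could look like (this is a tentative extension, not part of my current model; the names `tau`, `beta_raw`, and `beta_j` are mine), with category-level effects partially pooled around the global `beta`:

```stan
parameters {
  vector[L] beta;            // global covariate effects
  vector<lower=0>[L] tau;    // between-category spread of each effect
  matrix[J, L] beta_raw;     // standardized category-level offsets
}
transformed parameters {
  matrix[J, L] beta_j;       // category-specific effects
  for (j in 1:J)
    beta_j[j] = beta' + beta_raw[j] .* tau';
}
model {
  to_vector(beta_raw) ~ normal(0, 1);
  beta ~ normal(0, 10);
  // likelihood: each order uses the slopes of its own category
  for (i in 1:I)
    y[i] ~ bernoulli_logit(alpha[i2j[i]] + X[i] * beta_j[i2j[i]]');
}
```

With J ~ 10k and L ~ 1000 this is ~10M parameters for beta_j alone, which is exactly the blow-up I worry about.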

I have ~100k records, ~10k categories, ~10 supercategories, and would like to use as many binary labels as possible (at least ~100, preferably ~1000). To crunch through the data set, I use ADVI. Is there any reason to be skeptical about using the variational approach for this type of model?
For reference, my implementation is:
data {
  // I >= J >= K >= 1
  int<lower=1> K;                // number of supercategories
  int<lower=K> J;                // number of categories
  int<lower=J> I;                // number of data points
  int<lower=1> L;                // number of binary predictors
  matrix[I, L] X;                // binary covariates
  int<lower=1, upper=J> i2j[I];  // category of each order
  int<lower=1, upper=K> j2k[J];  // supercategory of each category
  int<lower=0, upper=1> y[I];    // failure indicator
}
parameters {
  real mu0;
  real<lower=0.001> sigma0;
  vector[K] mu1;
  vector<lower=0.001>[K] sigma1;
  vector[J] alpha;
  vector[L] beta;
}
model {
  // priors
  beta ~ normal(0, 10);
  mu0 ~ normal(0, 10);
  mu1 ~ normal(mu0, sigma0);
  // multi-indexing replaces explicit loops over categories and orders
  alpha ~ normal(mu1[j2k], sigma1[j2k]);
  y ~ bernoulli_logit(alpha[i2j] + X * beta);
}