Hi all, I am trying to model a binary outcome using some factor variables each having various levels. One can simply fit such a model with glm like:

fit = glm(y ~ . , family = binomial)

But then glm automatically assigns one level of each factor variables to the intercept to avoid singularity and that causes an arbitrary reference for comparison which is not desirable. Instead I want to formulate the problem as follows:

data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
int<lower=1,upper=2> x_gender[N];
}
parameters {
real mu;
real<lower=0> sigma;
vector[2] beta;
}
model {
mu ~ normal(0, 100);
for (l in 1:2){
beta[l] ~ normal(0, sigma);
}
for (n in 1:N){
y[n] ~ bernoulli(inv_logit(mu + beta[x_gender[n]]));
}
}

This hierarchical structure allows the effect for each level of the factors to be compared to the overall average (here for simplicity I only included one factor variable which is gender). But then I learned that this structure is not good because of the Neal’s funnel. So I re-parameterised the model as it is recommended in the Stan manual to deal with that issue:

data {
int<lower=0> N;
int<lower=0,upper=1> y[N];
int<lower=1,upper=2> x_gender[N];
}
parameters {
real mu;
real<lower=0, upper=1> sigma;
vector[2] beta_raw;
}
transformed parameters {
vector[2] beta;
// implies: beta ~ normal(0, sigma_beta)
beta = sigma * beta_raw;
}
model {
mu ~ normal(0, 100);
beta_raw ~ std_normal();
for (n in 1:N){
y[n] ~ bernoulli(inv_logit(mu + beta[x_gender[n]]));
}
}

Now the problem is that I am getting unexpected uncertainty intervals for betas, and very high correlation between the posterior samples of betas and mu which I think causes the problem. Is there any solution to this problem? I also noticed that in the “Gelman and Hill Data Analysis using regression and hierarchical models” book page 303 they used this structure but not using a hierarchical structure for the variables with only two levels. Does anyone know why?

I see two issues. First, you are trying to model a distribution of betas when you only have 2, which is difficult and not advised. Second, you have a mu term that to my eyes serves no purpose and is essentially leaving your model unidentified but for your very flat Normal(0,100) prior. In a classical model with no prior your model would be straight up unidentified, and in this case mu is so weakly regularized that the sampler is free to roam from mu being negative several hundred to positive several hundred with minimal impact on the posterior. The upshot of that is mu+beta is essentially two free parameters solving one equation, so mu and beta will be almost perfectly negatively correlated.

Fundamentally, you should probably stop trying to model beta unless you are going to get more than 2 values, and you should remove mu, or more strongly identify it with some data.

@jack_monroe The purpose of mu is to represent the average response in the population and group effects are reported as deviation from the average. In other words, instead of taking a subgroup as reference (say males, employed, white, etc) to report the other groups effects with respect to this arbitrary reference, you look for effects that significantly deviate from the population mean and this way the results would be more insightful compared to reporting with reference to an arbitrary subgroup.

Regarding the identifiability issue, yes if we assign a mu_beta to the beta parameters to be sampled from then the model doesn’t even converge. But here we are centring them at zero to avoid that and my model converges with no problem at all.

Regarding sigma, that’s right, with only two groups it is not possible to estimate the variance, and that’s why I am using an informative prior for it by constraining it to (0-1), otherwise the model won’t converge.

@Aminsn hello, under this structure, I think that a very high sample correlation between beta and mu is guaranteed, because mu contains the information that scales the value of beta to the observations.

I agree with @jack_monroe that this seems too poorly identified to do anything with. If you’re set against dummy coding then there are other encoding options for categorical variables including deviation coding (R syntax for them described here). I’m not sure I agree that these give more useful results in general though. Using posterior prediction rather than the coefficients of the model directly, you should be able to obtain estimates in terms that you want. I think the utility of hierarchical modelling for variables like this is to introduce partial pooling, rather than change the expression of the results.