I am estimating a logistic regression model in Stan. Since Stan does not natively support factor variables, I have to assign a separate coefficient to each category of my categorical predictors, and I am puzzled about how to interpret these coefficients.
In my case, the data are defined as follows:
| Variable label | Value labels | Frequency |
|---|---|---|
| Whether one is willing to have a third child | 0:no;1: yes | 0:4758; 1:484 |
| Province | 1:Anhui; 2:Beijing; 3:Fujian; 4:Gansu; 5:Guangdong; 6:Guangxi; 7:Guizhou; 8:Hainan; 9:Hebei; 10:Henan; 11:Heilongjiang; 12:Hubei; 13:Hunan; 14:Jilin; 15:Jiangsu; 16:Jiangxi; 17:Liaoning; 18:Neimenggu; 19:Ningxia; 20:Qinghai; 21:Shandong; 22:Shanxi; 23:Shaanxi; 24:Shanghai; 25:Sichuan; 26:Tianjin; 27:Xinjiang; 28:Yunnan; 29:Zhejiang; 30:Chongqing | 1:217;2:158;3:129;4:117;5:295;6:182;7:188;8:2;9:212;10:316;11:263;12:220;13:261;14:222;15:234;16:244;17:190;18:44;19:63;20:35;21:299;22:179;23:166;24:203;25:224;26:89;27:9;28:218;29:154;30:109 |
| Age | 1: [20,25); 2: [25,30); 3: [30,35); 4: [35,40);5: [40,45); 6: [45,50) | 1:551; 2:738; 3:839; 4:901; 5:993; 6:1220 |
| Gender | 1:Male; 2:Female | 1:2434; 2:2808 |
| Hukou | 1:Urban; 2:Rural | 1:1832; 2:3410 |
My code is as follows:

```stan
data {
  int<lower=0> N;                                // number of respondents
  array[N] int<lower=1, upper=30> state;         // province index
  array[N] int<lower=1, upper=6> age;            // age-group index
  array[N] int<lower=1, upper=2> gender, urban;  // 1 = male / urban, 2 = female / rural
  array[N] int<lower=0, upper=1> y;              // 1 = willing to have a third child
  real tau1, tau2;                               // prior scales supplied as data
}
parameters {
  real alpha, delta, epsilon;
  real<lower=0, upper=10> sigma_beta;
  vector[30] beta;                               // province effects
  real<lower=0, upper=10> sigma_gamma;
  vector[6] gamma;                               // age-group effects
}
model {
  vector[N] eta;
  for (i in 1:N) {
    eta[i] = alpha
             + beta[state[i]]
             + gamma[age[i]]
             + (gender[i] == 1 ? delta : -delta)      // +delta for male, -delta for female
             + (urban[i] == 1 ? epsilon : -epsilon);  // +epsilon for urban, -epsilon for rural
  }
  y ~ bernoulli_logit(eta);
  {alpha, delta, epsilon} ~ normal(0, tau1);
  beta ~ normal(0, sigma_beta);
  gamma ~ normal(0, sigma_gamma);
  {sigma_beta, sigma_gamma} ~ gamma(2, 2 / tau2);
}
```
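
To make the ±delta coding concrete, here is a minimal Python sketch that mirrors the `eta[i]` expression from the model block. The function name `linear_predictor` and all parameter values are hypothetical placeholders chosen for illustration, not estimates. It only shows that, with everything else held fixed, the male-female difference in log-odds is 2 × delta.

```python
def linear_predictor(alpha, beta_state, gamma_age, delta, epsilon, gender, urban):
    """Mirror of the eta[i] expression in the Stan model block."""
    gender_term = delta if gender == 1 else -delta    # 1 = male, 2 = female
    hukou_term = epsilon if urban == 1 else -epsilon  # 1 = urban, 2 = rural
    return alpha + beta_state + gamma_age + gender_term + hukou_term

# Placeholder values (assumptions for illustration only)
params = dict(alpha=-2.0, beta_state=0.3, gamma_age=0.2, delta=0.1, epsilon=-0.2)

eta_male = linear_predictor(**params, gender=1, urban=1)
eta_female = linear_predictor(**params, gender=2, urban=1)

# The contrast depends only on delta, not on the other terms
gender_contrast = eta_male - eta_female
print(f"male - female log-odds contrast: {gender_contrast:.3f}")
```

The same reasoning applies to hukou, where the urban-rural contrast is 2 × epsilon.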
The results, in table form, are as follows:
| variable | mean | sd | q5 | q95 | rhat | ess_bulk |
|---|---|---|---|---|---|---|
| alpha | -0.207 | 0.934 | -1.728 | 1.327 | 1.003 | 4424.29 |
| delta | 0.082 | 0.05 | 0.002 | 0.165 | 1.002 | 6526.39 |
| epsilon | -0.185 | 0.059 | -0.283 | -0.087 | 1.002 | 5847.307 |
| gamma[1] | -0.492 | 0.243 | -0.915 | -0.12 | 1.002 | 1804.64 |
| gamma[2] | -0.24 | 0.224 | -0.614 | 0.122 | 1.002 | 1396.916 |
| gamma[3] | 0.095 | 0.217 | -0.259 | 0.454 | 1.002 | 1280.061 |
| gamma[4] | 0.04 | 0.216 | -0.316 | 0.392 | 1.004 | 1330.041 |
| gamma[5] | 0.164 | 0.21 | -0.176 | 0.516 | 1.003 | 1228.037 |
| gamma[6] | 0.387 | 0.21 | 0.058 | 0.741 | 1.003 | 1240 |
| zeta[1] | -0.007 | 0.362 | -0.622 | 0.55 | 1.003 | 1609.265 |
| zeta[2] | 0.06 | 0.361 | -0.514 | 0.643 | 1.002 | 1653.061 |
| zeta[3] | -0.166 | 0.366 | -0.847 | 0.314 | 1 | 1692.856 |
My interpretation is as follows:
The coefficient of 0.082 for males corresponds to an odds ratio (OR) of 1.178 (exp(2 × 0.082)). The doubling arises from the symmetric coding in the model, where males (coded 1) receive +δ and females (coded 2) receive -δ, so the male-female difference in log-odds is 2δ. This suggests that men are more willing to have a third child than women. Similarly, the coefficient of -0.185 for urban hukou indicates that individuals with urban registration are less willing to have a third child than their rural counterparts, with an OR of 0.691 (exp(2 × -0.185)). For the 45-49 age group, the coefficient of 0.387 (OR = 1.47) should be interpreted relative to the average level across all age groups, because the normal(0, sigma_gamma) prior centers the age coefficients around zero. Note that this is a soft centering induced by the prior rather than a hard sum-to-zero (effect-coding) constraint.
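
As a quick check of the arithmetic, the sketch below recomputes the odds ratios from the posterior means in the results table. Two caveats apply. First, the variable names are mine. Second, exponentiating a posterior mean is only an approximation: transforming each posterior draw and then summarizing would be more faithful. The sketch also computes the mean of the six gamma estimates, to see how close the age effects are to being centered at zero.

```python
import math

# Posterior means copied from the results table
delta, epsilon = 0.082, -0.185
gamma = [-0.492, -0.240, 0.095, 0.040, 0.164, 0.387]

# Gender and hukou contrasts are 2 * coefficient because of the +/- coding
or_male_vs_female = math.exp(2 * delta)
or_urban_vs_rural = math.exp(2 * epsilon)

# Age 45-49 relative to the intercept, and relative to the mean age effect
or_age6_vs_intercept = math.exp(gamma[5])
gamma_mean = sum(gamma) / len(gamma)
or_age6_vs_age_mean = math.exp(gamma[5] - gamma_mean)

print(f"OR male vs female:       {or_male_vs_female:.3f}")
print(f"OR urban vs rural:       {or_urban_vs_rural:.3f}")
print(f"OR age 45-49 (gamma[6]): {or_age6_vs_intercept:.3f}")
print(f"mean of gamma estimates: {gamma_mean:.4f}")
print(f"OR age 45-49 vs age mean: {or_age6_vs_age_mean:.3f}")
```

Because the mean of the gamma estimates is close to zero, the two age comparisons give nearly the same odds ratio here.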
Following the rule of thumb in Gelman and Hill (2007), I divide each log-odds contrast (2δ, 2ε, and γ[6]) by four to approximate the effects on the probability scale, yielding 4.1, -9.3, and 9.7 percentage points, respectively. Thus, ceteris paribus, the estimated probability of being willing to have a third child is approximately 4.1 percentage points higher among men than women, 9.3 percentage points lower among urban residents than rural residents, and 9.7 percentage points higher among adults aged 45-49 than the average across all age groups.
I am not sure whether my interpretation is correct. Any comments would be greatly appreciated.
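
One thing worth checking is how tight the divide-by-four rule is for these data. The rule uses the maximum slope of the inverse logit (0.25, reached at a probability of 0.5), so it gives an upper bound on the probability difference. Here only 484 of 5242 respondents answered yes, so the slope p(1 - p) at the sample base rate is about 0.084 rather than 0.25. The sketch below compares the two, under the simplifying assumption (mine) that a typical respondent sits at the sample base rate. A fuller answer would average predicted probabilities over posterior draws and the observed covariates.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

# Sample base rate from the frequency table: 484 "yes" out of 4758 + 484
base_rate = 484 / (4758 + 484)
center = logit(base_rate)  # assumed log-odds of a "typical" respondent

# Posterior means from the results table
delta, epsilon, gamma6 = 0.082, -0.185, 0.387

# Divide-by-four approximation applied to each log-odds contrast
divide_by_4 = {
    "male vs female": 2 * delta / 4,
    "urban vs rural": 2 * epsilon / 4,
    "age 45-49 vs baseline": gamma6 / 4,
}

# Probability differences evaluated around the base rate
exact = {
    "male vs female": inv_logit(center + delta) - inv_logit(center - delta),
    "urban vs rural": inv_logit(center + epsilon) - inv_logit(center - epsilon),
    "age 45-49 vs baseline": inv_logit(center + gamma6) - inv_logit(center),
}

for name in divide_by_4:
    print(f"{name}: divide-by-4 = {divide_by_4[name]:+.3f}, "
          f"at base rate = {exact[name]:+.3f}")
```

Under this assumption, each probability difference comes out smaller in magnitude than the divide-by-four figure, which suggests the percentage-point effects quoted above are better read as upper bounds.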