Priors and variable-scaling in logistic regression - categorical variables with more than two factors?

I am a PhD student using logistic regression to investigate mental health epidemiology. Since participants in my cohort study either have a diagnosis or not (coded 1 or 0), I’m using logistic regression to estimate the association of mental health disorders with some categorical exposures.

Reading Gelman et al. (2008), I understand that one approach to Bayesian logistic regression (not hierarchical) is to standardize the input variables. They recommend scaling the variables before setting priors, as follows:

  • “Binary inputs shifted to have a mean of 0 and to differ by 1 in their lower and upper conditions. (For example, if a population is 10% African-American and 90% other, we would define the centered “African-American” variable to take on the values 0.9 and −0.1.)”
  • “Other inputs are shifted to have a mean of 0 and scaled to have a standard deviation of 0.5. This scaling puts continuous variables on the same scale as symmetric binary inputs (which, taking on the values ±0.5, have standard deviation 0.5).”
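The two rescaling rules quoted above can be sketched in a few lines. This is plain Python for illustration only, not code from the paper; the function names `center_binary` and `scale_to_half_sd` and the `age` values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def center_binary(x):
    """Shift a 0/1 input to mean 0; its two values still differ by 1."""
    m = mean(x)
    return [xi - m for xi in x]

def scale_to_half_sd(x):
    """Shift to mean 0 and rescale to standard deviation 0.5."""
    m = mean(x)
    s = pstdev(x)  # assumption: population SD; the sample SD would differ slightly
    return [0.5 * (xi - m) / s for xi in x]

# The quoted example: 10% of the population in the group, 90% not.
group = [1] * 10 + [0] * 90
centered = center_binary(group)  # roughly 0.9 for group members, -0.1 otherwise

# A made-up continuous input, e.g. age in years.
age = [23, 35, 41, 58, 62, 19, 47]
age_scaled = scale_to_half_sd(age)
```

Note that the shifted values depend on the observed proportions, so the same dummy variable would be shifted differently in a sample with a different composition.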

Once the data are scaled in this way, Gelman et al. (2008), and Gelman again in the Stan Prior Choice Guidance, recommend using a (scaled) Student’s t distribution with 3 < \nu < 7 as a weakly informative prior for the coefficients in the linear predictor.

BUT what if you have a multi-category variable: say ethnicity? Let me illustrate with an example:
What if I have the following ethnicities (UK-context, and by no means a full list of ethnicities!):

  • White
  • Black-Caribbean
  • Black-African
  • South-Asian
  • East-Asian
  • Other

Now say I wanted to look at ethnicity as a predictor of the presence of a mental health disorder (a dichotomous variable: 1 or 0). I would normally fit a model using dummy variables for ethnicity:

\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i}

where x_1 is whether you are Black-Caribbean (1) or not (0), and so on up to x_4 (East-Asian or not) and x_5 (Other or not), with the reference category being White. This makes sense in terms of the research question, as we are comparing the mental health of minority ethnicities with that of the dominant ethnicity.

This is where I get stuck. Suppose I shift the binary/dummy variable for each ethnicity as described (a dummy that is 10% ones and 90% zeros would be shifted to take on the values 0.9 and −0.1). This scaling changes the value of the intercept. Previously the intercept was the logit (logit(.)) of the probability p that White subjects had a mental health disorder. Since the coefficient estimates can only be interpreted as differences in log-odds (log odds ratios) compared to White participants, I am confused about what the benefit of scaling is. Below is the Stan code for the simplest possible version of this model. There are K predictors, which represent P-1 categories.

To summarize:
Scaling the variables by shifting them as Gelman et al. (2008) describe changes the intercept. How is this intercept now interpreted?

Should this method for scaling variables be used for multi-category variables (so that a weakly informative prior can be set on them all)?

If this is not suitable, are there any other guides/references on how to set weakly informative priors for categories with more than two factors?

Any ideas about how weakly informative priors are extended to hierarchical logistic regression models?


data {
  int<lower=0> N;                    // number of participants
  int<lower=0> K;                    // number of fixed-effect predictors (inc. intercept)
  matrix[N, K] X_mat;                // design matrix
  array[N] int<lower=0, upper=1> y;  // binary outcome (array syntax, Stan 2.26+)
}
parameters {
  // fixed-effect coefficients in the linear predictor
  vector[K] beta;
}

model {
  // priors
  beta ~ student_t(5, 0, 1);
  // Bernoulli likelihood on the logit scale
  y ~ bernoulli_logit(X_mat * beta);
}
****

Hi, sorry for not getting to you earlier, your question is relevant and well written.
This is slightly out of my expertise, but since nobody else responded I’ll give it a try.

I’ll start by noting that there are many ways to code factor predictors for regression. One that is related (or maybe even identical) to the approach described by Gelman is, IMHO, “effect coding”, but see also an alternative at Symmetric weakly informative priors for categorical predictors - #2 by jsocolar.
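For readers unfamiliar with the term, here is a minimal sketch of effect (sum-to-zero) coding next to ordinary dummy coding, using the six categories from the question. The helper names and the log-odds values are made up for illustration, and coding the first level as −1 is just one convention (some software uses the last level instead).

```python
from statistics import mean

levels = ["White", "Black-Caribbean", "Black-African",
          "South-Asian", "East-Asian", "Other"]

def dummy_row(level):
    """Dummy (treatment) coding: White is the reference, coded 0 in every column."""
    return [1.0 if level == other else 0.0 for other in levels[1:]]

def effect_row(level):
    """Effect (sum-to-zero) coding: one level (here White) is coded -1 in every column."""
    if level == levels[0]:
        return [-1.0] * (len(levels) - 1)
    return dummy_row(level)

# Made-up category log-odds, purely for illustration.
eta = dict(zip(levels, [-2.0, -1.5, -1.8, -2.2, -2.4, -1.9]))

# Under effect coding the intercept is the unweighted mean of the category
# log-odds, and each coefficient is that category's deviation from it.
intercept = mean(list(eta.values()))
beta = [eta[level] - intercept for level in levels[1:]]

def linear_predictor(level):
    """Rebuild a category's log-odds from the effect-coded parameters."""
    return intercept + sum(b * x for b, x in zip(beta, effect_row(level)))
```

Calling `linear_predictor(level)` recovers `eta[level]` for every category, including White, whose deviation from the mean is implied by the other five coefficients.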

If you do effect coding with balanced predictors, then the intercept corresponds to the mean value for the whole population. I don’t think that the rescaling above maintains the interpretation, but I think it could be close.
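To make the "could be close" a bit more concrete, here is a short derivation, assuming each dummy x_k is simply shifted by its sample mean \bar{x}_k (the proportion of the sample in category k) and nothing else changes. The symbols \beta_0^{c}, \pi_j and \eta_j are notation introduced here, not taken from the paper.

```latex
\beta_0 + \sum_{k=1}^{5} \beta_k x_{ki}
  = \Big( \beta_0 + \sum_{k=1}^{5} \beta_k \bar{x}_k \Big)
    + \sum_{k=1}^{5} \beta_k \,(x_{ki} - \bar{x}_k)

% The slopes are unchanged; the intercept of the centered model is
\beta_0^{c} = \beta_0 + \sum_{k=1}^{5} \beta_k \bar{x}_k
            = \sum_{j=0}^{5} \pi_j \, \eta_j

% where \pi_j is the sample proportion in category j (j = 0 is the reference),
% \eta_0 = \beta_0 and \eta_k = \beta_0 + \beta_k are the category log-odds.
```

So under this shift the intercept is the proportion-weighted average of the category log-odds, which matches the unweighted average only when the categories are balanced. It is an average on the logit scale, not the logit of the overall prevalence.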

In any case, I often find it hard to interpret model coefficients on their own and prefer using model predictions for interpretation.
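As a rough sketch of what interpreting via predictions could look like for a dummy-coded model: convert the coefficients into a predicted probability per category and compare those. The coefficient values below are made up and stand in for a single posterior draw; `category_probabilities` is a hypothetical helper, not part of any library.

```python
import math

def inv_logit(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(beta0, betas, levels):
    """Predicted probability per category under dummy coding.

    levels[0] is the reference (log-odds beta0); levels[k] has
    log-odds beta0 + betas[k - 1].
    """
    probs = {levels[0]: inv_logit(beta0)}
    for level, b in zip(levels[1:], betas):
        probs[level] = inv_logit(beta0 + b)
    return probs

levels = ["White", "Black-Caribbean", "Black-African",
          "South-Asian", "East-Asian", "Other"]

# Made-up coefficients standing in for one posterior draw.
beta0 = -2.0
betas = [0.5, 0.2, -0.2, -0.4, 0.1]

probs = category_probabilities(beta0, betas, levels)
risk_difference = probs["Black-Caribbean"] - probs["White"]
```

With a fitted model one would repeat this for every posterior draw and summarise the resulting distribution. The probabilities are on the same scale whichever coding was used, which makes results from different codings easier to compare.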

If you have enough data to inform the coefficients well, then I don’t think exact choice of coding should affect your results much. I don’t think this option is particularly bad in any case, but if you find the choice of priors/coding influences your inferences, it is definitely worth investigating.

I think a bit more context would be necessary: what kind of hierarchical model do you have in mind? And which coefficients do you want to set priors for?

Also tagging @andrewgelman as he might be the best person to summarise his views :-D

Hope this helps at least a bit!

Many thanks @martinmodrak - I found the links you shared very useful.

Specifically I am looking to conduct an individual patient data meta-analysis, which is essentially a hierarchical model with each study/trial representing a group/cluster within the hierarchical model, and with a treatment variable allowed to be heterogeneous across studies, but also with participant covariates stratified or with random-effects.
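One possible way to write down the model described above, as a sketch only: the notation is assumed here rather than taken from the thread, with i indexing participants, j indexing studies, t_{ij} the treatment indicator and x_{ij} the participant covariates.

```latex
y_{ij} \sim \mathrm{Bernoulli}(p_{ij})

\operatorname{logit}(p_{ij}) = \alpha_j + (\theta + u_j)\, t_{ij} + \gamma^{\top} x_{ij}

u_j \sim \mathrm{Normal}(0, \tau^2)

% \alpha_j : study-specific intercepts (the "stratified" part)
% \theta   : average treatment effect across studies
% \tau     : between-study standard deviation of the treatment effect
% \gamma   : covariate coefficients, here assumed common to all studies
```

Letting \gamma vary by study, or giving \alpha_j a common distribution, would be variations on the same structure.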

I was just wondering whether centering has to be within-study/cluster or whether it can be across studies/groups.

So that’s definitely out of my experience. I’ve seen that there are some heated debates on whether centering is good at all, and then on how to center. IMHO a good thing to keep in mind is that centering within cluster means you are actually fitting a different model than with centering over the whole data. The question then is which model better represents your question. In some contexts, it could be argued that you really want the predictors to mean the same thing in all clusters, so if you center, the transformation should be the same for all clusters. But I can also see that for some questions you want to center (and maybe even standardize) each cluster separately, so that your effect is against a cluster-specific baseline, because you actually don’t believe that the predictors represent the same thing in different clusters.
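The difference between the two centering strategies can be seen on a toy example. This is a minimal sketch with a made-up binary covariate in two hypothetical studies, A and B; the function names are illustrative only.

```python
from statistics import mean

def grand_mean_center(x):
    """Subtract the overall mean: the same shift for every cluster."""
    m = mean(x)
    return [xi - m for xi in x]

def within_cluster_center(x, cluster):
    """Subtract each cluster's own mean: a different shift per cluster."""
    groups = {}
    for xi, c in zip(x, cluster):
        groups.setdefault(c, []).append(xi)
    cluster_means = {c: mean(values) for c, values in groups.items()}
    return [xi - cluster_means[c] for xi, c in zip(x, cluster)]

# Made-up binary covariate: 50% ones in study A, 75% ones in study B.
x       = [0, 0, 1, 1, 1, 1, 1, 0]
cluster = ["A", "A", "A", "A", "B", "B", "B", "B"]

grand = grand_mean_center(x)                 # x = 1 maps to 0.375 in both studies
within = within_cluster_center(x, cluster)   # x = 1 maps to 0.5 in A, 0.25 in B
```

Under within-cluster centering the same raw value ends up at different centered values in different studies, which is the sense in which the predictor no longer "means the same thing" everywhere. Within a study the shift is a constant, so if each study has its own freely estimated intercept the shift can be absorbed there; the choice matters more once priors or shared intercept distributions are involved.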

I also think that the distinction between different centering strategies becomes less relevant as you allow more heterogeneity between studies (e.g. as more terms are “varying” instead of “fixed”).

In any case, unless you are very sure you’ve hit the “one correct approach”, I would definitely do some sort of multiverse / robustness check to see how sensitive your conclusions are to different centering (or not) strategies.

Best of luck with your model!