Bayesian logistic regression workflow when facing complete separation

With one year of experience in frequentist statistics, I recently dived into the world of Bayesian regression because my dataset has complete separation and confounding issues. Most discussions on the separation issue suggest using Bayesian models to regularise the inflated standard errors so that meaningful inferences can be drawn from the model output. However, I’m unsure what the workflow for Bayesian regression looks like in terms of prior selection, model selection, and data transformation.

To my understanding, different priors will influence the posterior estimation, so prior selection is crucial, especially since my dataset has complete separation issues. As for model selection, there are mainly two approaches. One is building up from a simple model and comparing loo() with and without each predictor to see which predictors stay in the model (similar to forward/step-up selection in frequentist models). The other is projection predictive feature selection: fit a reference model and see which submodel performs similarly to it (somewhat like backward elimination in the frequentist approach, but not quite). But I’m unsure what to do first.

Let’s say my logistic regression model looks like this:

y ~ x1 + x2 + x3 + x4 + (1|group)

where x1, x2, and x3 are categorical predictors with multiple categories, x1 and x3 have complete separation issues, x4 is a continuous predictor that I’m unsure whether to transform, and group is the random effect.
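In brms syntax (assuming my data frame is called dat), I’m fitting something like this:

library(brms)

fit <- brm(
  y ~ x1 + x2 + x3 + x4 + (1 | group),
  data = dat, family = bernoulli(),
  # weakly informative prior on all population-level coefficients
  prior = prior(normal(0, 2.5), class = b)
)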

If I decided to take the “build from simple” approach (i.e., start from y ~ 1), should I:

  • First decide which predictors stay, using a weakly informative prior (e.g. normal(0, 2.5)), and then, once I have the final model, test which prior works best for the complete separation, or

  • Test whether x1 should stay with a weakly informative prior; if it stays, test which prior works best for x1, and then move on to x2?

And if I decided to use projection predictive feature selection, should I:

  • Test which prior works best for all the predictors in the reference model first, and then do projection predictive feature selection, or

  • Do projection predictive feature selection with weakly informative priors first, and after I have the best submodel, test which prior works better for the remaining predictors?
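By “comparing loo()” above I mean something like the following, assuming two candidate fits fit1 and fit2:

loo_compare(loo(fit1), loo(fit2))  # elpd difference and its standard error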

My next question is: where does data transformation (e.g. whether or not to log- or sqrt-transform x4) fit into this model selection workflow, for both the “build from simple” and projection predictive feature selection approaches?

My final questions are for complete separation:

  • What are the common priors for dealing with complete separation? Some people say weakly informative priors such as normal(0, 2.5) or student_t(7, 0, 2.5) could work, and some suggest a horseshoe prior or other more “penalised” priors.
  • Should I use a universal prior for all the predictors, or is it better to use special priors only on the predictors that have complete separation?
  • How do I know I have actually “solved” the complete separation issue?

Sorry for spamming questions in this essay-long post. I must admit that, as a beginner in Bayesian statistics, I still lack a huge chunk of knowledge about Bayesian regression, and often when I read posts online I end up with even more questions. But I’m eager to learn more, and if you feel I haven’t done enough background research for these questions, please don’t hesitate to point me to the references.

Welcome to Stan discourse. You have many questions, but they are all sensible and you have structured your post nicely. Here are some quick answers:

  • The first step in model selection is to think about whether model selection is needed at all. If the goal is prediction, it’s best to use all covariates and a good prior. If the goal is posterior inference for one or a few interpretable parameters, it’s also best to use all covariates and a good prior. Only if there is a significant cost to measuring covariates in the future, or a cost to explaining a complex model, may model selection be useful.
  • If computation is not an issue, it’s best to start by building the model that includes all the components assumed to have a non-zero probability of being relevant, and to use a good prior.
  • If there is complete separation, you can use prior predictive checking to choose priors that don’t put too much prior mass on probabilities very close to 0 and 1 (see the sketch after this list).
  • normal(0, 2.5) is a good prior for a single covariate, but if you have many covariates, the prior predictive variance of the linear predictor grows as 2.5^2 times the number of predictors (for standardized covariates), which is not a good prior with many predictors.
  • A too-wide prior on the coefficients strongly favors predictive probabilities very close to 0 and 1.
  • In the case of complete separation, the prior determines how close to 0 and 1 the predictive probabilities can get. The separation issue is solved when you have a prior that matches your assumptions about how close to 0 and 1 the predictive probabilities could be.
  • Regularized horseshoe and R2D2 priors are good, as they are joint priors on the coefficients and their prior predictive distribution behaves nicely as the number of covariates increases (examples after this list).
  • After you have your best big model, you can do posterior and LOO predictive checking and calibration checking. When you are happy with the big model, you may proceed to variable selection if needed.
  • projpred is the best approach for variable selection (when it is computationally applicable and the specific observation model is supported by the software); see the sketch after this list.
  • If you have only a few models, you may also use cross-validation for model comparison
  • Instead of a data transformation, consider using a spline (s(x4)), which makes the non-linear transformation automatically. The default splines in brms are really good.
  • See several case studies on variable selection at Aki Vehtari - Case studies
  • For further reading about cross-validation and many models, see Cross-validation FAQ
  • Ask more in case I missed something
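As a minimal sketch of the prior predictive check mentioned above (assuming your data frame is called dat): fit the full model with sample_prior = "only", so the likelihood is ignored, and look at where the implied predictive probabilities fall.

library(brms)

# Draw from the prior alone; the data are used only for the design matrix.
fit_prior <- brm(
  y ~ x1 + x2 + x3 + s(x4) + (1 | group),     # s(x4) as suggested above
  data = dat, family = bernoulli(),
  prior = c(
    prior(normal(0, 1), class = b),           # population-level coefficients
    prior(student_t(3, 0, 2.5), class = sd)   # group-level standard deviation
  ),
  sample_prior = "only"
)

# Prior predictive probabilities; if this histogram piles up at 0 and 1,
# the prior is too wide.
p_prior <- posterior_epred(fit_prior)
hist(as.vector(p_prior), breaks = 50)

The joint priors I mentioned are available in brms as, for example, prior(horseshoe(df = 1, par_ratio = 0.3), class = b) or prior(R2D2(mean_R2 = 0.5, prec_R2 = 2), class = b); the hyperparameter values here are just placeholders to adapt to your own assumptions. And a projpred sketch, assuming fit_ref is the full model with a good prior:

library(projpred)

vs <- cv_varsel(fit_ref, method = "forward")  # cross-validated forward search
plot(vs, stats = "elpd")   # predictive performance vs. submodel size
suggest_size(vs)           # heuristic for how many predictors to keep
summary(vs)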

@avehtari gave a great answer. I just want to chime in to add that:

If you have complete separation along both x1 and x3 separately, you will never be able to use model selection to understand which one (or both) is likely to be causal. Model selection will generally prefer whichever of x1 or x3 has fewer levels, but it won’t be able to tell you anything further. It also won’t be able to tell you whether any other parameters are important or not in a model that also contains either x1 or x3. And unless x1 and x3 both have a large number of levels, it is pretty unlikely that the selection procedures you describe would entertain models that omit both x1 and x3 simultaneously.

I also want to re-affirm your caution about the choice of prior. Here’s a dataset with complete separation along a single continuous variable. The likelihood essentially rules out the red lines, and it says that the blue lines are more likely than the orange lines. But it has nothing to say about how arbitrarily steep the blue lines should be. That’s what the prior does: it sets a probabilistic bound on how steep the blue lines can get. There is literally no other information in the data about how steep these blue lines might be, so you’ll just probabilistically trim off the blue lines that you deem “too steep” via the prior.
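Here’s a tiny sketch of that effect with hypothetical data (not the exact figure above): under complete separation, the posterior for the slope is driven almost entirely by the prior scale.

library(brms)

# Hypothetical completely separated data: y is 0 below x = 0 and 1 above.
dat_sep <- data.frame(x = seq(-3, 3, length.out = 40))
dat_sep$y <- as.integer(dat_sep$x > 0)

# The same model under two prior scales for the slope.
fit_narrow <- brm(y ~ x, data = dat_sep, family = bernoulli(),
                  prior = prior(normal(0, 2.5), class = b))
fit_wide   <- brm(y ~ x, data = dat_sep, family = bernoulli(),
                  prior = prior(normal(0, 25), class = b))

# The likelihood cannot bound the slope from above, so the wider prior
# yields a much larger posterior slope (curves nearer 0 and 1).
posterior_summary(fit_narrow, variable = "b_x")
posterior_summary(fit_wide,   variable = "b_x")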


I support @jsocolar’s point about the role of the prior in complete separation and other tasks of out-of-sample inference/prediction. For me, this really calls for an informative prior, so the information you use to aid the decision-maker will be a mix of data and expert opinion. I favour a consensus approach with a panel of experts, though I realise that is a time-consuming, potentially expensive process that might even be deemed a threat to intellectual property. Still, I like it: it’s not my money or IP, and I charge by the hour (#humor). You might like to look at the SHELF website https://shelf.sites.sheffield.ac.uk/ and their book “Uncertain Judgements”. As time has gone by, I have moved from less to more informative priors (I didn’t start that way; I am not a “boomer statistician”, just FYI). It’s sensitivity analysis doubleplus.