Question: Diagnostic status used in both interaction and derived predictor — better to use split models?

Hi all,

I’m working on a brms model analyzing social behavior (Gregariousness, binary outcome) in a population of reptiles. Two key predictors of interest are:

  • Disease status of the focal individual (Diagnostic: Healthy vs Symptomatic)
  • Local disease prevalence (Local_Prop_Diseased), defined as the number of nearby diseased individuals divided by the total number of nearby individuals

In my initial full model:

modelB <- brm(
  bf(Gregariousness ~ Diagnostic * scale(Local_Prop_Diseased) + 
       scale(Local_Num_Total) + 
       scale(ISCE) + 
       SEASON + Sex + 
       (1 | ID) + (1 | datetime) + (1 | FIELD_SEAS) + (1 | Location)),
  family = bernoulli(),
  data = final_data1,
  iter = 5000, warmup = 1000, chains = 4, cores = corecount, thin = 4
)

I became concerned that this model might be conceptually circular: the Diagnostic variable is used both as a predictor and to construct Local_Prop_Diseased (via Local_Num_Diseased). Since they are also interacted, this raises statistical and biological concerns — e.g., potential for inflated or confounded effects.

Split Model Approach: Healthy and Symptomatic Individuals Separately

To address this, I ran separate models for Healthy and Symptomatic individuals:

# Healthy individuals only
model_healthy <- brm(
  Gregariousness ~ scale(Local_Prop_Diseased) + 
    scale(Local_Num_Total) + 
    Sex + SEASON + 
    (1 | ID) + (1 | FIELD_SEAS) + (1 | Location),
  data = subset(final_data1, Diagnostic == "Healthy"),
  family = bernoulli(),
  iter = 4000, warmup = 2000, chains = 4, cores = corecount
)

# Symptomatic individuals only
model_diseased <- brm(
  Gregariousness ~ scale(Local_Prop_Diseased) + 
    scale(Local_Num_Total) + 
    Sex + SEASON + 
    (1 | ID) + (1 | FIELD_SEAS) + (1 | Location),
  data = subset(final_data1, Diagnostic == "Symptomatic"),
  family = bernoulli(),
  iter = 4000, warmup = 2000, chains = 4, cores = corecount
)

These split models avoid the circularity, and the results are consistent and interpretable. For example, Healthy individuals increase gregariousness with local disease prevalence, while Symptomatic individuals decrease — aligning with biological expectations.

Optional: If someone would like to and has the time :). Here’s a minimal simulated example to illustrate the structure and the concern.

set.seed(123)
n <- 100
sim_data <- data.frame(
  ID = factor(rep(1:20, each = 5)),
  Diagnostic = sample(c("Healthy", "Symptomatic"), n, replace = TRUE),
  Local_Num_Total = sample(5:10, n, replace = TRUE)
)

# Local_Num_Diseased depends on Diagnostic
sim_data$Local_Num_Diseased <- ifelse(sim_data$Diagnostic == "Symptomatic",
                                      rbinom(n, sim_data$Local_Num_Total, 0.5),
                                      rbinom(n, sim_data$Local_Num_Total, 0.2))

sim_data$Local_Prop_Diseased <- sim_data$Local_Num_Diseased / sim_data$Local_Num_Total
sim_data$Gregariousness <- rbinom(n, 1, prob = 0.5)

# The model that reflects the structure in question
model_sim <- brm(
  Gregariousness ~ Diagnostic * scale(Local_Prop_Diseased),
  data = sim_data,
  family = bernoulli(),
  chains = 2, iter = 1000
)

My Questions:

  1. Is it statistically valid and acceptable to use split models in this context to avoid circularity, even though I lose the explicit interaction?
  2. Would a better solution be to restructure the full model — e.g., redefine Local_Prop_Diseased using external diagnostic data or treat Diagnostic as a latent variable?
  3. Is the gain in interpretability and stability from the split models sufficient justification, especially when the interaction in the full model is not meaningful due to confounding?

I’d be grateful for any advice or shared experience with similar modeling structures.

Thank you so much for your time and insights — I’d really appreciate hearing how others have approached similar situations!

Thanks in advance!

Hello Felix,

I do not understand what you mean by circularity. If Diagnostic is an individual’s disease status, how is it being used in Local_Prop_Diseased? Do you mean that the probality of an individual being symptomatic increases with prevalence?

Is Local_Num_Total a measure of population size/density? What is ISCE?

What question are you trying to answer with your model?