I’m trying to work up how to do a test of conditional independence with categorical variables in a multivariate model with brms. If we have the following set of relationships from palmerpenguins
body_mass_g ~ species
flipper_length_mm ~ body_mass_g + species
bill_length_mm ~ body_mass_g
There are four conditional independence relationships
ggdag::dagify(
body_mass_g ~ species ,
flipper_length_mm ~ body_mass_g,
bill_length_mm ~ body_mass_g
) |>
dagitty::impliedConditionalIndependencies()
bl__ _||_ fl__ | bd__
bl__ _||_ spcs | bd__
fl__ _||_ spcs | bd__
Establishing conditional independence for bill_length_mm || flipper_length_mm given body mass is straightforward enough - fit a model with a link between them, then look at the posterior of the coefficient and - I’ve been looking at the overlap of the ROPE to get a conditional independence test.
But - the species ones are harder, as it’s categorical. In thinking about the problem and looking at the literature on conditional independence, one bit that comes back is that, let’s say we were looking at a conditional independence relationship between Y and X. This implies with likelihood that
L(Y | X,Z) = L(Y|Z)
So, in a Bayesian framework, this would imply that the log posterior of a model with versus without a link should be the same, no? In which case one could evaluate, say, the fraction of the posterior (or log posterior) of the more saturated model overlaps with the 95% compatibility interval of the posterior of the sparse model. Which… is actually nice as it’s a general framework for ANY variable type. And we can do this nicely with bayestestR::overlap()
Example:
###
library(brms)
library(palmerpenguins)
library(ggplot2)
library(patchwork)
library(ggplot2)
library(dplyr)
# model with missing links
penguin_bf <-
bf(body_mass_g ~ species) +
bf(flipper_length_mm ~ body_mass_g) +
bf(bill_length_mm ~ body_mass_g) +
set_rescor(FALSE)
penguin_sparse_brm <- brm(penguin_bf,
data = penguins)
# test of conditional independence for species
# on flippers
penguin_flipper_species_bf <-
bf(body_mass_g ~ species) +
bf(flipper_length_mm ~ body_mass_g + species) +
bf(bill_length_mm ~ body_mass_g) +
set_rescor(FALSE)
penguin_flipper_species_brm <-
brm(penguin_flipper_species_bf,
data = penguins)
# check the overlap of log posteriors
bayestestR::overlap( log_posterior(penguin_flipper_species_brm)$Value,
log_posterior(penguin_sparse_brm)$Value)
Would love to have a way to do this without bayestestR to reduce specialty packages. Hrm…
We can also plot to see this
posts <- bind_rows(
log_posterior(penguin_flipper_species_brm) |>
mutate(mod = "flipper_species"),
log_posterior(penguin_sparse_brm) |>
mutate(mod = "original"))
ggplot(posts, aes(x = Value,
fill = mod)) +
geom_density(alpha = 0.5)
Note, we could of course also look at posterior checks to see if there is some oddness - which is helpful - but the test is the thing.
penguin_bf$responses |>
purrr::map(~pp_check(penguin_sparse_brm,
resp = .x) +
labs(title = .x)) |>
purrr::accumulate(`+`)
So, three questions
-
Valid? Yes? No? Better approach here?
1a) Is the posterior the right target here? We could also look at change in residual sigma… although that might not generalize to other distributions, which is why I like looking at the posterior. -
This would have to be done sequentially, as including one path could alter another conditional independence relationship and would influence the posterior. Yes? Although, could a saturated v. unsaturated model provide a global test of conditional independence claims and model structure here? OR
-
I realize this is touching on the territory of Bayes Factors, but, I’m trying to avoid that for now in order to stay in the realm of evaluating probabilities here. And not bringing in the baggage of Bayes Factors, which I quite like, but know can be problematic. LOO is also good here, but, it’s about prediction, and not model structure.
Anyway - would love to know y’alls thoughts.