Overcoming Simpson's paradox in a multivariate linear model with within- and between-site gradients and repeated measurements

Data: Let’s assume a dataset where a variable Y was measured multiple times at several sites. Each site is positioned along a gradient of condition X1. Within each site, data were collected along a gradient X2. The relationships of Y with X1 and with X2 are shown in the following graph.

Objective: The objective is to model Y as a function of X1 and X2 while accounting for site: Y = f(X1, X2, site). The assumption is that the relationship of interest between Y and X1 is the one across sites (that is, along the gradient used for picking the sites; see the black line on panel A of the figure), not the one that may arise within sites because of the repeated measurements at each site (an apparent Simpson’s paradox). In other words, for a given site, the variation in X1 is just noise, and we assume X1 and X2 are neither causally related nor statistically correlated. In contrast, the relationship of interest for X2 is the one occurring at the site level (colored lines on panel B of the figure).
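To make the setup concrete, such data could be simulated along these lines (a minimal sketch; all parameter values and variable names here are hypothetical, chosen only so that the between-site trend in X1 and the within-site trend in X2 are both visible):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values, for illustration only
n_site, n_obs = 8, 20            # number of sites, observations per site
alpha1, beta1 = 2.0, -1.0        # site-level intercept and slope (on X1)
beta2 = 0.5                      # within-site slope (on X2)
sigma_alphaS, sigma = 0.3, 0.2   # site-level and observation-level sd

x1_site = np.linspace(0, 1, n_site)                          # gradient used to pick the sites
alphaS = rng.normal(alpha1 + beta1 * x1_site, sigma_alphaS)  # site intercepts

site = np.repeat(np.arange(n_site), n_obs)                 # site index of each observation
x1 = x1_site[site] + rng.normal(0, 0.05, n_site * n_obs)   # X1 varies only as noise within a site
x2 = np.tile(np.linspace(0, 1, n_obs), n_site)             # within-site gradient
y = alphaS[site] + beta2 * x2 + rng.normal(0, sigma, n_site * n_obs)
```

Plotting y against x1 pooled versus per site would then show a clear across-site trend but essentially noise within each site, as in panel A.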

Problem: Modeling Y as a function of X2 can be done with a classical multilevel model in which the intercept (and possibly the slope) varies by site. However, it is not as straightforward for Y against X1: if we use a multilevel model with a varying intercept (and possibly slope) by site, we lose the relationship of interest along the gradient of sites. I see two imperfect solutions if I were only considering Y and X1: 1) ignore site to capture the relationship of interest, but this model would be misspecified because points taken at a given site are not independent; or 2) use the mean of X1 for each site (possibly propagating its sd in the model), but this entails a loss of information, as we would not make full use of the data. When considering a model of Y using both X1 and X2, only solution 2 above would work. I could also discretize X2 and then model Y against X1 with a varying intercept by X2 category.

Question: Are there other ways to model Y with X1 and X2 while accounting for site and keeping all raw data?

Start by writing down your desired model for the data within a site conditional on the site mean. Then add a hierarchical prior for the site mean, but let this hierarchical prior have the form of a linear regression of the site mean against the covariates of interest at the between-site level.


Oh, I see… it makes so much sense! So the model could be written this way, for example:

\begin{aligned} y_{i} &\sim \text{Normal}(\mu_{i}, \sigma)\\ \mu_{i} &= \alpha_{\text{site}[i]} + \beta_2 x_{2i}\\ \alpha_{\text{site}[i]} &\sim \text{Normal}(\nu_{i}, \sigma_{\alpha})\\ \nu_i &= \alpha_1 + \beta_1 x_{1i}\\ \alpha_{1}, \beta_2, \beta_1 &\sim \text{Normal}(\ldots)\\ \sigma, \sigma_{\alpha} &\sim \text{Exponential}(\ldots) \end{aligned}

I leave the prior details empty as I don’t have time to bother simulating priors on this fake data ;-)

Hi @jsocolar, I am finally fully testing this relatively simple version as a step toward building an even more complex model, and I was wondering if my calculations of y_rep and log_lik are correct.

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> N_site;                       // number of sites
  array[N] int<lower=1, upper=N_site> site;  // site index of each observation
  vector[N] y;                               // response
  vector[N] x1;                              // between-site gradient (noisy within a site)
  vector[N] x2;                              // within-site gradient
}

parameters {
  real alpha1;                  // intercept of the site-level regression
  real beta1;                   // between-site slope (on x1)
  real<lower=0> sigma_alphaS;   // sd of site intercepts around the site-level regression
  real beta2;                   // within-site slope (on x2)
  vector[N_site] alphaS;        // site intercepts
  real<lower=0> sigma;          // observation-level sd
}

model {
  alpha1 ~ std_normal();
  beta1 ~ std_normal();
  sigma_alphaS ~ exponential(1);
  beta2 ~ std_normal();
  // site-level regression on the raw x1 values: each observation contributes
  // one density term for its site's intercept
  alphaS[site] ~ normal(alpha1 + beta1 * x1, sigma_alphaS);
  sigma ~ exponential(1);
  y ~ normal(alphaS[site] + beta2 * x2, sigma);
}

generated quantities {
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_rep = normal_rng(alphaS[site] + beta2 * x2, sigma);
  
  // pointwise log-likelihood
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alphaS[site[i]] + beta2 * x2[i], sigma);
  }
  
}

Should the log_lik be something like the following instead? I got this idea from reading this post.

  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alpha1 + beta1 * x1[i] + beta2 * x2[i], sqrt(sigma^2 + sigma_alphaS^2));
  }