Overcoming Simpson's paradox in a multivariate linear model with within- and between-site gradients and repeated measurements

Data: Let’s assume a dataset where a variable Y was measured multiple times at several sites. Each site is positioned along a gradient of condition X1. Within each site, data were collected along a gradient X2. The relationships of Y with X1 and with X2 are shown in the following graph.

Objective: The objective is to model Y as a function of X1 and X2 while accounting for site: Y = f(X1, X2, site). The assumption is that the relationship of interest between Y and X1 is the one across sites (that is, along the gradient used for picking the sites; see the black line on panel A of the figure), not the one that may arise within sites because of the repeated measurements at each site (an apparent Simpson’s paradox). In other words, for a given site, the variation in X1 is just noise, and we assume X1 and X2 are neither causally related nor statistically correlated. In contrast, the relationship of interest for X2 is the one occurring at the site level (colored lines on panel B of the figure).
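To make the setup concrete, such data could be simulated along these lines (a minimal sketch; all parameter values and variable names here are hypothetical, chosen only so that the between-site trend in X1 and the within-site trend in X2 are both visible):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical values, for illustration only
n_site, n_obs = 8, 20            # number of sites, observations per site
alpha1, beta1 = 2.0, -1.0        # site-level intercept and slope (on X1)
beta2 = 0.5                      # within-site slope (on X2)
sigma_alphaS, sigma = 0.3, 0.2   # site-level and observation-level sd

x1_site = np.linspace(0, 1, n_site)                          # gradient used to pick the sites
alphaS = rng.normal(alpha1 + beta1 * x1_site, sigma_alphaS)  # site intercepts

site = np.repeat(np.arange(n_site), n_obs)                 # site index of each observation
x1 = x1_site[site] + rng.normal(0, 0.05, n_site * n_obs)   # X1 varies only as noise within a site
x2 = np.tile(np.linspace(0, 1, n_obs), n_site)             # within-site gradient
y = alphaS[site] + beta2 * x2 + rng.normal(0, sigma, n_site * n_obs)
```

Plotting y against x1 pooled versus per site would then show a clear across-site trend but essentially noise within each site, as in panel A.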

Problem: Modeling Y as a function of X2 can be done with a classical multilevel model in which the intercept (and possibly the slope) varies by site. However, it is not as straightforward for Y against X1: if we use a multilevel model with a varying intercept (and possibly slope) by site, we lose the relationship of interest along the gradient of sites. I see two imperfect solutions if I were only considering Y and X1: 1) ignore site to capture the relationship of interest, but this model would be misspecified because points taken at a given site are not independent; or 2) use the mean of X1 for each site (possibly propagating its sd in the model), but this entails a loss of information, as we would not make full use of the data. When considering a model of Y using both X1 and X2, only solution 2 above would work. I could also discretize X2 and then model Y against X1 with a varying intercept by X2 category.

Question: Are there other ways to model Y with X1 and X2 while accounting for site and keeping all raw data?

Start by writing down your desired model for the data within a site conditional on the site mean. Then add a hierarchical prior for the site mean, but let this hierarchical prior have the form of a linear regression of the site mean against the covariates of interest at the between-site level.


Oh, I see… it makes so much sense! So the model could be written this way, for example:

\begin{aligned} y_{i} &\sim \text{Normal}(\mu_{i}, \sigma)\\ \mu_{i} &= \alpha_{\text{site}[i]} + \beta_2 x_{2i}\\ \alpha_{\text{site}[i]} &\sim \text{Normal}(\nu_{i}, \sigma_{\alpha})\\ \nu_i &= \alpha_1 + \beta_1 x_{1i}\\ \alpha_{1}, \beta_2, \beta_1 &\sim \text{Normal}(\ldots)\\ \sigma, \sigma_{\alpha} &\sim \text{Exponential}(\ldots) \end{aligned}

I leave the prior details empty as I don’t have time to bother simulating priors on this fake data ;-)

Hi @jsocolar, I am finally fully testing this relatively simple version as a step toward building an even more complex model, and I was wondering if my calculations of y_rep and log_lik are correct.

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> N_site;                       // number of sites
  array[N] int<lower=1, upper=N_site> site;  // site index of each observation
  vector[N] y;                               // response
  vector[N] x1;                              // between-site gradient (noisy within a site)
  vector[N] x2;                              // within-site gradient
}

parameters {
  real alpha1;                  // intercept of the site-level regression
  real beta1;                   // between-site slope (on x1)
  real<lower=0> sigma_alphaS;   // sd of site intercepts around the site-level regression
  real beta2;                   // within-site slope (on x2)
  vector[N_site] alphaS;        // site intercepts
  real<lower=0> sigma;          // observation-level sd
}

model {
  alpha1 ~ std_normal();
  beta1 ~ std_normal();
  sigma_alphaS ~ exponential(1);
  beta2 ~ std_normal();
  // site-level regression on the raw x1 values: each observation contributes
  // one density term for its site's intercept
  alphaS[site] ~ normal(alpha1 + beta1 * x1, sigma_alphaS);
  sigma ~ exponential(1);
  y ~ normal(alphaS[site] + beta2 * x2, sigma);
}

generated quantities {
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_rep = normal_rng(alphaS[site] + beta2 * x2, sigma);
  
  // pointwise log-likelihood
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alphaS[site[i]] + beta2 * x2[i], sigma);
  }
  
}

Should the log_lik be something like the following instead? I got this idea from reading this post.

  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alpha1 + beta1 * x1[i] + beta2 * x2[i], sqrt(sigma^2 + sigma_alphaS^2));
  }