I am building a GLM to model the fraction of infiltrating immune cells in given sections of tumour, with multiple samples taken from different tissues at multiple timepoints from the same patient. To tackle this, I use a negative binomial GLM with a log link function, with a hierarchical patient specific level (a below). The model seems to run okay on the data, with sensible posteriors, no divergences, reasonable Rhats and good ppc_loo_pit plots.
One of the factors that we control for is whether the sample includes sections of the invasive edge of the tumour, as you’d expect these areas to contain a higher proportion of infiltrating immune cells. Bizarrely, the model predicts that samples on the invasive edge contain fewer Lymphocytes! (fig. stan_regression_Lymphocyte) Quickly double checking this in the raw data, whilst the low number of samples that are not on the invasive edge (16/134) prevent any difference reaching significance (p=0.054, Mann-Whitney U), the samples that contain areas of the invasive edge on average contain almost twice as many lyphocytes per tumour cell than those that don’t. (fig. invasive_edge_ratio_violin)
I know that controlling for additional factors can reverse the direction of trends, ala Simpson’s paradox, but the truly odd thing is that when I remove all other factors and just include the hierarchical patient structure and whether a sample includes the invasive edge, the inference indicates the same result! (stan_regression_Lymphocyte_edgeonly)
Please could you help me work out why this hierarchical GLM approach is giving such counter-intuitive results? I’ve also attached the data used in the fitting below.
data {
int<lower=0> N; // Number of samples
int<lower=0> S; // Number of patients
int<lower=0> K; // Number of met groups
int<lower=0> B; // Number of control coeffs to control for
int<lower=1,upper=S> patient[N]; // Which patient a sample belongs to
int<lower=0,upper=K> group[N]; // Met location (0 colon)
matrix<lower=0, upper=1>[B, N] control_factors;
int<lower=0> y[N];
int<lower=0> tumour[N];
}
transformed data{
real log_tumour[N];
for (i in 1:N){
log_tumour[i] = log(tumour[i]);
}
}
parameters {
real mean_a;
real<lower=0> sigma_a;
row_vector[B] control_coeffs;
vector[K] coeffs;
real<lower=0> sqrt_phi_inv;
vector[S] a_raw;
}
transformed parameters{
real<lower=0> phi;
vector[S] a;
vector[N] mu;
vector[N] mu_scaled;
phi = 1 / sqrt_phi_inv^2;
a = mean_a + sigma_a * a_raw;
for (i in 1:N){
if (group[i] == 0){
mu[i] = (a[patient[i]] +
control_coeffs * control_factors[:, i]);
}
else {
mu[i] = (a[patient[i]] +
control_coeffs * control_factors[:, i] +
coeffs[group[i]]);
}
mu_scaled[i] = mu[i] + log_tumour[i];
}
}
model {
sigma_a ~ normal(0, 1);
mean_a ~ normal(0, 1);
sqrt_phi_inv ~ normal(0, 1);
a_raw ~ normal(0, 1);
control_coeffs ~ normal(0, 1);
coeffs ~ normal(0, 1);
y ~ neg_binomial_2_log(mu_scaled, phi);
}
LymphocyteData.csv (10.4 KB)