GLM coefficient in multivariate model opposite to trend in data

I am building a GLM to model the fraction of infiltrating immune cells in given sections of tumour, with multiple samples taken from different tissues at multiple timepoints from the same patient. To tackle this, I use a negative binomial GLM with a log link function, with a hierarchical patient specific level (a below). The model seems to run okay on the data, with sensible posteriors, no divergences, reasonable Rhats and good ppc_loo_pit plots.

One of the factors that we control for is whether the sample includes sections of the invasive edge of the tumour, as you’d expect these areas to contain a higher proportion of infiltrating immune cells. Bizarrely, the model predicts that samples on the invasive edge contain fewer Lymphocytes! (fig. stan_regression_Lymphocyte) Quickly double checking this in the raw data, whilst the low number of samples that are not on the invasive edge (16/134) prevent any difference reaching significance (p=0.054, Mann-Whitney U), the samples that contain areas of the invasive edge on average contain almost twice as many lyphocytes per tumour cell than those that don’t. (fig. invasive_edge_ratio_violin)

I know that controlling for additional factors can reverse the direction of trends, ala Simpson’s paradox, but the truly odd thing is that when I remove all other factors and just include the hierarchical patient structure and whether a sample includes the invasive edge, the inference indicates the same result! (stan_regression_Lymphocyte_edgeonly)

Please could you help me work out why this hierarchical GLM approach is giving such counter-intuitive results? I’ve also attached the data used in the fitting below.

data {
    int<lower=0> N; // Number of samples
    int<lower=0> S; // Number of patients
    int<lower=0> K; // Number of met groups
    int<lower=0> B; // Number of control coeffs to control for 
    int<lower=1,upper=S> patient[N]; // Which patient a sample belongs to
    int<lower=0,upper=K> group[N]; // Met location (0 colon)
    matrix<lower=0, upper=1>[B, N] control_factors; 
    int<lower=0> y[N];
    int<lower=0> tumour[N];

transformed data{
    real log_tumour[N];
    for (i in 1:N){
        log_tumour[i] = log(tumour[i]);

parameters {
    real mean_a;
    real<lower=0> sigma_a;
    row_vector[B] control_coeffs;
    vector[K] coeffs;
    real<lower=0> sqrt_phi_inv;
    vector[S] a_raw;

transformed parameters{
    real<lower=0> phi;
    vector[S] a;
    vector[N] mu;
    vector[N] mu_scaled;

    phi = 1 / sqrt_phi_inv^2;

    a = mean_a + sigma_a * a_raw;

    for (i in 1:N){
        if (group[i] == 0){
            mu[i] = (a[patient[i]] + 
                        control_coeffs * control_factors[:, i]);
        else {
            mu[i] = (a[patient[i]] + 
                        control_coeffs * control_factors[:, i] + 
        mu_scaled[i] = mu[i] + log_tumour[i]; 

model {
    sigma_a ~ normal(0, 1);
    mean_a ~ normal(0, 1);

    sqrt_phi_inv ~ normal(0, 1); 

    a_raw ~ normal(0, 1);

    control_coeffs ~ normal(0, 1);
    coeffs ~ normal(0, 1);

    y ~ neg_binomial_2_log(mu_scaled, phi);

LymphocyteData.csv (10.4 KB)

How do you derive your “inferred regression coefficients”? If these are just the coefficients, then a null effect should be zero, no? These are all positive.

Apologies for not making that clear, I have taken the exp() if the log-scale coefficients to plot the relative effect of each factor. Hence a null effect exp(0)=1.

Can you show your call to the Stan model (i.e. your script to pass the data)?

Of course, here is my python stcript. I use pystan in Python 3.9.2.

Note that unlike the snippet posted above, I’ve included a generated quantities block to calculate y_hat and the loglikelihood per data point, lp. However, I don’t think this should have any effect on the inference itself. (3.4 KB)

Interestingly, when I run the inference without the hierarchical patient structure and let edge be the only factor, the result is in line with what we expect (fig. stan_regression_Lymphocyte_edge_nopatient.png - samples containing the invasive edge potentially contain more lymphocytes, but not strong evidence). Because of this, I also plotted Edge against Ratio for each individual patient (fig. edge_ratio_perpatient_Lymphocyte.png), although the general trend across the cohort is for Ratio to increase with Edge, within each patient the majority have only nonEdge samples, and of the 5 that have both, 1 is positively correlated, 1 has no relation and 3 have a negative correlation (although numbers are very low). Given the limited numbers involved in the intra-patient comparison, I am surprised the initial inference on Edge was so confident in stan_regression_Lymphocyte.png above.