GLM coefficient in multivariate model opposite to trend in data

Calum13395 · August 13, 2021, 10:50am

I am building a GLM to model the fraction of infiltrating immune cells in given sections of tumour, with multiple samples taken from different tissues at multiple timepoints from the same patient. To tackle this, I use a negative binomial GLM with a log link function, with a hierarchical patient specific level (a below). The model seems to run okay on the data, with sensible posteriors, no divergences, reasonable Rhats and good ppc_loo_pit plots.

One of the factors that we control for is whether the sample includes sections of the invasive edge of the tumour, as you’d expect these areas to contain a higher proportion of infiltrating immune cells. Bizarrely, the model predicts that samples on the invasive edge contain fewer Lymphocytes! (fig. stan_regression_Lymphocyte) Quickly double checking this in the raw data, whilst the low number of samples that are not on the invasive edge (16/134) prevent any difference reaching significance (p=0.054, Mann-Whitney U), the samples that contain areas of the invasive edge on average contain almost twice as many lyphocytes per tumour cell than those that don’t. (fig. invasive_edge_ratio_violin)

I know that controlling for additional factors can reverse the direction of trends, ala Simpson’s paradox, but the truly odd thing is that when I remove all other factors and just include the hierarchical patient structure and whether a sample includes the invasive edge, the inference indicates the same result! (stan_regression_Lymphocyte_edgeonly)

Please could you help me work out why this hierarchical GLM approach is giving such counter-intuitive results? I’ve also attached the data used in the fitting below.

data {
    int<lower=0> N; // Number of samples
    int<lower=0> S; // Number of patients
    int<lower=0> K; // Number of met groups
    int<lower=0> B; // Number of control coeffs to control for 
    int<lower=1,upper=S> patient[N]; // Which patient a sample belongs to
    int<lower=0,upper=K> group[N]; // Met location (0 colon)
    matrix<lower=0, upper=1>[B, N] control_factors; 
    int<lower=0> y[N];
    int<lower=0> tumour[N];
} 

transformed data{
    real log_tumour[N];
    for (i in 1:N){
        log_tumour[i] = log(tumour[i]);
    }
}

parameters {
    real mean_a;
    real<lower=0> sigma_a;
    row_vector[B] control_coeffs;
    vector[K] coeffs;
    real<lower=0> sqrt_phi_inv;
    vector[S] a_raw;
} 

transformed parameters{
    real<lower=0> phi;
    vector[S] a;
    vector[N] mu;
    vector[N] mu_scaled;

    phi = 1 / sqrt_phi_inv^2;

    a = mean_a + sigma_a * a_raw;

    for (i in 1:N){
        if (group[i] == 0){
            mu[i] = (a[patient[i]] + 
                        control_coeffs * control_factors[:, i]);
        }
        else {
            mu[i] = (a[patient[i]] + 
                        control_coeffs * control_factors[:, i] + 
                        coeffs[group[i]]);
        }
        
        mu_scaled[i] = mu[i] + log_tumour[i]; 
    }
}

model {
    sigma_a ~ normal(0, 1);
    mean_a ~ normal(0, 1);

    sqrt_phi_inv ~ normal(0, 1); 

    a_raw ~ normal(0, 1);

    control_coeffs ~ normal(0, 1);
    coeffs ~ normal(0, 1);

    y ~ neg_binomial_2_log(mu_scaled, phi);
}

LymphocyteData.csv (10.4 KB)

jsocolar · August 13, 2021, 2:40pm

How do you derive your “inferred regression coefficients”? If these are just the coefficients, then a null effect should be zero, no? These are all positive.

Calum13395 · August 13, 2021, 2:42pm

Apologies for not making that clear, I have taken the exp() if the log-scale coefficients to plot the relative effect of each factor. Hence a null effect exp(0)=1.

jsocolar · August 13, 2021, 2:45pm

Can you show your call to the Stan model (i.e. your script to pass the data)?

Calum13395 · August 13, 2021, 3:02pm

Of course, here is my python stcript. I use pystan 2.19.1.1 in Python 3.9.2.

Note that unlike the snippet posted above, I’ve included a generated quantities block to calculate y_hat and the loglikelihood per data point, lp. However, I don’t think this should have any effect on the inference itself.

glm_metastasis_example.py (3.4 KB)

Calum13395 · August 13, 2021, 3:11pm

Interestingly, when I run the inference without the hierarchical patient structure and let edge be the only factor, the result is in line with what we expect (fig. stan_regression_Lymphocyte_edge_nopatient.png - samples containing the invasive edge potentially contain more lymphocytes, but not strong evidence). Because of this, I also plotted Edge against Ratio for each individual patient (fig. edge_ratio_perpatient_Lymphocyte.png), although the general trend across the cohort is for Ratio to increase with Edge, within each patient the majority have only nonEdge samples, and of the 5 that have both, 1 is positively correlated, 1 has no relation and 3 have a negative correlation (although numbers are very low). Given the limited numbers involved in the intra-patient comparison, I am surprised the initial inference on Edge was so confident in stan_regression_Lymphocyte.png above.

Topic		Replies	Views
Hierarchical negative binomial - pareto k > 0.5 Modeling fitting-issues , specification , loo	2	700	November 25, 2021
GLM with growth function - problem of indetermiinability when slope = 0 Modeling techniques	12	1649	May 28, 2018
Building a hierarchical multivariate cumulative logit model Modeling fitting-issues	0	212	May 28, 2024
Hierarchical model stubborn after adding multivariate prior Modeling hierarchical-model	4	212	May 9, 2024
Hierarchical overdispersed count models Modeling binomial-response , count-data	2	738	November 10, 2021

GLM coefficient in multivariate model opposite to trend in data

Related topics