I am working on a relatively large dataset regarding the genetics and clinical manifestation of mutations in a certain type of neuronal receptor. It consists of two layers:
- For the wildtype as well as several genetic variants, I have measurements from different electrophysiological experiments (such as patch clamp) for this receptor, with a low number of replicates (usually 4).
- A cohort of patients, each carrying one of the variants, with a large list of binary clinical phenotypes, i.e. presence or absence (e.g. seizures, cortical blindness, etc.).
I am interested in both the distribution of the functional parameters and their relative influence on the clinical phenotypes.
My model seemed pretty straightforward at first: model the functional measurements with a hierarchical multivariate normal (which yields posteriors for the correlation matrix as well as the per-variant measurements), compute z-scores as a transformed quantity (to put them on a common scale), and then use these as predictors in logistic regressions on the phenotypes to get posteriors for the regression coefficients:
data {
  int<lower=1> nr_patients;
  int<lower=1> nr_experiments;
  int<lower=1> nr_measurements;
  int<lower=1> nr_variants;
  int<lower=1> nr_phenotypes;
  int<lower=1> nr_phenotype_obs;
  // FUNCTIONAL DATA
  // value of functional measurement
  array[nr_measurements] real measurement;
  // indicator for the type of functional experiment
  array[nr_measurements] int<lower=1, upper=nr_experiments> experiment;
  // identifier of variant
  array[nr_measurements] int<lower=1, upper=nr_variants> variant;
  // PHENOTYPE DATA
  // phenotype identifier (ataxia, status epilepticus etc.)
  array[nr_phenotype_obs] int<lower=1, upper=nr_phenotypes> phenotype_obs;
  // binary phenotype status (0=absent, 1=present)
  array[nr_phenotype_obs] int<lower=0, upper=1> phenotype_status_obs;
  // variant ID
  array[nr_phenotype_obs] int<lower=1, upper=nr_variants> variant_obs;
}
parameters {
  // FUNCTIONAL DATA
  // mean of measurement
  matrix[nr_variants, nr_experiments] mu;
  // SD of measurement
  matrix<lower=0>[nr_variants, nr_experiments] sigma;
  // hyperprior mean of mu
  vector[nr_experiments] mu_bar;
  // scale component of covariance matrix
  vector<lower=0>[nr_experiments] tau;
  // lower-triangular Cholesky factor of the correlation matrix Omega
  cholesky_factor_corr[nr_experiments] Lcorr;
  // PHENOTYPE DATA
  // logistic regression coefficients for phenotype outcome, one for each experiment type, plus intercept
  array[nr_phenotypes] vector[nr_experiments] beta_logit;
  array[nr_phenotypes] real intercept_logit;
}
transformed parameters {
  // z-score-like transformation: number of wild-type standard deviations that
  // a variant's mean functional measurement deviates from the wild-type mean
  matrix[nr_variants, nr_experiments] z;
  for (v in 1:nr_variants) {
    z[v] = (mu[v] - mu[1]) ./ sigma[1]; // NOTE: variant 1 is WT (wild-type)
  }
}
model {
  // FUNCTIONAL DATA
  // hyperpriors for means
  mu_bar ~ normal(0, 1.5);
  // hyperpriors for the covariance matrix of the means
  tau ~ cauchy(0, 5);
  Lcorr ~ lkj_corr_cholesky(1);
  for (v in 1:nr_variants) {
    mu[v] ~ multi_normal_cholesky(mu_bar, diag_pre_multiply(tau, Lcorr));
    for (e in 1:nr_experiments) {
      sigma[v, e] ~ exponential(1);
    }
  }
  for (m in 1:nr_measurements) {
    int v = variant[m];
    int e = experiment[m];
    measurement[m] ~ normal(mu[v, e], sigma[v, e]);
  }
  // PHENOTYPE DATA
  // priors
  for (f in 1:nr_phenotypes) {
    // standard-normal marginals
    beta_logit[f] ~ normal(0, inv(sqrt(1 - inv(nr_experiments + 1)))) T[-6, 6]; // truncate to deal with separability
    intercept_logit[f] ~ normal(0, inv(sqrt(1 - inv(nr_experiments + 1)))) T[-6, 6];
  }
  for (i in 1:nr_phenotype_obs) {
    int f = phenotype_obs[i];
    int v = variant_obs[i];
    phenotype_status_obs[i] ~ bernoulli_logit(dot_product(beta_logit[f], z[v]) + intercept_logit[f]);
  }
}
generated quantities {
  corr_matrix[nr_experiments] Omega;
  Omega = multiply_lower_tri_self_transpose(Lcorr);
}
If I only use the functional part of the model, i.e. comment out the phenotype part, it works like a charm. The measurement posteriors are where I expect them to be, and the correlation matrix makes sense from the way the experiments are related.
However, if I run the joint model, all hell breaks loose: I get huge numbers of divergences, bad Rhat, and functional posteriors that are wildly off and look like different mixture components. I’ve tried adapting adapt_delta, step size and all that good stuff, but to no avail. I do have issues with separability, which I tried to address in various ways (such as using sum_to_zero_vector for beta_logit for identifiability; the model above is just one of many similar attempts), but I suspect this alone is not the culprit.
I think part of the reason might be that the functional parameters are informed by the phenotype data as well, whereas in reality there is a strict one-way causal relation between genotype and phenotype. This kind of situation seems to be common, and I have seen multiple threads in this forum related to this. The usual recommendation is to run separate models, and use the functional posterior (linear regression) as input to the phenotype model (logistic regression).
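If I understand that recommendation correctly, the second stage would look roughly like the sketch below: fit the functional model on its own, export posterior summaries of z (here hypothetically named z_mean and z_sd), and feed them into a logistic-regression-only model where z enters as data with measurement error, so the phenotype likelihood cannot reach back into the functional fit:

```stan
// Stage 2 only (sketch): z_mean and z_sd are posterior mean and SD of z,
// exported from the functional-only fit. Because they are data here, the
// phenotype observations cannot feed back into the functional posteriors.
data {
  int<lower=1> nr_variants;
  int<lower=1> nr_experiments;
  int<lower=1> nr_phenotypes;
  int<lower=1> nr_phenotype_obs;
  matrix[nr_variants, nr_experiments] z_mean;        // posterior mean of z
  matrix<lower=0>[nr_variants, nr_experiments] z_sd; // posterior SD of z
  array[nr_phenotype_obs] int<lower=1, upper=nr_phenotypes> phenotype_obs;
  array[nr_phenotype_obs] int<lower=0, upper=1> phenotype_status_obs;
  array[nr_phenotype_obs] int<lower=1, upper=nr_variants> variant_obs;
}
parameters {
  // latent z per variant, with first-stage uncertainty as measurement error
  matrix[nr_variants, nr_experiments] z_latent;
  array[nr_phenotypes] vector[nr_experiments] beta_logit;
  array[nr_phenotypes] real intercept_logit;
}
model {
  for (v in 1:nr_variants) {
    z_latent[v] ~ normal(z_mean[v], z_sd[v]);
  }
  for (f in 1:nr_phenotypes) {
    beta_logit[f] ~ normal(0, 1);
    intercept_logit[f] ~ normal(0, 1.5);
  }
  for (i in 1:nr_phenotype_obs) {
    int f = phenotype_obs[i];
    int v = variant_obs[i];
    phenotype_status_obs[i] ~ bernoulli_logit(dot_product(beta_logit[f], z_latent[v]) + intercept_logit[f]);
  }
}
```

This approximates the first-stage posterior of z by independent normals per variant and experiment, which loses the posterior correlations, and the phenotype data can still move z_latent around within that fixed envelope, so it is not a true joint model.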
However, there MUST be a way to model this jointly somehow, but I cannot figure out how to get this to work and “shield” the functional parameters from the phenotype observations. So I’m stuck, and would greatly appreciate any ideas and input on the matter :-)