Linear regression posterior as logistic regression predictor in joint model

I am working on a relatively large dataset regarding the genetics and clinical manifestation of mutations in a certain type of neuronal receptor. It consists of two layers:

  1. For the wildtype as well as several genetic variants, I have measurements from different electrophysiological experiments (such as patch clamp) for this receptor, with a low number of replicates (usually 4).

  2. A cohort of patients, each carrying one of the variants, with a large list of binary clinical phenotypes, i.e. presence or absence (e.g. seizures, cortical blindness, etc.).

I am interested in both the distribution of the functional parameters and their relative influence on the clinical phenotypes.

My model seemed pretty straightforward at first: model the functional measurements with a hierarchical multivariate normal (so I can derive the posterior for the correlation matrix as well as the measurements per variant), compute z-scores as a transformed quantity (to get them on the same scale), and then use these in logistic regressions on the phenotypes to get posteriors for the regression coefficients:

data{

  int<lower=1> nr_patients;
  int<lower=1> nr_experiments;
  int<lower=1> nr_measurements;
  int<lower=1> nr_variants;
  int<lower=1> nr_phenotypes;
  int<lower=1> nr_phenotype_obs;


  // FUNCTIONAL DATA

  // value of functional measurement
  array[nr_measurements] real measurement;

  // indicator for the type of functional experiment
  array[nr_measurements] int<lower=1, upper=nr_experiments> experiment;

  // identifier of variant
  array[nr_measurements] int<lower=1, upper=nr_variants> variant;


  // PHENOTYPE DATA
  
  // phenotype identifier (ataxia, status epilepticus etc.)
  array[nr_phenotype_obs] int<lower=1, upper=nr_phenotypes> phenotype_obs;
  
  // binary phenotype status (0=absent, 1=present)
  array[nr_phenotype_obs] int<lower=0, upper=1> phenotype_status_obs;
  
  // variant ID
  array[nr_phenotype_obs] int<lower=1, upper=nr_variants> variant_obs;


}

parameters {

  // FUNCTIONAL DATA

  // mean of measurement
  matrix[nr_variants, nr_experiments] mu;

  // SD of measurement
  matrix<lower=0>[nr_variants, nr_experiments] sigma;

  // PHENOTYPE DATA
  
  // logistic regression coefficients for phenotype outcome, one for each experiment type plus intercept
  array[nr_phenotypes] vector[nr_experiments] beta_logit;
  array[nr_phenotypes] real intercept_logit;


  // hyperprior mean for the per-variant means
  vector[nr_experiments] mu_bar;

  // scale component of the covariance matrix
  vector<lower=0>[nr_experiments] tau;

  // lower-triangular Cholesky factor of the correlation matrix
  // (the full correlation matrix Omega is reconstructed in generated quantities)
  cholesky_factor_corr[nr_experiments] Lcorr;

}


transformed parameters {

  // z-score-like transformation: number of wild-type standard deviations by which
  // a variant's mean functional measurement deviates from the wild-type mean
  matrix[nr_variants, nr_experiments] z;
  for (v in 1:nr_variants){
    z[v] = (mu[v] - mu[1]) ./ sigma[1];  // NOTE: variant 1 is WT (wild-type)
  }


}

model {


  // FUNCTIONAL DATA

  // hyperpriors for means
  mu_bar ~ normal(0, 1.5);

  // hyperpriors for means covariance matrix
  tau ~ cauchy(0, 5);
  Lcorr ~ lkj_corr_cholesky(1);


  for (v in 1:nr_variants){

    mu[v] ~ multi_normal_cholesky(mu_bar, diag_pre_multiply(tau, Lcorr));

    for (e in 1:nr_experiments){
      sigma[v,e] ~ exponential(1);
    }
  }


  for (m in 1:nr_measurements){

    int v = variant[m];
    int e = experiment[m];

    measurement[m] ~ normal(mu[v,e], sigma[v,e]);

  }

  
  // PHENOTYPE DATA

  // priors
  for (f in 1:nr_phenotypes){

    // standard-normal marginals
    beta_logit[f] ~ normal(0, inv(sqrt(1 - inv(nr_experiments + 1)))) T[-6, 6]; // truncate to deal with separability
    intercept_logit[f] ~ normal(0, inv(sqrt(1 - inv(nr_experiments + 1)))) T[-6, 6];
  }

  // likelihood
  for (i in 1:nr_phenotype_obs){
    int f = phenotype_obs[i];
    int v = variant_obs[i];
    phenotype_status_obs[i] ~ bernoulli_logit(dot_product(beta_logit[f], z[v]) + intercept_logit[f]);
  }
  
}

generated quantities {
  corr_matrix[nr_experiments] Omega;
  Omega = multiply_lower_tri_self_transpose(Lcorr);

}

If I only use the functional part of the model, i.e. comment out the phenotype part, it works like a charm. The measurement posteriors are where I expect them to be, and the correlation matrix makes sense from the way the experiments are related.

However, if I run the joint model, all hell breaks loose: I get huge numbers of divergences, bad R-hat, and functional posteriors that are wildly off and look like different mixture components. I’ve tried adapting adapt_delta, step size, and all that good stuff, but to no avail. I do have issues with separability, which I tried to address in various ways (such as using sum_to_zero_vectors for beta_logit for identifiability; the model above is just one of many similar attempts), but I suspect this alone is not the culprit.

I think part of the reason might be that the functional parameters are informed by the phenotype data as well, whereas in reality there is a strict one-way causal relation between genotype and phenotype. This kind of situation seems to be common, and I have seen multiple threads in this forum related to this. The usual recommendation is to run separate models, and use the functional posterior (linear regression) as input to the phenotype model (logistic regression).

However, there MUST be a way to model this jointly somehow, but I cannot figure out how to get this to work and “shield” the functional parameters from the phenotype observations. So I’m stuck, and would greatly appreciate any ideas and input on the matter :-)

This is called “cut” in the Bayesian inference literature and it’s very commonly applied in pharmacokinetic/pharmacodynamic (PK/PD) models where the PK model is very tight and well supported, but adding the PD model will distort the estimates of the PK parameters unrealistically. The first tack is to try to make a better PD model. Barring that, people use cut to do exactly as you say. Martyn Plummer (author of JAGS) wrote a very nice overview.

There simply may not be a fix that doesn’t involve a lot of innovation on the second model. You can look at how the second model fits for given values of the first model parameters to get a sense of what’s going on.

It is important to make sure everything’s identifiable in the likelihood to the extent possible. Our new sum_to_zero_vector data type has a much better transform and is more robust than doing it yourself.
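For instance, a minimal sketch of what that might look like here (the array layout and the unit prior scale are only illustrative, not a tuned drop-in for this model):

parameters {
  // one sum-to-zero coefficient vector per phenotype (requires a recent Stan release)
  array[nr_phenotypes] sum_to_zero_vector[nr_experiments] beta_logit;
  array[nr_phenotypes] real intercept_logit;
}
model {
  for (f in 1:nr_phenotypes){
    beta_logit[f] ~ normal(0, 1);       // illustrative scale
    intercept_logit[f] ~ normal(0, 1);  // illustrative scale
  }
}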

I would strongly advise you against the T[-6, 6] solution to bound parameters. That’s almost never a good idea in a Bayesian model. If the constraint does something, it’s going to impart bias relative to the unconstrained model. If it doesn’t do anything statistically, it just complicates the posterior geometry by introducing the constraints, which involve evaluating and differentiating CDF functions.

Rather than inv(sqrt(x)) you can write x^-0.5.
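Applied to the prior scale above (with the truncation dropped, per the previous point), that would be, for example:

  beta_logit[f] ~ normal(0, (1 - inv(nr_experiments + 1))^(-0.5));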

I would generally recommend trying to simplify your model first and build up from a non-hierarchical model to something more complicated. I don’t see anything obvious that’s a problem here, but you need to be careful about identifiability. That’s why I’m suggesting starting with something simpler and seeing what it looks like.

P.S. This can be done in one line:

corr_matrix[nr_experiments] Omega = multiply_lower_tri_self_transpose(Lcorr);

For efficiency, it would help if you could replace the repeated dot products in the bernoulli_logit for phenotype status with a matrix multiply. But I’d worry about getting things working first.
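As a rough sketch of what that could look like (this assumes beta_logit is re-declared as a matrix, here called beta_logit_mat with one row per phenotype, which the model above does not have):

model {
  // ... priors as before ...

  // all variant-by-phenotype linear predictors in a single matrix multiply
  matrix[nr_variants, nr_phenotypes] eta = z * beta_logit_mat';
  vector[nr_phenotype_obs] lp;
  for (i in 1:nr_phenotype_obs){
    lp[i] = eta[variant_obs[i], phenotype_obs[i]] + intercept_logit[phenotype_obs[i]];
  }
  phenotype_status_obs ~ bernoulli_logit(lp);  // one vectorized sampling statement
}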


Thanks for your kind reply and the link to the cuts article. I had not heard of this concept previously, and I will check it out!

Indeed, I did use sum-to-zero vectors for exactly the purpose of ensuring identifiability, but that did not help. The reason the intercept is separated out in this model is that in one attempt I tried to constrain only the actual coefficients, so this is a leftover in the version shown here. I did have full-matrix versions as well, but they all suffer from the same problems; some just ran a bit faster. The truncation constraints were one attempt to keep coefficients from drifting towards infinity in perfect-separation cases. But I can see how this can cause issues.

Of course, it would be nice if some future versions of Stan would support that kind of cut in an intuitive fashion, maybe by allowing another model block after generated quantities or something :-) For now, if I were to separate the models, would using my sampled z from the first model suffice as input in the second model, or would there be issues of double-dipping?

P.S.

You can look at how the second model fits for given values of the first model parameters to get a sense of what’s going on.

I am not sure what I would be looking for here. The second model is basically a bunch of straightforward logistic regressions that share some input, but their coefficients are independent. Would you expect some strange effects going on at this level?

Truncation’s not an effective way to do that because the posterior probability mass will pile up at the boundary, which really drags down the computation and introduces bias (negative bias if you hit the upper bound, positive bias if you hit the lower bound). It’s indicating that your data’s compatible with these values.

You can use a continuous prior if you really want to drive down values.
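For example, dropping the truncation and tightening the scale instead (the value here is illustrative) regularizes the coefficients smoothly rather than clipping them:

  beta_logit[f] ~ normal(0, 1);  // no T[-6, 6]; a smaller scale shrinks more strongly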

If you mean using the data twice, then no, that’s not a problem here. It’s just not a fully Bayesian model. And yes, the way to code it now is with multiple imputation.

We are working on a system that will support this in general for the future, but it’ll be at least a year until that rolls through the interfaces—we haven’t even completed the design yet. The idea’s to put a block like generated quantities before the model block (this is very similar functionally to putting it after the generated quantities block—it would’ve worked either way).

I’m just suggesting taking some posterior draws from model 1, then using those fitted values as data constants in model 2. Fit model 2 and look at the result. Then do it for some more posterior draws to get a sense of how sensitive model 2 is to changes in the output parameters from model 1. It may not be worth the effort if they’re very similar.
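For concreteness, a rough sketch of a second-stage model that takes z as data (prior scales are illustrative); refitting it for a set of posterior draws of z and pooling the results is the multiple-imputation approach mentioned above:

data {
  int<lower=1> nr_experiments;
  int<lower=1> nr_variants;
  int<lower=1> nr_phenotypes;
  int<lower=1> nr_phenotype_obs;
  array[nr_phenotype_obs] int<lower=1, upper=nr_phenotypes> phenotype_obs;
  array[nr_phenotype_obs] int<lower=0, upper=1> phenotype_status_obs;
  array[nr_phenotype_obs] int<lower=1, upper=nr_variants> variant_obs;

  // a single posterior draw of z from the functional model, treated as a constant
  matrix[nr_variants, nr_experiments] z;
}
parameters {
  array[nr_phenotypes] vector[nr_experiments] beta_logit;
  array[nr_phenotypes] real intercept_logit;
}
model {
  for (f in 1:nr_phenotypes){
    beta_logit[f] ~ normal(0, 1);       // illustrative prior scale
    intercept_logit[f] ~ normal(0, 1);  // illustrative prior scale
  }
  for (i in 1:nr_phenotype_obs){
    phenotype_status_obs[i] ~ bernoulli_logit(
      dot_product(beta_logit[phenotype_obs[i]], z[variant_obs[i]])
      + intercept_logit[phenotype_obs[i]]);
  }
}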
