Divergences in a cophylogenetic comparative model

I’ve recently gone back to working on a model for cophylogenetic regression that marginalises out the uncertainty in both phylogenies, which I originally posted about on here a while ago. It seems to recover simulated values, but approximately 5% of the post-warmup draws show divergences (in the lower section of the pairs plot, so I can’t just adapt-delta them away)*, and chains occasionally give low E-BFMIs. Obviously, it’s not currently usable. I have another version of this model that doesn’t use the simplex parameterisation, so priors are placed over each individual variance component rather than the total variance, and while I’m still running tests on that version, it doesn’t seem to show the same issues. That would be fine, but for reasons I can’t quite grok, it runs about 10x slower than this implementation, which makes it impractically slow to use on real data.

*the manual says that simplexes are particularly nasty to sample from, so I suspect that’s the cause?
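To illustrate the difference between the two parameterisations for a single component (the prior in the second version is just a placeholder, not necessarily what the other model uses):

//simplex version (this model): one total variance, partitioned by a simplex
mean_virus = sqrt(sigma * sim[1]) * virus_tilde;

//independent version (the slower model): one scale parameter per component,
//each with its own prior, declared e.g. as real<lower=0> sigma_virus;
//with sigma_virus ~ normal(0, 2.5); (half-normal via the constraint)
mean_virus = sigma_virus * virus_tilde;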

The simplex implementation (the one I’ve done considerably more testing on) is shown below. Basically, I was wondering if anyone could see any simple reparameterisations of this model that might help with the divergences? If not, I’ll put up the other version of the model after I’m 100% convinced it really is passing all the tests, and see if anyone can see any obvious speed-ups I could use there.

Thanks for all the help I’ve had with this model so far.

//a comparative model looking at the associations between hosts and viruses
//per Hadfield et al - A Tale of Two Phylogenies marginalising
//phylogenetic uncertainty -- see that paper for descriptions of all terms
//and formulas for calculation of covariance matrices
//code based on and inspired by that of Diogo Melo - originally licensed
//under the MIT Licence 
//github.com/diogro/stanAnimal
//marginalisation of phylogenetic uncertainty implemented per solution
//on Stan forums and latent discrete parameters section of the manual
//discourse.mc-stan.org/t/sampling-over-an-uncertain-variance-covariance-matrix
//pooled binomial likelihood implemented per Thompson 1971
data {
  int<lower=1> datapoints;
  int<lower=1> virusspecies;
  int<lower=1> hostspecies;
  int<lower=1> virushostcombinations;
  int<lower=1> pools;
  int<lower=1> spatialcompositions;
  int<lower=1> matrices;
  int<lower=0, upper=1> run_estimation;
  
  int<lower=0, upper=1> y[datapoints];
  int<lower=1, upper=hostspecies> host[datapoints];
  int<lower=1, upper=virusspecies> virus[datapoints];
  int<lower=1, upper=virushostcombinations> combination[datapoints];
  int<lower=1, upper=pools> pool[datapoints];
  int<lower=1, upper=spatialcompositions> spatialcomposition[datapoints];
  //for OLRE - additive overdispersion
  int<lower=1, upper=datapoints> ID[datapoints];
  int<lower=1> samples[datapoints];
  
  matrix[hostspecies, hostspecies] HostPhy[matrices];
  matrix[virusspecies, virusspecies] VirusPhy[matrices];
  matrix[virushostcombinations, virushostcombinations] HostInter[matrices];
  matrix[virushostcombinations, virushostcombinations] VirusInter[matrices];
  matrix[virushostcombinations, virushostcombinations] CoevoInter[matrices];
}
transformed data{
  real log_unif;
  matrix[hostspecies, hostspecies] CD_HostPhy[matrices];
  matrix[virusspecies, virusspecies] CD_VirusPhy[matrices];
  matrix[virushostcombinations, virushostcombinations] CD_HostInter[matrices];
  matrix[virushostcombinations, virushostcombinations] CD_VirusInter[matrices];
  matrix[virushostcombinations, virushostcombinations] CD_CoevoInter[matrices];
  
  //for constant correction to posterior density
  log_unif = -log(matrices);
  
  //for speed cholesky decompose all covariance matrices
  for (i in 1:matrices) {
    CD_HostPhy[i] = cholesky_decompose(HostPhy[i]);
    CD_VirusPhy[i] = cholesky_decompose(VirusPhy[i]);
    CD_HostInter[i] = cholesky_decompose(HostInter[i]);
    CD_VirusInter[i] = cholesky_decompose(VirusInter[i]);
    CD_CoevoInter[i] = cholesky_decompose(CoevoInter[i]);
  }
}
parameters{  
  //total variance
  real<lower=0> sigma;
  
  //for partitioning of total variance
  simplex[11] sim;
  
  //non-centering
  vector[virusspecies] virus_tilde;
  vector[hostspecies] host_tilde;
  vector[virushostcombinations] inter_tilde;
  vector[pools] pool_tilde;
  vector[spatialcompositions] spatial_tilde;
  vector[datapoints] residual_tilde;
  vector[virusspecies] vphy_tilde;
  vector[hostspecies] hphy_tilde;
  vector[virushostcombinations] vinter_tilde;
  vector[virushostcombinations] hinter_tilde;
  vector[virushostcombinations] coevointer_tilde;
  
  //posterior latent mean
  real alpha; 
}
transformed parameters{
  //predicted probability for each datapoint for each matrix
  vector[datapoints] predprob[matrices];
  vector[datapoints] theta[matrices];
  
  //predicted means for each partially pooled effect
  vector[virusspecies] mean_virus;
  vector[hostspecies] mean_host;
  vector[virushostcombinations] mean_inter;
  vector[pools] mean_pool;
  vector[spatialcompositions] mean_spatial;
  vector[datapoints] mean_residual;
  vector[virusspecies] mean_virusphy[matrices];
  vector[hostspecies] mean_hostphy[matrices];
  vector[virushostcombinations] mean_virusinter[matrices];
  vector[virushostcombinations] mean_hostinter[matrices];
  vector[virushostcombinations] mean_coevointer[matrices];
  
  //vector for calculating marginalised log likelihoods
  vector[matrices] lp;
  
  //add exact correction for marginalisation
  lp = rep_vector(log_unif, matrices);
  
  //non-centred calculation of means for non-phylogenetically associated effects
  mean_virus = sqrt(sigma * sim[1]) * virus_tilde;
  mean_host = sqrt(sigma * sim[2]) * host_tilde;
  mean_inter = sqrt(sigma * sim[3]) * inter_tilde;
  mean_pool = sqrt(sigma * sim[4]) * pool_tilde;
  mean_spatial = sqrt(sigma * sim[5]) * spatial_tilde;
  mean_residual = sqrt(sigma * sim[6]) * residual_tilde;
  
  //non-centered calculation of means for phylogenetically associated effects
  for (i in 1:matrices) {
    mean_virusphy[i] = sqrt(sigma * sim[7]) * (CD_VirusPhy[i] * vphy_tilde);
    mean_hostphy[i] = sqrt(sigma * sim[8]) * (CD_HostPhy[i] * hphy_tilde);
    mean_virusinter[i] = sqrt(sigma * sim[9]) * (CD_VirusInter[i] * vinter_tilde);
    mean_hostinter[i] = sqrt(sigma * sim[10]) * (CD_HostInter[i] * hinter_tilde);
    mean_coevointer[i] = sqrt(sigma * sim[11]) * (CD_CoevoInter[i] * coevointer_tilde);
  }
  
  //precalculate expected probability for each data point
  for (i in 1:datapoints) {
    for (j in 1:matrices) {
      //predprob saved for later posterior predictive simulations
      predprob[j,i] = inv_logit(alpha + mean_virus[virus[i]] +
      mean_host[host[i]] + mean_inter[combination[i]] +
      mean_virusphy[j,virus[i]] + mean_hostphy[j,host[i]] +
      mean_virusinter[j,combination[i]] + mean_hostinter[j,combination[i]] +
      mean_coevointer[j,combination[i]] + mean_pool[pool[i]] +
      mean_spatial[spatialcomposition[i]] + mean_residual[ID[i]]);
      theta[j,i] = 1 - (1 - predprob[j,i])^samples[i];
    }
  }
  
  //calculate model likelihood marginalising over the phylogenetic uncertainty
  if (run_estimation==1) {
    for (i in 1:matrices) {
        //likelihood of at least 1 success in n trials with underlying
        //probability of success theta is modified Bernoulli with success
        //probability = 1-(1-p)^n -- quantity precalculated above in theta
        //see Thompson 1971 Biometrics
        lp[i] = lp[i] + bernoulli_lpmf(y | theta[i]);
    }
  }
}
model {
  //logistic prior over global mean - uniform on the probability scale
  target += logistic_lpdf(alpha | 0,1);
  
  //normal(0,1) priors over variables used for non-centering
  target += normal_lpdf(vphy_tilde | 0,1);
  target += normal_lpdf(hphy_tilde | 0,1);
  target += normal_lpdf(vinter_tilde | 0,1);
  target += normal_lpdf(hinter_tilde | 0,1);
  target += normal_lpdf(coevointer_tilde | 0,1);
  target += normal_lpdf(virus_tilde | 0,1);
  target += normal_lpdf(host_tilde | 0,1);
  target += normal_lpdf(inter_tilde | 0,1);
  target += normal_lpdf(pool_tilde | 0,1);
  target += normal_lpdf(spatial_tilde | 0,1);
  target += normal_lpdf(residual_tilde | 0,1);
  
  //prior over global variance - prior mean 6.25 (implying prior mean standard
  //deviation of 2.5) - approximately 99% of pdf below 25 (implying 99% of pdf
  //below standard deviation of 5)
  target += gamma_lpdf(sigma | 1.33,0.2128);
  
  //dirichlet prior over simplex
  target += dirichlet_lpdf(sim | rep_vector(1, 11));
  
  //increment the log posterior by the summed likelihood
  if (run_estimation==1) {
    target += log_sum_exp(lp);
  }
}
generated quantities {
  //define variance parameters
  real<lower=0> sigma_virus;
  real<lower=0> sigma_virusphy;
  real<lower=0> sigma_host;
  real<lower=0> sigma_hostphy;
  real<lower=0> sigma_virusinter;
  real<lower=0> sigma_hostinter;
  real<lower=0> sigma_inter;
  real<lower=0> sigma_coevointer;
  real<lower=0> sigma_pool;
  real<lower=0> sigma_spatial;
  real<lower=0> sigma_residual;
  real<lower=0> denominator;
  
  //define ICC parameters
  real<lower=0, upper=1> ICC_virus;
  real<lower=0, upper=1> ICC_virusphy;
  real<lower=0, upper=1> ICC_host;
  real<lower=0, upper=1> ICC_hostphy;
  real<lower=0, upper=1> ICC_virusinter;
  real<lower=0, upper=1> ICC_hostinter;
  real<lower=0, upper=1> ICC_inter;
  real<lower=0, upper=1> ICC_coevointer;
  real<lower=0, upper=1> ICC_pool;
  real<lower=0, upper=1> ICC_spatial;
  real<lower=0, upper=1> ICC_residual;
  real<lower=0, upper=1> ICC_nonphylogenetic;
  real<lower=0, upper=1> ICC_phylogenetic;

  //define parameters for data simulation
  int<lower=0, upper=1> y_sim[datapoints];
  
  //define parameters for model comparison 
  vector[matrices] uncertainty_log_lik;
  vector[datapoints] log_lik;
  real<lower=0> calc;
  
  //initialise calc for the case where matrices == 1, where it
  //would otherwise be undefined
  calc = 0;
  
  //generate implied variance explained by each model component
  sigma_virus = sigma * sim[1];
  sigma_host = sigma * sim[2];
  sigma_inter = sigma * sim[3];
  sigma_pool = sigma * sim[4];
  sigma_spatial = sigma * sim[5];
  sigma_residual = sigma * sim[6];
  sigma_virusphy = sigma * sim[7];
  sigma_hostphy = sigma * sim[8];
  sigma_virusinter = sigma * sim[9];
  sigma_hostinter = sigma * sim[10];
  sigma_coevointer = sigma * sim[11];
  //denominator of ICC calculation equal to total variance plus 
  //variance of the logistic distribution
  denominator = sigma + (pi()^2)/3;
  
  //generate proportion of variance explained by each model component as well as
  //all phylogenetically and all non-phylogenetically associated effects
  ICC_virus = sigma_virus/denominator;
  ICC_virusphy = sigma_virusphy/denominator;
  ICC_host = sigma_host/denominator;
  ICC_hostphy = sigma_hostphy/denominator;
  ICC_virusinter = sigma_virusinter/denominator;
  ICC_hostinter = sigma_hostinter/denominator;
  ICC_inter = sigma_inter/denominator;
  ICC_coevointer = sigma_coevointer/denominator;
  ICC_pool = sigma_pool/denominator;
  ICC_spatial = sigma_spatial/denominator;
  ICC_residual = sigma_residual/denominator;
  ICC_nonphylogenetic = (sigma_virus + sigma_host + sigma_inter +
  sigma_spatial + sigma_pool)/denominator;
  ICC_phylogenetic = (sigma_virusphy + sigma_hostphy + sigma_virusinter +
  sigma_hostinter + sigma_coevointer)/denominator;

  //generate pointwise log likelihoods for PSIS-LOO model comparison
  //see Vehtari, Gelman and Gabry 2017
  //the marginal likelihood of the full data is the log_sum_exp over
  //matrices of (log uniform density + log likelihood under that matrix),
  //so the marginal log likelihood of a single data point is the
  //corresponding log_sum_exp over that point's per-matrix log likelihoods
  for (i in 1:datapoints) {
    for (j in 1:matrices) {
      //theta already incorporates the Thompson 1971 pooling correction
      //(success probability = 1-(1-p)^n, precalculated above), so the
      //per-matrix log likelihood of point i is simply the Bernoulli lpmf
      uncertainty_log_lik[j] = bernoulli_lpmf(y[i] | theta[j,i]);
    }
    //log_sum_exp (including the log uniform weight) to get the
    //marginal log likelihood of point i
    log_lik[i] = log_sum_exp(log_unif + uncertainty_log_lik);
  }
  
  //generate data simulations, using infection probability weighted by posterior
  //probability of each tree (VCV) in the phylogenetic uncertainty case
  if (matrices == 1) {
    for(i in 1:datapoints) {
      y_sim[i] = bernoulli_rng(theta[1,i]);
    }
  } else {
    for(i in 1:datapoints) {
      //reset the weighted average probability for each datapoint
      calc = 0;
      for (j in 1:matrices) {
        //weight each matrix's predicted probability by that matrix's
        //posterior probability
        calc = calc + predprob[j,i] * exp(lp[j] - log_sum_exp(lp));
      }
      y_sim[i] = bernoulli_rng(1 - (1 - calc)^samples[i]);
    }
  }
}

Have you tried using a stronger prior for the simplex? If that doesn’t work, it may be worth trying a softmax version.
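For concreteness, a minimal sketch of the softmax idea (note that softmax is invariant to adding a constant to the raw vector, so it needs a constraint, e.g. pinning one element at zero, or an informative prior to be well identified):

parameters {
  //unconstrained weights in place of simplex[11] sim
  vector[11] sim_raw;
}
transformed parameters {
  //softmax maps the raw weights onto the simplex
  simplex[11] sim = softmax(sim_raw);
}
model {
  //replaces the Dirichlet prior; a normal prior on sim_raw
  //implies a logistic-normal prior on sim
  sim_raw ~ normal(0, 1);
}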

I don’t know why I didn’t think of trying the softmax version. I’ll give that a go now, thanks.

This could be really great if you get it to work. I’m writing up a manuscript on a (somewhat) similar model right now, and finding that you’ve done this is exciting on one hand, and frustrating on the other (because redundant efforts are no fun and because you’ve incorporated some neat tricks that I’m wishing I had done!).

I have two ideas about your divergences. The first is that the different terms in the Hadfield model have the potential to be highly collinear - the ‘effects’ of tips in the phylogeny with MRCAs near the root are essentially redundant among the various interaction terms (‘coevolutionary’, ‘non-phylogenetic specificity’, ‘host evolutionary interaction’, etc.). I think this has the potential to create a highly multimodal posterior, which could lead to divergences during transitions between modes. I had identified this potential problem when I first started writing my own model, and was planning to implement something where the redundant effects are collapsed to a single effect whose variance is instead defined based on sums of the various VCVs, roughly as in the sketch below.
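The idea works because a sum of independent zero-mean Gaussian effects with covariances v_k * K_k is equivalent to a single Gaussian effect with covariance sum_k v_k * K_k. A hypothetical sketch (all names are illustrative, and this costs a Cholesky decomposition per evaluation unless the weights are fixed):

//collapse the interaction terms into a single effect whose covariance is
//the variance-weighted sum of the individual VCVs
//(v_virusinter, v_hostinter, v_coevointer are per-term variances;
//allinter_tilde is a standard-normal vector like the other *_tilde parameters)
matrix[virushostcombinations, virushostcombinations] K
  = v_virusinter * VirusInter[1] + v_hostinter * HostInter[1]
    + v_coevointer * CoevoInter[1];
vector[virushostcombinations] mean_allinter
  = cholesky_decompose(K) * allinter_tilde;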

The other is related to the marginalization of the various input phylogenies. I also tried this early on. I’m not as sure about this, but I think the sharing of parameters among the various marginalized phylogenies may also contribute to a strongly multimodal posterior. For instance, the cophylogenetic effect of Virus A in Host B could have strong prior probability according to half of your trees but be extremely unlikely in the other half. I wound up going with the run-the-model-many-times-and-combine-results-afterwards approach, but it’s only scalable for my dataset to about ten draws from the host phylogeny and a compromise of a single ML tree for the microbes. So another compromise that I made was to allow the branch lengths of the tree to vary during the model fitting - although it doesn’t consider completely alternate topologies, it incorporates a bit more uncertainty into the model.


Also, you may run into numerical issues when using inv_logit - that could possibly be changed to something like:

predlogprob[j,i] = log_inv_logit(…);
loginvtheta[j,i] = log1m_exp(predlogprob[j,i]) * samples[i];
lp[j] = lp[j] + bernoulli_logit_lpmf(y | log1m_exp(loginvtheta[j]) - loginvtheta[j]);

You can then calculate theta itself in generated quantities for the rest of the calculations that require it, or change those calculations similarly.
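For concreteness, a minimal sketch of how that might slot into the transformed parameters block, assuming log_theta and log1m_theta are declared in place of predprob and theta (log1m_inv_logit(x) computes log(1 - inv_logit(x)) directly):

vector[datapoints] log_theta[matrices];
vector[datapoints] log1m_theta[matrices];

for (i in 1:datapoints) {
  for (j in 1:matrices) {
    //log(1 - p) computed directly from the linear predictor on the log scale
    real log1m_p = log1m_inv_logit(alpha + mean_virus[virus[i]] +
      mean_host[host[i]] + mean_inter[combination[i]] +
      mean_virusphy[j,virus[i]] + mean_hostphy[j,host[i]] +
      mean_virusinter[j,combination[i]] + mean_hostinter[j,combination[i]] +
      mean_coevointer[j,combination[i]] + mean_pool[pool[i]] +
      mean_spatial[spatialcomposition[i]] + mean_residual[ID[i]]);
    //log(1 - theta) = samples[i] * log(1 - p)
    log1m_theta[j,i] = samples[i] * log1m_p;
    //log(theta) = log1m_exp(log(1 - theta)), stable when theta is near 0
    log_theta[j,i] = log1m_exp(log1m_theta[j,i]);
  }
}

//and in place of bernoulli_lpmf(y | theta[i]):
for (j in 1:matrices) {
  for (i in 1:datapoints) {
    lp[j] += y[i] == 1 ? log_theta[j,i] : log1m_theta[j,i];
  }
}

The final loop accumulates the Bernoulli log likelihood directly from the log probabilities, so nothing ever gets exponentiated back to the constrained scale.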

I did eventually get it going with no divergences at a reasonable speed. I never solved the simplex implementation, but through various tricks I sped up the non-centered hierarchical parameterisation (which was never giving any divergences) enough for it to be usable. Due to speed, though, I only managed to run it marginalising over 10 host and 10 virus phylogenies from the posterior; the combinatorial increase in compute time you get if you increase the number of host and virus phylogenies at the same rate was a killer (10 × 10 already means 100 tree-pair likelihood evaluations per iteration, and doubling both would mean 400). Though that’s a problem that, if you have access to a cluster, should hopefully cease being an issue now that Stan can do parallel computing; the marginalisation is easily parallelisable, since each tree pair’s likelihood can be evaluated independently. I’d definitely noticed partial non-identification between some of the terms, but despite the fact that some of our phylogenies are really uncertain and the uncertainty is being marginalised, the posterior seems to be driven mostly by the data in our case. The paper is planned to go up on bioRxiv at some point before Christmas. But the model itself is here, under CompleteCophylogeneticNC.stan.

Awesome!

How’s the memory load, and what size of phylogenies are you working with?

They’re really small, which I guess is an advantage and a curse. My data’s at the individual rather than the species level. So it’s 12 species for the hosts, 18 for the viruses, as I remember. But that’s larger than the size of the phylogenies Jarrod originally validated the method on. We had limited ability to estimate the non-interaction variances though, because the group sizes are so small, so some of our posteriors are pretty diffuse, but the model did converge and there are clear large effects with sensible priors, so it’s a win on that front.

But yeah, I’m pretty happy with how it worked out computationally. I can run the analysis on my laptop, which has 16GB of memory, with no issues and plenty of room to spare (but I didn’t, because the run time is still annoying with the marginalisation). For that kind of data size, it’s fine. Because (as far as I know) Stan can’t handle sparse matrices, it really falls down on larger datasets. I can’t even get the non-sparse version of Jarrod’s original analysis to load into memory to test, whereas that’s not an issue with the sparse MCMCglmm version. I know Jarrod’s dataset was taking about 3 months to run in MCMCglmm on whatever he was running it on, but it’s just not feasible to do the analogous analysis here at the moment.

I don’t have time at the moment to reply in detail (but this is all awesome so I’ll get back to it soon!) - but for now, yes, Stan has sparse matrix methods! Although, oddly, they didn’t reduce runtime or memory at all for me.

I actually avoided the Kronecker products entirely, though, by instead sampling the parameters in a matrix rather than a vector, and then multiplying them by their ancestries on both dimensions. It still seems to me that a single multiplication step with the sparse Kronecker should have been faster and better for memory, but for whatever reason, it wasn’t.
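In case it’s useful, a minimal sketch of that trick using the names from the model above (and assuming the coevolutionary covariance is the Kronecker product of the host and virus phylogenetic covariances): if Z is a matrix of iid standard normals, then vec(L_host * Z * L_virus') has covariance equal to the Kronecker product of (L_virus * L_virus') and (L_host * L_host'), so the Kronecker product never has to be formed explicitly.

parameters {
  //one standard normal per host-virus pair, stored as a matrix
  matrix[hostspecies, virusspecies] Z;
}
transformed parameters {
  //row i, column j holds the coevolutionary effect for host i with virus j
  matrix[hostspecies, virusspecies] coevo_effect
    = CD_HostPhy[1] * Z * CD_VirusPhy[1]';
}
model {
  to_vector(Z) ~ normal(0, 1);
}

One would then pick out the entries corresponding to the observed host-virus combinations, rather than building the full combination-by-combination VCV.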