Multiple outcome model: conflicting LOO results?

I have quite a complicated model (which I can provide additional information for as well as a reprex), but essentially it has two outcomes that it ouputs from a single dataset.

The dataset has a not too small (~1000) number of competition experiments between two bacteria. Basically, there are two growth curves as both bacteria grow in the same test tube, and we can use those growth curves to estimate who is out-competing who. Each competition has 16 timepoints, so the actual number of rows in the full dataset is actualy ~16,000.

What the stan program does:

  • Estimate competition outcomes using an ODE solver, to report the relative competitive strength of both bacteria in each experiment (the estimand is 1000 vectors of length 2: two competition coefficients for each competitor, for each experiment).
  • Since we know the genotype of each of the bacteria, we break down the competition coefficients inferred from the ODEs into the genotypic diffferences between the bacteria. So for example, the effect of gene A on competitive ability gets estimated from every bacteria that has gene A in any of the 1000 competitions.

Thus we can notate the two outcomes in this way:
For raw growth curves of two bacteria y_obs

  1. y_obs ~ N(mu,sigma), where mu was estimated using ode_rk45

Then we have the competition coefficients s, a data of type array[1000] vector[2], which are parameters estimated by the numerical integrator used to get mu above. These are broken down into the effects of the genes which we call the ‘costs’ of the genes, and the outcome is modelled as a bivariate normal:
2. s ~ multi_normal_cholesky(costs,L_Cov), where costs is a vector with the effects of each gene, and L_Cov is the covariance matrix (cholesky’d of course)

Now I have been interested in doing some model comparison, where the noise for outcome 1 is modelled as lognormal rather than normal, for example. So I made two scripts and ran them both on my data, and I have two log-lik matrices:

  1. for outcome 1, I have 16,000 datapoints for which I calculate log likelihood
  2. for outcome 2, I have 1000 datapoints for which I calculate log likelihood but each datapoint is a parameter extracted from a competition experiment (16 datapoints, or 32 since the performance of a bacteria is estimated together with its competitor).

In the end I am a little more interested about the performance for expaining outcome 2: I am interested in genetic differences. But when I ran the two models, I got this funny result where, lognormally distributed errors fit the ODE outcome better (LOO scores are much better for lognormal rather than normal noise in outcome 1), but the model for the normal noise for outcome 1 performs better on outcome 2 (though they actually both perform rather badly on that one for some reason).

Here are the loo comparison for outcome 1, comparing lognormal to normal errors in the ode outcome
Lognormal first one, normal second one

Here are the loo comparison for outcome 2, comparing again the same models
Now it’s the normal that’s the first and lognormal second (lognormal is worse)

Note there are 30 genes, so the p_loo is very inflated for both models.

I found that when I simulate from this data generating process I get also bad loo scores for outcome 1 even though I simulate from the exact data generating process that I use to write the model. But the estimates themselves are good; I think my simulations may be underdetermined because I use very little data points when I fit simulations (my actual data takes hours to run each model, the model is large and slow as you can imagine).

As I mentioned, I can provide the code and reprex happily, it’s just sunday evening here and I wanted to get the question out and hopefully follow up if someone is interested enough to take a dive into this rather large complicated model. I would be in any case most happy with any advice generic or otherwise that I missed that would be relevant here. Like, can I do the joint likelihood, even though one log-lik matrix is way bigger than another, and if I care about one slightly more than the other?

Thanks and good week to all… may all your stories eventually fit your data

Seeing the Stan code would help to clarify some things

This is clear.

This is not clear, as it seems like s is not actually data (declared in data block) but transformed parameter (declared in transformed parameters block or in model block). It is not clear what cross-validation would mean in this case.

If I understood correctly, outcome 2 is not actually observed data, and usual cross-validation is not valid for it. You can still investigate the amount of information in the posterior, and you can assess whether normal distribution is appropriate, but you can’t estimate elpd as it is usually defined.

The second row looks suspicious, hinting there is some error in your code. It is very rare that p_loo would be much larger than the number of observations (you have 33447 >> 16000) and since lognormal model has p_loo about 115, it seems the model complexity is not very high (it helps if you tell the total number of parameters in the model).

I need to see the code, as now I don’t understand how did you compute log_lik values from the multivariate normal.

This also hints that you have an error in your code.

Hi @avehtari thank you very much for digging into my problem.

Here is my annotated code in hopes that it addresses all your questions.

Here is some more info that may be relevant:

  • both normal and lognormaal models fit the data well in terms of convergence (rhat and ess), except that they take quite long(5 hrs for 4 chains*). I sort of expected it due to the 16,000 datapoints, two models in one and ODE solver all going into the same script.
  • *One notable thing about the lognormal model is that one in four chains ends up taking 4-5x as long as the others and hits max treedepth often in sampling phase, I’m persuming because of difficulties during warmup. As you can imagine, 5x 4 hrs is almost a day, so I have 3 chains that finish after 5 hours and 1 chain that takes almost a day. I’m not sure if there is a way to bring all the chains more closer together: for example, I would prefer all chains to be as slow as 12 hours instead of 3 chains at 4hrs and 1 chain at 24hrs. I am using initial values but I haven’t tried increasing adapt_delta or max_treedepth beyond the cmdstanr defaults but not sure if that would help.
  • I used to fit ODEs separately using an ODE solver, obtain values of s1 and s2, and then use them in a stan code that only breaks down those competitive abilities. But I just recently found I can combine them both into one model and the model actually samples ok (if slow). So in my mind I still treat them as two separate models, also becaus I take s1 and s2 and use a separate design matrix to calculate a set of predictors: not sure if I am wrong and one would call this a transformed parameter in this case?
  //ODE system
  vector logistic(real t, //time
                         vector y, // state
                         real r, //average growth rate
                         real K, //carrying capacity / maximum total density of both y
                         real s1, // relative growth (dis)advantage of y1
                         real s2 // relative growth (dis)advantage of y2
                        ) {
      vector[2] dydt;
      dydt[1] = r*(1+s1)*y[1]*(1-(y[1]+y[2])/K);
      dydt[2] = r*(1+s2)*y[2]*(1-(y[1]+y[2])/K);
    return dydt;

//data for first model: fitting ODEs to get global (r,K,y0) and competition-specific (s1,s2) parameters
  int<lower=1> N; //rows in data
  int<lower=1> N_ID; //number of bacterial competition experiments
  array[N] int<lower=1,upper=N_ID> IDX; //experiment ID
  int<lower=1> N_dil; //number of dilution factors / initial density of bacteria added : we had two different values
  array[N_ID] int<lower=1,upper=N_dil> ID_dil; // dilution factor ID
  real t0; //initial timepoint (0)
  array[N] vector[2] yobs; //observed bacterial densities
  array[N] real ts; // timepoints
  int<lower=1> culturing_time; // number of timepoints per competition: same for all competitions
  // second model:  breaking down the selection coefficients s1,s2
  int<lower=0> N_X; //number of genotype components
  matrix[N_ID,N_X] X1; // design matrix for genotype of bacteria 1
   matrix[N_ID,N_X] X2; //design matrix for genotype of bacteria 2
  int<lower=0> N_pred; // number of yhat points I want to Stan to calculate for me (posterior prediction)
  matrix[N_pred,N_X] X_pred; //design matrix for predictions
  matrix[N_pred,N_X] X_pred_no_epistasis; // additional design matrix to test some hypothesis (not important to explain)
  //parameters for fitting logistic ODE model
  array[N_dil] real<lower=0> inoculum; //y0 is the same for both competitors (experimental constraint) but different for different dilutions: dictated by ID_dil column of dummy variables.
  real<lower=0> r; // average growth rate of all competition experiments:
  real<lower=0> K; // carrying capacity / max density of culture
  array[N_ID] vector[2]  s; //array of growth (dis)advantages for each strain
  real<lower=0> sigma; //experimental noise
  //parameters for genotype effects breakdown model
  vector[N_X] dc; //vector of genotype effects on competitive fitness
  real<lower=0> sigma_dc; //effect residual standard error
  cholesky_factor_corr[2] L; //cholesky factor of correlation between competitors
transformed parameters{
    //vector of multivariate means: 
// note s1 and s2 are interchangable (the order of who we call 1 and 2 doesn't matter)
// therefore both s1 and s2 go into estimating the same genotypic effects vector dc
   array[N_ID] vector[2] geffects;
    for(i in 1:N_ID){
      geffects[i,1] = X1[i]*dc;
      geffects[i,2] = X2[i]*dc;
  //identical sigmas for both selection coefficients (as, again, their position as competitor 1/competitor 2 is arbitrary)
  vector[2] L_Sigma = rep_vector(sigma_dc,2);

  // covariance matrix
  matrix[2,2] L_Cov = diag_pre_multiply(L_Sigma,L);

  sigma ~ exponential(1);  
  r ~ student_t(3,0.7,0.25); //expect 20-40% less than 1 doubling per hour since using minimal media
  K ~ normal(1,0.25); //expect 0.5-1.5
  s[,1] ~ normal(0,0.1);  // expecting small differences from mean growth rate
  s[,2] ~ normal(0,0.1); 
  inoculum ~ normal(0.01,0.03);  

  //manual ODE fitting for each one of the N_ID competition experiments, all with identical number of timepoints
  for(i in 1:N_ID){
//indexing first and last timepoint of culturing for vectors of length N like the yobs and ts (time) vector
    int start = ((i-1)*culturing_time) + 1; 
    int end = i*culturing_time;
//initialize y0 parameter constrained (experimentally) to be same for both competitors
    vector[2] y0 = rep_vector(inoculum[ID_dil[i]],2);  
//numerical integration step
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[i,1],s[i,2]); 
//fitting with experimental noise:
    target +=  normal_lpdf(yobs[start:end,1]|mu[,1],sigma); //likelihood for competitor y1
    target +=  normal_lpdf(yobs[start:end,2]|mu[,2],sigma); //likelihood for competitor y2
//NOTE for lognormal version these would be the likelihoods
   // target +=  lognormal_lpdf(yobs[start:end,1]|log(mu[,1]),sigma);
    //target +=  lognormal_lpdf(yobs[start:end,2]|log(mu[,2]),sigma); 

  //model 2: breaking down s1, s2 using design matrix X1,X2 to model differential genotypic effects
  dc ~ normal(0,0.1); //small genotypic effects
  L ~ lkj_corr_cholesky(2);
  sigma_dc ~ exponential(1);

    target+= multi_normal_cholesky_lpdf(s|geffects,L_Cov);
generated quantities {
  //convert cholesky matrix to correlation matrix
    corr_matrix[2] rho = multiply_lower_tri_self_transpose(L);

//here is how I calculate the two likelihoods
  vector[N] log_lik_ODE;

    for(i in 1:N) {
    int start = ((IDX[i]-1)*culturing_time) + 1;
    int end = IDX[i]*culturing_time;
    vector[2] y0 = rep_vector(inoculum[ID_dil[IDX[i]]],2);  //initialize y0 parameter constrained to be same for both competitors
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[IDX[i],1],s[IDX[i],2]); //numerical integration step
    int position = i%culturing_time + 1;
//not sure if adding these here is the real problem or not: just realized it's probably not correct
    log_lik_ODE [i] = normal_lpdf(yobs[position,1]|mu[position,1],sigma) +normal_lpdf(yobs[position,2]|mu[position,2],sigma); 

//model 2 likelihood, not pointwise with respect to data, but pointwise with respect to experiment I guess
// I called these costs in the previous post (log_like_cost) but just changing it here to genotype effects (could be costly or beneficial to have the gene, plus tried to make it more consistent but maybe I made it more confusing. hope not)
 vector[N_ID] log_lik_genotype_effect; 
for(i in 1:N_ID){
    log_lik_genotype_effect[i] = multi_normal_cholesky_lpdf(s[i] | geffects[i],L_Cov); 
    //posterior predictions: can ignore for my question
    vector[N_pred] y_pred = X_pred*dc;
    array[N_pred] real l_pred = normal_rng(y_pred,sigma_dc);
    vector[N_pred] y_pred_no_epi = X_pred_no_epistasis*dc;
    array[N_pred] real l_pred_no_epi = normal_rng(X_pred_no_epistasis*dc,sigma_dc);

Increasing these would make things only slower.

Good initial values is probably the easiest and fastest way to go.

Based on your model code, s is a parameter and thus using log_lik_genotype_effect doesn’t correspond to computing LOO-CV, but negative log entropy (plus possible constant) which measures just the concentration of the distribution but not whether the location is good.

log_lik_ODE [i] = normal_lpdf(yobs[position,1]|mu[position,1],sigma) +normal_lpdf(yobs[position,2]|mu[position,2],sigma);

This line looks right, but I’m not able to follow the indexing differences between the model block and generated quantitities block. I recommend to double check those.

You could get a significant speed-up for using a bigger tolerance threshold for the ODE solver. See more in [2205.09059] An importance sampling approach for reliable and efficient inference in Bayesian ordinary differential equation models with links to code.