LOO on large ODE fit: Am I interpreting it correctly?

Continuing the discussion from Multiple outcome model: conflicting LOO results?:

I am fitting a system of 2 ODEs simultaneously to ~1200 experiments, each with ~16 timepoints; some parameters are local to each experiment and some are global, so the number of parameters is large. All in all, N ~ 1200*2*16 ~ 38,000, with p ~ 2500 (simple model) and p ~ 3600 (more complex model). I am comparing 2 models: in one there is a global parameter K, and in the other the same parameter is modelled as a multilevel parameter. I include the model code for the non-multilevel-K (simpler) model at the end of the post.

When I perform loo-cv on both models, each has most of its ~30,000 Pareto-k values in the good region, about 30 in the ok region, about 4-6 in the bad region and 1-3 in the very bad region. The posterior predictions (how the ODEs trace the trajectories of the observed data) look great for both models, a little better for the more complex model. The more complex model also has a better elpd and “wins” in loo_compare:
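For readers less familiar with these diagnostics, the good/ok/bad/very bad regions correspond to the classic Pareto-k thresholds (0.5, 0.7, 1) that the loo package reports. A minimal pure-Python sketch of that binning; the example k values are made up:

```python
def pareto_k_summary(k):
    """Bin Pareto-k diagnostics into the four regions reported by the loo package."""
    return {
        "good": sum(ki <= 0.5 for ki in k),          # reliable PSIS estimate
        "ok": sum(0.5 < ki <= 0.7 for ki in k),      # usable, higher variance
        "bad": sum(0.7 < ki <= 1 for ki in k),       # estimate unreliable
        "very bad": sum(ki > 1 for ki in k),         # importance ratios have infinite variance
    }

print(pareto_k_summary([0.1, 0.4, 0.6, 0.8, 1.2]))
# -> {'good': 2, 'ok': 1, 'bad': 1, 'very bad': 1}
```

Newer loo versions also use a sample-size-dependent threshold, so the exact cutpoints in a given printout may differ slightly.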

I checked the few bad observations, and they look like outliers: they occur when the observed data is very low, and when observed values are low the signal-to-noise ratio is at its worst (a detection-limit kind of issue). So I thought, OK, maybe the model is fine and these are just outliers. But I have a hard time seeing how outliers could have a very large effect in a model with this many datapoints. A single experiment (a single ODE system) may be affected by a weird datapoint, but how could that possibly move the global parameters by much?

I tried to view the LOO-PIT plots and they looked sort of horrendous…
They both look something like this

It appears that if those outliers are influential, they may be very influential indeed. The PIT quantiles also seem to show that those few bad apples have a disproportionate influence.
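For intuition, a LOO-PIT value is just the PSIS-weighted probability that a replicated draw falls at or below the observed value. A toy sketch in Python; the draws and equal weights are invented for illustration, and bayesplot handles details (like randomization for discrete data) that this ignores:

```python
def loo_pit_value(y_i, yrep_draws, weights):
    """LOO-PIT for one observation: PSIS-weighted probability that a
    posterior-predictive draw falls at or below the observed value y_i.
    weights: normalized PSIS importance weights (sum to 1)."""
    return sum(w for w, yrep in zip(weights, yrep_draws) if yrep <= y_i)

# toy example: 4 draws, equal weights; 2 of 4 draws are <= 1.0
print(loo_pit_value(1.0, [0.2, 0.8, 1.5, 2.0], [0.25] * 4))  # -> 0.5
```

If the model is well calibrated, these values should look uniform on (0, 1); clumping near 0 or 1 is what makes the QQ plots look bad.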

When I use the compare="normal" option of the ppc_loo_pit_qq function, I seem to get a much more informative picture of the alignment between my data and the LOO-PIT, however:
One K simpler model:

Multilevel K more complex model:
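As I understand it, compare="normal" maps the LOO-PIT values through the standard-normal inverse CDF, so a well-calibrated model produces points that line up with standard-normal quantiles on the QQ plot. A stdlib-Python sketch of that transform; the clipping epsilon is my own guard against PIT values of exactly 0 or 1, not something bayesplot necessarily does the same way:

```python
from statistics import NormalDist

def pit_to_normal(pit, eps=1e-6):
    """Map LOO-PIT values through the standard-normal inverse CDF;
    well-calibrated PITs then look like standard-normal draws."""
    nd = NormalDist()
    return [nd.inv_cdf(min(max(p, eps), 1 - eps)) for p in pit]

print(pit_to_normal([0.5]))  # -> [0.0]
```

This stretches out the tails, which is why miscalibration at extreme PIT values (like the low-signal observations here) becomes much easier to see.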

As for refits:

I tried to refit the simple (one K parameter) model twice:

Firstly, I tried to remove the 6 bad/very bad datapoints and refit the model, and I got a loo-cv with 1 bad datapoint:

I also tried to remove the 6 bad/very bad datapoints as well as the ~30 ok datapoints and refit the model, and I got 3 new bad datapoints:
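The refit procedure above amounts to flagging observations whose Pareto-k exceeds a threshold and dropping them before re-running. A trivial Python sketch; the threshold 0.7 is loo's ok/bad boundary, and the k values are made up:

```python
def flag_for_refit(pareto_k, threshold=0.7):
    """Indices of observations whose Pareto-k exceeds the threshold,
    i.e. candidates to drop (or to cross-validate exactly) before refitting."""
    return [i for i, k in enumerate(pareto_k) if k > threshold]

print(flag_for_refit([0.2, 0.9, 0.3, 1.4, 0.6]))  # -> [1, 3]
```

An alternative to dropping them entirely is exact leave-one-out refitting for just the flagged points, which avoids changing the dataset.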

For reference, here's the full printout of the original loo-cv:

The LOO-PITs of the refits don't change. So maybe these were not real outliers, and instead my model is too flexible? It is possible that my model is misspecified in some of its features. The fact that the posterior predictions match pretty well, however, made me think that the model does a good enough job for my purposes.

Running this model takes about 1.5 hrs (originally 8 hrs, but runtimes improved thanks to ode_rk45_tol and other useful suggestions from this great paper suggested by @avehtari), which is kind of long for testing every single feature and trying a bunch of different models. I am not sure I want to go down the rabbit hole of model selection, diving into each individual feature of the model. I am a biologist and the fit looks good enough: I have a picture of the posterior predictive distribution over the data below. What I am wondering is how to interpret this information correctly, and/or what kind of additional information is needed to decide whether going down this rabbit hole is necessary.

Further information:

Here is a picture of the posterior prediction for the more complex model:

These are bacterial growth curves, the points were observed, and the lines are the posterior predictions.

Here's the model code:

functions {
  //ODE system
  vector logistic(real t, //time
                  vector y, // state
                  real r, //average growth rate
                  real K, //carrying capacity / maximum total density of both y
                  real s1, // relative growth (dis)advantage of y1
                  real s2 // relative growth (dis)advantage of y2
                  ) {
    vector[2] dydt;
    dydt[1] = (r+s1)*y[1]*(1-(y[1]+y[2])/K);
    dydt[2] = (r+s2)*y[2]*(1-(y[1]+y[2])/K);
    return dydt;
  }
}
data {
  //data for first model: fitting ODEs to get global (r,K,y0) and competition-specific (s1,s2) parameters
  int<lower=1> N; //rows in data
  int<lower=1> N_ID; //number of bacterial competition experiments
  array[N] int<lower=1,upper=N_ID> IDX; //experiment ID
  int<lower=1> N_dil; //number of dilution factors / initial density of bacteria added : we had two different values
  array[N_ID] int<lower=1,upper=N_dil> ID_dil; // dilution factor ID
  real t0; //initial timepoint (0)
  array[N] vector[2] yobs; //observed bacterial densities
  array[N] real ts; // timepoints
  int<lower=1> culturing_time; // number of timepoints per competition: same for all competitions
  // second model:  breaking down the selection coefficients s1,s2
  int<lower=0> N_X; //number of genotype components
  matrix[N_ID,N_X] X1; // design matrix for genotype of bacteria 1
  matrix[N_ID,N_X] X2; //design matrix for genotype of bacteria 2
  int<lower=0> N_pred; // number of yhat points I want to Stan to calculate for me (posterior prediction)
  matrix[N_pred,N_X] X_pred; //design matrix for predictions
  matrix[N_pred,N_X] X_pred_no_epistasis; // additional design matrix to test some hypothesis (not important to explain)
}
parameters {
  //parameters for fitting logistic ODE model
  array[N_dil] real<lower=0> inoculum; //y0 is the same for both competitors (experimental constraint) but different for different dilutions: dictated by ID_dil column of dummy variables.
  real<lower=0> r; // average growth rate of all competition experiments:
  real<lower=0> K; // carrying capacity / max density of culture
  array[N_ID] vector[2]  s; //array of growth (dis)advantages for each strain
  real<lower=0> sigma; //experimental noise
  //parameters for genotype effects breakdown model
  vector[N_X] dc; //vector of genotype effects on competitive fitness
  real<lower=0> sigma_dc; //effect residual standard error
  cholesky_factor_corr[2] L; //cholesky factor of correlation between competitors
}
transformed parameters {
    //vector of multivariate means: 
// note s1 and s2 are interchangable (the order of who we call 1 and 2 doesn't matter)
// therefore both s1 and s2 go into estimating the same genotypic effects vector dc
   array[N_ID] vector[2] geffects;
    for(i in 1:N_ID){
      geffects[i,1] = X1[i]*dc;
      geffects[i,2] = X2[i]*dc;
    }
  //identical sigmas for both selection coefficients (as, again, their position as competitor 1/competitor 2 is arbitrary)
  vector[2] L_Sigma = rep_vector(sigma_dc,2);

  // covariance matrix
  matrix[2,2] L_Cov = diag_pre_multiply(L_Sigma,L);
}
model {
  sigma ~ exponential(1);
  r ~ student_t(3,0.7,0.25); //expect 20-40% less than 1 doubling per hour since using minimal media
  K ~ normal(1,0.25); //expect 0.5-1.5
  s[,1] ~ normal(0,0.1);  // expecting small differences from mean growth rate
  s[,2] ~ normal(0,0.1); 
  inoculum ~ normal(0.01,0.03);  

  //manual ODE fitting for each one of the N_ID competition experiments, all with identical number of timepoints
  for(i in 1:N_ID){
//indexing first and last timepoint of culturing for vectors of length N like the yobs and ts (time) vector
    int start = ((i-1)*culturing_time) + 1; 
    int end = i*culturing_time;
//initialize y0 parameter constrained (experimentally) to be same for both competitors
    vector[2] y0 = rep_vector(inoculum[ID_dil[i]],2);  
//numerical integration step
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[i,1],s[i,2]); 
//fitting with experimental noise:
    target +=  normal_lpdf(yobs[start:end,1]|mu[,1],sigma); //likelihood for competitor y1
    target +=  normal_lpdf(yobs[start:end,2]|mu[,2],sigma); //likelihood for competitor y2
//NOTE for lognormal version these would be the likelihoods
   // target +=  lognormal_lpdf(yobs[start:end,1]|log(mu[,1]),sigma);
    //target +=  lognormal_lpdf(yobs[start:end,2]|log(mu[,2]),sigma);
  }

  //model 2: breaking down s1, s2 using design matrix X1,X2 to model differential genotypic effects
  dc ~ normal(0,0.1); //small genotypic effects
  L ~ lkj_corr_cholesky(2);
  sigma_dc ~ exponential(1);

  target += multi_normal_cholesky_lpdf(s|geffects,L_Cov);
}
generated quantities {
  //convert cholesky matrix to correlation matrix
    corr_matrix[2] rho = multiply_lower_tri_self_transpose(L);

//here is how I calculate the two likelihoods
  vector[N] log_lik_ODE;

    for(i in 1:N) {
    int start = ((IDX[i]-1)*culturing_time) + 1;
    int end = IDX[i]*culturing_time;
    vector[2] y0 = rep_vector(inoculum[ID_dil[IDX[i]]],2);  //initialize y0 parameter constrained to be same for both competitors
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[IDX[i],1],s[IDX[i],2]); //numerical integration step
    //position of row i within its experiment (1..culturing_time);
    //note yobs must be indexed by the data row i, while mu is indexed by position
    int position = i - (IDX[i]-1)*culturing_time;
    log_lik_ODE[i] = normal_lpdf(yobs[i,1]|mu[position,1],sigma) + normal_lpdf(yobs[i,2]|mu[position,2],sigma);
  }

//model 2 likelihood, not pointwise with respect to data, but pointwise with respect to experiment I guess
// I called these costs in the previous post (log_like_cost) but just changing it here to genotype effects (could be costly or beneficial to have the gene, plus tried to make it more consistent but maybe I made it more confusing. hope not)
 vector[N_ID] log_lik_genotype_effect; 
for(i in 1:N_ID){
    log_lik_genotype_effect[i] = multi_normal_cholesky_lpdf(s[i] | geffects[i],L_Cov);
  }
  //posterior predictions: can ignore for my question
  vector[N_pred] y_pred = X_pred*dc;
  array[N_pred] real l_pred = normal_rng(y_pred,sigma_dc);
  vector[N_pred] y_pred_no_epi = X_pred_no_epistasis*dc;
  array[N_pred] real l_pred_no_epi = normal_rng(y_pred_no_epi,sigma_dc);
}
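As a side note, the slice/position arithmetic in generated quantities can be sanity-checked outside Stan. A small pure-Python mirror of it, using 1-based indices to match Stan and culturing_time = 16 as in the data; a formula like i % culturing_time + 1 would wrap to 1 on the last row of each experiment, which is why the subtraction form is used:

```python
culturing_time = 16

def row_indices(i, idx):
    """For data row i (1-based) belonging to experiment idx, return the
    Stan slice bounds (start, end) and the row's position within its
    experiment, mirroring the generated-quantities arithmetic."""
    start = (idx - 1) * culturing_time + 1
    end = idx * culturing_time
    position = i - (idx - 1) * culturing_time
    return start, end, position

# row 17 is the first timepoint of experiment 2
print(row_indices(17, 2))  # -> (17, 32, 1)
# row 16 is the LAST timepoint of experiment 1 (16 % 16 + 1 would wrongly give 1)
print(row_indices(16, 1))  # -> (1, 16, 16)
```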

Comparing the p_loo values to the number of observations, the model is not globally too flexible, but it can be sensitive at the edges of the data.

The elpd_diff is very big, the ppc_loo_pit_qq clearly favors the more complex model, and the posterior predictions seem reasonable, so I think it's safe to say the more complex model is better, even if it's not perfect.
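That reading of loo_compare can be made concrete: a difference is usually treated as meaningful only when it is large both in absolute terms and relative to its standard error. A sketch of that rule of thumb; the cutoff of 4 and the factor of 2 are commonly cited heuristics rather than hard rules, and the example numbers are hypothetical:

```python
def loo_decision(elpd_diff, se_diff):
    """Rule-of-thumb reading of loo_compare output: treat the elpd
    difference as meaningful only if it is large in absolute terms
    and large relative to its standard error."""
    return abs(elpd_diff) > 4 and abs(elpd_diff) > 2 * abs(se_diff)

print(loo_decision(-35.0, 8.0))  # hypothetical values -> True
```

When the difference is small or comparable to se_diff, the honest conclusion is that the models are indistinguishable by elpd, whatever the sign of the difference.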
