Continuing the discussion from Multiple outcome model: conflicting LOO results?:
I am fitting a system of 2 ODEs simultaneously to ~1200 experiments, each with ~16 timepoints. Some parameters are local to each experiment and some are global, so the number of parameters is large. All in all, N ~ 1200*2*16 ~ 38,000, with p ~ 2500 (simpler model) and p ~ 3600 (more complex model). I am trying to compare 2 models: in one there is a single global parameter `K`, and in the other the same parameter is modelled as a multilevel parameter. I include the model code for the non-multilevel-K (simpler) model at the end of the post.
When I perform loo-cv on both models, each one has most of the ~30,000 values in the good region, about 30 in the ok region, about 4-6 in the bad region, and 1-3 in the very bad region. The posterior predictions (how the ODEs trace the trajectories of the observed data) look great for both models, and a little better for the more complex model. The more complex model also has a better elpd score and "wins" in `loo_compare`:
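For reference, the comparison step looks roughly like this (a minimal sketch assuming cmdstanr fits named `fit_simple` and `fit_multilevel`; `log_lik_ODE` is computed in the generated quantities block at the end of the post):

```r
library(loo)

# pointwise log-likelihood matrices (iterations x observations)
ll_simple     <- fit_simple$draws("log_lik_ODE", format = "matrix")
ll_multilevel <- fit_multilevel$draws("log_lik_ODE", format = "matrix")

# ideally pass r_eff = relative_eff(...) too; omitted here for brevity
loo_simple     <- loo(ll_simple)
loo_multilevel <- loo(ll_multilevel)

loo_compare(loo_simple, loo_multilevel)  # the multilevel-K model comes out on top
```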
I checked the few bad observations, and they seem like outliers: they occur when the observed data is very low, and when observed values are low the signal-to-noise ratio is at its worst (a detection-limit kind of issue). So I thought, OK, maybe the model is fine and these are just outliers. Still, I have a hard time seeing how outliers could have a very large effect in a model with this many datapoints. A single experiment (a single ODE system) may be affected by a weird datapoint, but how could that possibly affect the more global parameters by much…
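To find those observations, I'm pulling the Pareto-k values out directly (a sketch, continuing from the `loo_simple` object above; the index mapping assumes my rectangular 16-timepoints-per-experiment layout):

```r
# indices of observations with Pareto-k above the 0.7 threshold
bad_idx <- pareto_k_ids(loo_simple, threshold = 0.7)
pareto_k_values(loo_simple)[bad_idx]

# map a flat observation index back to experiment and timepoint
culturing_time <- 16
data.frame(
  obs        = bad_idx,
  experiment = ceiling(bad_idx / culturing_time),
  timepoint  = (bad_idx - 1) %% culturing_time + 1
)
```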
I tried to view the LOO-PIT plots and they looked sort of horrendous… They both look something like this:

It appears that if those outliers are influential, they may be very, very influential. I tried the PIT quantiles and they also suggested that those few bad apples may have a disproportionate influence.
When I use the compare="normal"
functionality in the ppc_loo_pit_qq
function, I seem to get much more information of the alignment between my data and LOO-PIT, however:
One K simpler model:
Multilevel K more complex model:
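(The calls producing these plots look roughly like this; a sketch where `y` would be one competitor's observed densities, `yrep` the matching matrix of posterior predictive draws, and the log-weights come from PSIS on the negative log-likelihood:)

```r
library(bayesplot)

psis_obj <- psis(-ll_simple)  # importance ratios for leave-one-out
lw <- weights(psis_obj)       # smoothed log-weights

ppc_loo_pit_qq(y = y, yrep = yrep, lw = lw, compare = "normal")
```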
As for refits, I tried to refit the simple (one-`K`) model twice:

First, I removed the 6 bad/very bad datapoints and refit the model, which gave a loo-cv with 1 bad datapoint:

Second, I removed the 6 bad/very bad datapoints plus the ~30 ok datapoints and refit, which gave 3 new bad datapoints:

For reference, here's the full printout of the original loo-cv:
The LOO-PITs of the refits don't change. So maybe they were not real outliers, but my model is too flexible? It is possible that my model is misspecified in some of its features. The posterior predictions matching pretty well, however, made me think that the model does a good enough job for my purposes.
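(This is how I'm summarising each refit; a sketch assuming the rerun fit is `fit_refit` and using the usual k thresholds:)

```r
loo_refit <- loo(fit_refit$draws("log_lik_ODE", format = "matrix"))
print(loo_refit)

# bucket the Pareto-k values into the usual diagnostic categories
table(cut(pareto_k_values(loo_refit),
          breaks = c(-Inf, 0.5, 0.7, 1, Inf),
          labels = c("good", "ok", "bad", "very bad")))
```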
Running this model takes about 1.5 hrs (it was originally 8 hrs, but runtimes improved thanks to `ode_rk45_tol` and other useful suggestions from this great paper suggested by @avehtari), which is kind of long for testing every single feature and trying a bunch of different models. I am not sure I want to go down the rabbit hole of model selection, diving into each individual feature of the model. I am a biologist and the fit looks good enough: there is a picture of the posterior predictive distribution over the data below. What I am wondering is how to interpret this information correctly, and/or what kind of additional information is needed to decide whether going down this rabbit hole is necessary.
Further information:
Here is a picture of the posterior predictions for the more complex model:
These are bacterial growth curves: the points are the observed data, and the lines are the posterior predictions.
Here's the model code:
```stan
functions{
  //ODE system: two competing logistic growers sharing one carrying capacity
  vector logistic(real t,   //time
                  vector y, //state
                  real r,   //average growth rate
                  real K,   //carrying capacity / maximum total density of both y
                  real s1,  //relative growth (dis)advantage of y1
                  real s2   //relative growth (dis)advantage of y2
                  ) {
    vector[2] dydt;
    dydt[1] = (r+s1)*y[1]*(1-(y[1]+y[2])/K);
    dydt[2] = (r+s2)*y[2]*(1-(y[1]+y[2])/K);
    return dydt;
  }
}
data{
  //data for first model: fitting ODEs to get global (r,K,y0) and competition-specific (s1,s2) parameters
  int<lower=1> N;                              //rows in data
  int<lower=1> N_ID;                           //number of bacterial competition experiments
  array[N] int<lower=1,upper=N_ID> IDX;        //experiment ID
  int<lower=1> N_dil;                          //number of dilution factors / initial densities of bacteria added: we had two different values
  array[N_ID] int<lower=1,upper=N_dil> ID_dil; //dilution factor ID
  real t0;                                     //initial timepoint (0)
  array[N] vector[2] yobs;                     //observed bacterial densities
  array[N] real ts;                            //timepoints
  int<lower=1> culturing_time;                 //number of timepoints per competition: same for all competitions
  //second model: breaking down the selection coefficients s1,s2
  int<lower=0> N_X;                            //number of genotype components
  matrix[N_ID,N_X] X1;                         //design matrix for genotype of bacteria 1
  matrix[N_ID,N_X] X2;                         //design matrix for genotype of bacteria 2
  int<lower=0> N_pred;                         //number of yhat points I want Stan to calculate for me (posterior prediction)
  matrix[N_pred,N_X] X_pred;                   //design matrix for predictions
  matrix[N_pred,N_X] X_pred_no_epistasis;      //additional design matrix to test a hypothesis (not important to explain)
}
parameters{
  //parameters for fitting logistic ODE model
  array[N_dil] real<lower=0> inoculum; //y0 is the same for both competitors (experimental constraint) but differs between dilutions, as dictated by the ID_dil column of dummy variables
  real<lower=0> r;                     //average growth rate of all competition experiments
  real<lower=0> K;                     //carrying capacity / max density of culture
  array[N_ID] vector[2] s;             //array of growth (dis)advantages for each strain
  real<lower=0> sigma;                 //experimental noise
  //parameters for genotype effects breakdown model
  vector[N_X] dc;                      //vector of genotype effects on competitive fitness
  real<lower=0> sigma_dc;              //effect residual standard error
  cholesky_factor_corr[2] L;           //Cholesky factor of correlation between competitors
}
transformed parameters{
  //vector of multivariate means:
  //note s1 and s2 are interchangeable (the order of who we call 1 and 2 doesn't matter),
  //therefore both s1 and s2 go into estimating the same genotypic effects vector dc
  array[N_ID] vector[2] geffects;
  for(i in 1:N_ID){
    geffects[i,1] = X1[i]*dc;
    geffects[i,2] = X2[i]*dc;
  }
  //identical sigmas for both selection coefficients (as, again, their position as competitor 1/competitor 2 is arbitrary)
  vector[2] L_Sigma = rep_vector(sigma_dc,2);
  //covariance matrix (Cholesky factor)
  matrix[2,2] L_Cov = diag_pre_multiply(L_Sigma,L);
}
model{
  //priors
  sigma ~ exponential(1);
  r ~ student_t(3,0.7,0.25); //expect 20-40% less than 1 doubling per hour since using minimal media
  K ~ normal(1,0.25);        //expect 0.5-1.5
  s[,1] ~ normal(0,0.1);     //expecting small differences from the mean growth rate
  s[,2] ~ normal(0,0.1);
  inoculum ~ normal(0.01,0.03);
  //manual ODE fitting for each of the N_ID competition experiments, all with an identical number of timepoints
  for(i in 1:N_ID){
    //index the first and last timepoints of culturing for vectors of length N like yobs and ts (time)
    int start = ((i-1)*culturing_time) + 1;
    int end = i*culturing_time;
    //initialize the y0 parameter, constrained (experimentally) to be the same for both competitors
    vector[2] y0 = rep_vector(inoculum[ID_dil[i]],2);
    //numerical integration step
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[i,1],s[i,2]);
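    //NOTE: the runtime improvement mentioned above came from switching to the
    //tolerance-controlled variant; the tolerances/max steps below are placeholders
    //I'd tune per system, not recommendations:
    //array[culturing_time] vector[2] mu = ode_rk45_tol(logistic,y0,t0,ts[start:end],
    //                                                  1e-6,1e-6,10000,r,K,s[i,1],s[i,2]);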
    //fitting with experimental noise:
    target += normal_lpdf(yobs[start:end,1]|mu[,1],sigma); //likelihood for competitor y1
    target += normal_lpdf(yobs[start:end,2]|mu[,2],sigma); //likelihood for competitor y2
    //NOTE: for the lognormal version these would be the likelihoods:
    //target += lognormal_lpdf(yobs[start:end,1]|log(mu[,1]),sigma);
    //target += lognormal_lpdf(yobs[start:end,2]|log(mu[,2]),sigma);
  }
  //model 2: breaking down s1,s2 using design matrices X1,X2 to model differential genotypic effects
  //priors
  dc ~ normal(0,0.1); //small genotypic effects
  L ~ lkj_corr_cholesky(2);
  sigma_dc ~ exponential(1);
  //likelihood
  target += multi_normal_cholesky_lpdf(s|geffects,L_Cov);
}
generated quantities {
  //convert Cholesky factor to correlation matrix
  corr_matrix[2] rho = multiply_lower_tri_self_transpose(L);
  //here is how I calculate the two likelihoods
  vector[N] log_lik_ODE;
  for(i in 1:N) {
    int start = ((IDX[i]-1)*culturing_time) + 1;
    int end = IDX[i]*culturing_time;
    vector[2] y0 = rep_vector(inoculum[ID_dil[IDX[i]]],2); //initialize y0, constrained to be the same for both competitors
    //numerical integration step (note this re-solves each experiment's ODE once per observation;
    //it could be restructured to solve once per experiment)
    array[culturing_time] vector[2] mu = ode_rk45(logistic,y0,t0,ts[start:end],r,K,s[IDX[i],1],s[IDX[i],2]);
    //position of observation i within its experiment's block of timepoints
    //(I originally had i%culturing_time + 1 here and indexed yobs by position, which was wrong)
    int position = i - start + 1;
    log_lik_ODE[i] = normal_lpdf(yobs[i,1]|mu[position,1],sigma) + normal_lpdf(yobs[i,2]|mu[position,2],sigma);
  }
  //model 2 likelihood: not pointwise with respect to the data, but pointwise with respect to each experiment
  //(I called these costs in the previous post (log_lik_cost) but am changing the name here to genotype effects,
  //since having a gene could be costly or beneficial; hopefully more consistent rather than more confusing)
  vector[N_ID] log_lik_genotype_effect;
  for(i in 1:N_ID){
    log_lik_genotype_effect[i] = multi_normal_cholesky_lpdf(s[i]|geffects[i],L_Cov);
  }
  //posterior predictions: can be ignored for my question
  vector[N_pred] y_pred = X_pred*dc;
  array[N_pred] real l_pred = normal_rng(y_pred,sigma_dc);
  vector[N_pred] y_pred_no_epi = X_pred_no_epistasis*dc;
  array[N_pred] real l_pred_no_epi = normal_rng(y_pred_no_epi,sigma_dc);
}
```