I have quite a complicated model (I can provide additional information as well as a reprex if needed), but essentially it produces two outcomes from a single dataset.
The dataset has a not-too-small number (~1000) of competition experiments between two bacteria. Basically, there are two growth curves as both bacteria grow in the same test tube, and we can use those growth curves to estimate which is outcompeting which. Each competition has 16 timepoints, so the actual number of rows in the full dataset is ~16,000.
What the Stan program does:

1. Estimates competition outcomes using an ODE solver, to report the relative competitive strength of both bacteria in each experiment (the estimand is 1000 vectors of length 2: two competition coefficients, one per competitor, for each experiment).
2. Since we know the genotype of each bacterium, it breaks the competition coefficients inferred from the ODEs down into the genotypic differences between the bacteria. So, for example, the effect of gene A on competitive ability gets estimated from every bacterium that carries gene A in any of the 1000 competitions.
Thus we can notate the two outcomes in this way:

1. For the raw growth curves y_obs of the two bacteria:

y_obs ~ N(mu, sigma)

where mu is computed using ode_rk45.
Then we have the competition coefficients s, of type array[1000] vector[2], which are the parameters fed to the numerical integrator used to get mu above. These are broken down into the effects of the genes, which we call the 'costs' of the genes, and this outcome is modelled as a bivariate normal:
2. s ~ multi_normal_cholesky(costs, L_Cov)

where costs is a vector with the effect of each gene, and L_Cov is the Cholesky factor of the covariance matrix.
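To make the two outcomes concrete, here is a rough numpy sketch of the pointwise log-likelihoods involved (the names are made up for illustration; in the real model mu comes out of ode_rk45 inside Stan rather than being passed in):

```python
import numpy as np

def normal_loglik(y_obs, mu, sigma):
    # Outcome 1: pointwise normal log-density for the raw growth curves,
    # with mu taken from the ODE solution (here just an argument).
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y_obs - mu) / sigma) ** 2

def bivariate_normal_loglik(s, costs_mean, L_cov):
    # Outcome 2: log-density of one length-2 vector of competition
    # coefficients under multi_normal_cholesky(costs_mean, L_cov).
    z = np.linalg.solve(L_cov, s - costs_mean)   # whiten with the Cholesky factor
    log_det = np.sum(np.log(np.diag(L_cov)))     # = 0.5 * log|Sigma|
    return -np.log(2 * np.pi) - log_det - 0.5 * np.dot(z, z)
```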
Now I have been interested in doing some model comparison, for example modelling the noise for outcome 1 as lognormal rather than normal. So I made two scripts, ran them both on my data, and now have two loglik matrices:
- For outcome 1, I have 16,000 datapoints for which I calculate the log likelihood.
- For outcome 2, I have 1000 datapoints for which I calculate the log likelihood, but each datapoint is a pair of parameters extracted from a competition experiment (16 datapoints, or 32, since the performance of each bacterium is estimated together with its competitor's).
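For the comparison on outcome 1, the pointwise log-likelihoods I am comparing look roughly like this (illustrative names again); note that the lognormal version includes the -log(y) Jacobian term, so both are densities of the same y_obs, which I believe is what makes the two loglik matrices comparable in loo:

```python
import numpy as np

def normal_loglik(y, mu, sigma):
    # log density of y under Normal(mu, sigma)
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def lognormal_loglik(y, mu, sigma):
    # log density of y under LogNormal(mu, sigma); the -log(y) term is the
    # Jacobian that makes this a density of y itself, not of log(y)
    return -np.log(y) + normal_loglik(np.log(y), mu, sigma)
```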
In the end I am a little more interested in the performance for explaining outcome 2, since I care about the genetic differences. But when I ran the two models I got this funny result: lognormally distributed errors fit the ODE outcome better (LOO scores are much better for lognormal than for normal noise in outcome 1), but the model with normal noise for outcome 1 performs better on outcome 2 (though both actually perform rather badly on that one, for some reason).
Here is the loo comparison for outcome 1, comparing lognormal to normal errors in the ODE outcome (lognormal first, normal second):
Here is the loo comparison for outcome 2, comparing the same models. Now the normal model is first and the lognormal second (lognormal is worse):
Note there are 30 genes, so the p_loo is very inflated for both models.
I also found that when I simulate from this data generating process I get bad loo scores for outcome 1, even though I simulate from the exact data generating process that I use to write the model. The estimates themselves are good, though; I think my simulations may be underdetermined because I use very few datapoints when fitting the simulations (my actual data takes hours to run for each model; the model is large and slow, as you can imagine).
As I mentioned, I can happily provide the code and a reprex; it's just Sunday evening here and I wanted to get the question out, and hopefully follow up if someone is interested enough to take a dive into this rather large, complicated model. In any case, I would be most happy with any advice, generic or otherwise, that I may have missed. For instance: can I use the joint likelihood, even though one loglik matrix is much bigger than the other, and given that I care about one slightly more than the other?
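On the joint-likelihood idea, one option I have been considering is to define the loo "datapoint" at the experiment level: sum each experiment's 16 growth-curve log-likelihoods together with its outcome-2 log-likelihood, optionally with weights. A rough sketch, assuming loglik1 is draws x 16000 with columns ordered so that columns i*16 to (i+1)*16 belong to experiment i, and loglik2 is draws x 1000 (I realise the weights would change the predictive interpretation, which is part of my question):

```python
import numpy as np

def joint_experiment_loglik(loglik1, loglik2, n_exp=1000, n_t=16, w1=1.0, w2=1.0):
    # loglik1: (draws, n_exp * n_t) pointwise log-lik for outcome 1
    # loglik2: (draws, n_exp)        pointwise log-lik for outcome 2
    draws = loglik1.shape[0]
    # sum the n_t timepoint log-liks belonging to each experiment
    per_exp1 = loglik1.reshape(draws, n_exp, n_t).sum(axis=2)
    # one joint (draws, n_exp) matrix, with optional weights per outcome
    return w1 * per_exp1 + w2 * loglik2
```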
Thanks and good week to all… may all your stories eventually fit your data