I was able to use Stan to fit an ODE-based model of gene regulation. Now I wonder what is a good way of deciding whether the model is a good fit and hence whether the proposed regulation is supported by the data. The motivating example is the contrast between those two fits. Lines are draws from the posterior predictive distribution, big dots are actual measurements:
Visually, the first example (codY) fits the data well, while the second example (pyrB) fails to capture the low expression around the 10 minute time mark. This indicates that the proposed regulation is plausible for codY, but not plausible for pyrB. Is there a way to formalize this line of reasoning? Everything I could find about model comparison seems not well suited for my case. Am I missing something basic and well understood? Or am I firmly in the territory of heuristics?
To be precise, I usually have multiple possible regulation variants for each target gene, increasing in complexity (number of parameters) and I want to be able to stop with the least complex model that “fits the data well”. When I search the literature/blogs/this forum for “model selection” the content focuses on prediction and cross validation, usually suggesting some kind of hyper-model representing a weighted average of all possible models. I am however not that much interested in prediction. In fact, for some of the data I work with, prediction improves even after the model ceases to be biologically interpretable/relevant, so optimizing prediction seems to be a wrong direction. What I want is to get some interpretable number that would reflect my confidence that the model is a plausible explanation for the data, so that a biologist can make some further decisions based on this information.
Note that in general, I cannot claim that my set of candidate models contain a model that is correct (not even approximately correct), which I think is an impediment to computing probability of the individual models in an encompassing hypermodel. I need to able to recognize (automatically) that none of the candidate models fit the data well.
Another fun problem is that when the data can be fit by one of the simple models, the more complex models may become non-identifiable and Stan diverges on them. So in general I cannot even fit all of the models (or - I think - a combined hyper model).
Some directions I’ve been thinking about:
Comparing the distribution of quantiles of measurements in the model data (e.g., ecdf(samples)(measured_value)) to uniform distribution (with NHST or KL-divergence)
Calculate some statistic of distance between the measurements and the samples
Use Kolmogorov-Smirnov test for the measured data either a) separately for each replicate or b) the whole bag of posterior samples
If you’re looking for something straightforward, look at AIC and related methods because you: 1) have a pre-specific set of models to compare; and 2) you’re looking for the simplest well-supported model with good fit.
You’ll run into the issue that there’s not clear way to say exactly where to stop but I like comparing a few top models anyway since they’re likely to be related.
Yes I think he’s completely wrong to say he’s not considering prediction. When you bring up goodness of fit you’re basically talking about in-sample prediction… but I don’t like to lead with that in my answers.
For the case you describe, there is no easy good solution. All good solutions here require thinking what does interpretably mean and what are possible further decisions to be made.
If the model is not able to predict at all, what further decisions a biologist could do? So it seems you want the model to be good enough for some prediction, but you are willing to accept reduced predictive performance if the model is easier to explain. Then you have a decision task where you have cost for lost predictive performance and cost for the explanation of the model. It’s not trivial to come up with good cost functions for these, especially if don’t know what is that further decision a biologist is going to make. I would recommend building a sequence of models from simpler to most complex. If you have sensible priors and do the Bayesian inference well then the most complex is also the one giving the best predictions. If you don’t know what is the further decision to be made, you need to consider giving the whole predictive distribution for the decision maker and then natural utility is log-score and corresponding cost is KL-divergence from the most complex model. Of course if you know which decisions biologists are going to make then you could use decision task specific utility or cost. When you have the sequence of models, you can start describing these models to a biologist and tell how much they are losing if they stay with that model and how much more accuracy they will gain if they use the next more complex model. Again it would be easier if you know what is the decision they want to make.
This sentence is not completely clear, but sounds like posterior predictive checking which is good.
But if you use those samples to fit models, then you are using data twice, and this gives you over-optimistic estimate. That’s why you need to use cross-validation (or “bias corrections” in *IC).
Thanks for all the inputs! In short, AIC/WAIC seems it could fulfill what I am after, will test and let you know about the results. Some more thoughts/rambling follow.
I am definitely open to the possibility that I am completely wrong :-D Also I don’t think I should ignore prediction, just that optimizing for prediction seems to lead me to more complex models than I think is justifiable by the data+context.
The main problem IMHO is that you generally want to extrapolate from a single/a few experiment(s) to what is actually happening in the cells. But the experiments have many degrees of freedom that cannot be randomized. Just choosing a medium the cells grow on is at least 5 continuous parameters with highly non-linear effects on outcome. From my understanding (I am not a biologist), the experimenter basically tweaks the setup until they get reproducible results (in practice this can reduce to something like “similarly looking result at least twice in a row” or even “the cells don’t die most of the time”). This is not to bash the practices of the field, it just happens to be really hard to do most experiments. Sometimes people spend years before they are able to get the experiment running.
For this reason, measures akin to cross validation are IMHO not that relevant, because you know for sure that your data is not representative and squeezing it too hard is just chasing noise. On the other hand, we have a reasonably good understanding on how gene regulation happens at the level of molecules. You have to simplify a lot to be able to do inference over the model, but it lets us believe that extrapolating from the few observed cases is not completely doomed to failure.
Most often you just want to rank the possible regulations in the order of the level of trust you have in the results. The most direct further decision is then choosing some individual regulations for an in vitro or even in vivo experiment that can (in most cases) determine if the regulation takes place with high certainty. Those experiments are expensive, so you can do them on ~1% of the candidates.
AIC/WAIC and cross-validation make exactly the same assumptions. If you accept AIC/WAIC, then you accept cross-validation. Cross-validation has some computational and diagnostic advantages compared to AIC/WAIC.
Sorry about being too emphatic there. :) when you optimize for prediction in this context the problem (you point out) is that you’re ignoring uncertainty in whatever metric you use. IDK if stacking addresses that directly but you should use something that does. So you could even use cross-validate MSE and you’ll see that at some point the model improvements are small enough that you really can’t tell whether adding more complexity produces a real improvement.
That’s pretty much what it looks like (I spent a year or so doing cultures of an intracellular bacterium in cell culture at some point…) Really you’re bringing up another issue here where if the different experiments were optimized separately the data sets may not be comparable at all… but that can’t be fixed by a statistical method so there’s probably not much sense to worrying about it.
Good point, I should have read up on the metrics in more detail.
I hoped WAIC could be a useful and fast-to-compute metric anyway, so I tried WAIC and LOO (from the loo package) and
a) WAIC and LOOIC are almost the same
b) The package complains that some diagnostics are not OK for both WAIC and LOO (maybe because it is time series so some of the assumptions break?)
c) The ordering of different regulations by WAIC/LOOIC corresponds reasonably well to my intuitive ordering and it seems to be possible to define a soft threshold of good/bad fit over WAIC/LOOIC
There are probably more specific recommendations out there but if you’re doing time-series CV-like things then you need to at least do it block-wise since a good prediction at time t gives a similarly good prediction at time t+1 for trivial models.
There’s a really funny epidemic prediction paper out there that claims amazing predictive ability for a system with seasonal epidemics. The way it works out is that they use current state to predict future state in a binary (epidemic/non-epidemic) time-series and in a seasonal system that gives you exactly two chances to be wrong every year (10 chances to be right if you’re doing monthly predictions, and 50 chances to be right if you’re doing weekly predictions…). I’ll refrain from linking here.
Warnings do not relate to time series, but indicate that there are highly influential observations, which are difficult to predict if left out (aka outliers).
This depends on the prediction task. Even for time-series leave-one-out can be reasonable especially if it’s more about whether the model can explain the current data than about predicting what happens after time point 50. For this expression data, I don’t think that block-cv would be any better than loo. block-cv assumes exchangeability of the blocks, but this expression data is clearly non-stationary, ie it always starts from zero, goes quickly up, and then comes back down slowly. If there would be several time series for the same gene (repeated expression in time measurements), then leaving out repeated expression in time measurements as a block of data would make sense (ie if we repeat the experiment can we predict the expression curve).
Not all time series are same prediction tasks are same.
If we want to predict for time t=n+1 (1-step ahead prediction) or for time t=n+1,…n+m (m-step ahead prediction) given observations at t=1,…n, we need to take into account the dependency in time and need to leave a block of data to simulate that we don’t know the values after our prediction range. Prediction for the future is sensible task in predicting epidemics, as we want to know when extra resources are needed and when the epidemic is over. In the expression data example it is unlikely that we would want to predict beyond t=50 (see figures above), or that we would have cases where we would start seeing observations and would need to be able to predict what happens next. Expression time series is observed usually completely, and one way is to think that we are just interested how well the curve fits the data, but to avoid double use of data, we use leave-one-out-cv trick. I think it’s also sensible to state this as a prediction task for evaluating the model so that can we predict a missing observation.
I’m not certain whether I should have mentioned exhangeability assumption as it’s trickier. If leave-one-out is ok for simulating a task of predicting a missing observation, then why not consider a task of predicting a missing block of observations? If we use the logic of time dependency length for selecting the block size, we are likely to get big blocks here, but then I would say that we are clearly violating conditional exchangeability as the beginning, middle and end of the curve are a priori so different. In leave-one-out the prior assumption for one missing observation given nearby observation can be more similar in different parts of the space (and note conditional exchangeability can include conditioning on elaborate modeling assumptions).
Thanks for all the thoughts. I think I’ll stick to WAIC/LOOIC, as it works reasonably well for my case and full cross-validation would make the method prohibitively computationally expensive (it already is quite slow - I need to fit hundreds of genes with multiple model types).
However, marginal likelihoods are extremely sensitive to the parameters priors which makes their use non-trivial. Also note that some of the developers and persons behind Stan do not like this approach.
Marginal likelihoods can be useful in cases where the models that are compared with each others are all nested in one large supermodel. In this case, the default priors approach of Jeffreys can be used and some of the problems disappear.