Sivula, Magnusson, Matamoros and Vehtari invoke the concept of “oracle distribution”, and a paper that they cite makes quite heavy use of the concept — particularly in reference to an “oracle model” and “oracle inequalities”.

Could someone please try to explain/illustrate the latter two concepts, preferably in layman’s terms or by means of a simple R simulation? (Stan models should preferably be wrapped into brms syntax). I’m not a fluent reader of highly abstract mathematical notation, but I’ll always try.

While I find Sivula et al’s “oracle distribution” to be an easy enough concept to grasp, the “oracle model” and especially “oracle inequality” of the latter article cause trouble.

My own, potentially mistaken interpretation of “oracle model” so far, with regard to predictive performance, is that it is the best-predicting model which can be fit to the dataset at hand, using the covariates at hand. In other words, its predictive performance is necessarily capped by A) how many relevant covariates are available (=have been measured), and B) how representative the dataset is of the true data-generating mechanism. If this is correct, then it follows that the “oracle model” need not have the best elpd_loo in the dataset — any number of untrue models may outrank it on this metric IF the dataset is sufficiently unrepresentative of the true DGM. However, the oracle model will always be At Least Tied for the best true elpd within the set of models that can be fit to the dataset at hand.

But even if the above is correct (a big if!), “oracle inequality” is even harder to comprehend. How can inequality be desirable? Yet that’s what I gather from the article, despite failing to understand precisely what it is. The screenshot below has a relevant excerpt from the article:

Sorry for the slight bump, but I’ve updated the original question with screenshots of what I believe to be key sections of the 2nd article, in order to lower the threshold for people to take a look and weigh in.

In Arlot and Celisse l() is defined as an average (see Section 1.1), so it’s not about the dataset at hand.

The key is understanding the difference between $\hat{s}_{\hat{m}(D_n)}$ and $\hat{s}_{m}$, and that the right-hand side is equivalent to using the oracle model. So the inequality says that there is a bound on how much worse a model selected using finite data is compared to the oracle model (on average). There are then some proofs of how this bound (i.e. the constants $C_n$ and $R_n$) behaves when $n\rightarrow\infty$. Arlot and Celisse introduce the notation in a somewhat scattered way, and its nested structure makes it difficult to follow for me, too.
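For anyone reading without the screenshot, the general shape of the inequality under discussion can be sketched as below. This is reconstructed from the symbols named in this thread ($\ell$, $\hat{s}_m$, $\hat{m}(D_n)$, $C_n$, $R_n$), so treat it as an assumption and check the exact statement and its conditions against Arlot and Celisse's paper.

```latex
\ell\bigl(s,\ \hat{s}_{\hat{m}(D_n)}(D_n)\bigr)
  \;\le\;
  C_n \,\inf_{m \in \mathcal{M}_n} \ell\bigl(s,\ \hat{s}_m(D_n)\bigr) \;+\; R_n
```

The left-hand side is the excess loss of the model picked by the selection procedure $\hat{m}$ from the data $D_n$. The infimum on the right-hand side is the excess loss the oracle would achieve by picking the best candidate in $\mathcal{M}_n$. Read this way, the inequality is a guarantee: the selected model is at most a factor $C_n$ (ideally close to 1) plus a remainder $R_n$ (ideally small) worse than the oracle's choice.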

If the oracle model isn’t limited by the sample and variables at hand, what is it limited by? The oracle model probably isn’t the same thing as the true DGM, because p. 46 says “s is not required to belong to $\bigcup_{m\in\mathcal{M}_n}S_m$”, and because Section 1.1 appears to define l() as “excess loss” relative to the true DGM. If the oracle model was the true DGM, it could not have excess loss relative to the true DGM.

Maybe the oracle model is limited by $\mathcal{M}_n$, the set of candidate models under consideration? But isn’t that set of candidates inevitably limited by what variables have been measured in the sample?

Thank you, this notion makes good sense. I just wish I could see how it follows from the formulas, because having to simply “take it from an authority” isn’t ideal. Hopefully someone less busy than you (or even myself) can concretize this with an example at some point.

I’m also still a bit hazy on why the “inequality” is a good thing in a model-selection procedure, rather than equality…

Sorry for not being clear on this. Arlot and Celisse write from the frequentist perspective, so even if what is measured is fixed the data is assumed to be random, and the average is over all possible datasets of size n and with defined measured variables.

It doesn’t need to be. It is the best model on average over the infinitely many possible datasets you might have observed (as Arlot and Celisse discuss this in a frequentist context).

The selection process unavoidably causes worse performance compared to directly using the oracle model, and thus you can’t have equality. The inequality with the constants is just a way to present a bound and then focus on how these two constants behave asymptotically.
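A small simulation may make the selection cost concrete. The sketch below is only an illustration under assumptions of my own, none of which come from Arlot and Celisse: a hypothetical DGM with one standard-normal covariate, two candidate models (intercept-only and simple linear regression fit by least squares), squared-error excess loss, and leave-one-out cross-validation as the selection procedure.

```python
import random


def simulate_selection(n=20, beta=0.3, sigma=1.0, reps=2000, seed=1):
    """Average excess loss of two fixed models, a LOO-selected model,
    and the per-dataset oracle, over `reps` simulated datasets."""
    rng = random.Random(seed)
    totals = {"null": 0.0, "linear": 0.0, "selected": 0.0, "oracle": 0.0}
    for _ in range(reps):
        # Hypothetical DGM: x ~ N(0, 1), y = beta * x + N(0, sigma^2) noise.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        a = ybar - b * xbar

        # Excess squared-error loss on new data relative to the true
        # regression function beta * x (closed form because x ~ N(0, 1)).
        loss_null = ybar ** 2 + beta ** 2
        loss_linear = a ** 2 + (b - beta) ** 2

        # Exact leave-one-out mean squared error via the leverage shortcut.
        loo_null = sum(((y - ybar) / (1 - 1 / n)) ** 2 for y in ys) / n
        loo_linear = sum(
            ((y - a - b * x) / (1 - 1 / n - (x - xbar) ** 2 / sxx)) ** 2
            for x, y in zip(xs, ys)
        ) / n

        # The selection procedure only sees the LOO estimates;
        # the oracle sees the true excess losses.
        selected = loss_null if loo_null <= loo_linear else loss_linear
        totals["null"] += loss_null
        totals["linear"] += loss_linear
        totals["selected"] += selected
        totals["oracle"] += min(loss_null, loss_linear)
    return {name: total / reps for name, total in totals.items()}


selection_results = simulate_selection()
for name, value in selection_results.items():
    print(f"{name:>8}: {value:.4f}")
```

Here "oracle" picks the better model separately for each dataset, so the "selected" average can never fall below it. The smaller of the "null" and "linear" averages corresponds to the "best model on average" reading of the oracle. The gap between "selected" and "oracle" is the kind of quantity that the constants in the inequality are meant to bound.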

So in a frequentist context, on average, the oracle model is the best possible model that can be fit given this DGM and this set of covariates (interactions and transformations allowed). And it is usually worse than the “true model”, particularly because real datasets rarely measure every relevant covariate. Correct?

I’m trying to wrap my head around why this should be “unavoidable”. What if it’s a fairly simple DGM, and we happen to have measured every relevant covariate (in addition to some irrelevant ones)? Why is it unavoidable that we will select an imperfect model even then?

Is it because whatever functional form we choose for a quantitative covariate, it will never be exactly linear/logarithmic/quadratic/…, and because there will always be some interactions (at least near-zero ones) that we’ll have missed? Or is it because of something else entirely?

(I’d also like to remind everyone that Aki is not the only one allowed to reply :>)

It’s worse than the true model if it’s not the true model.

It’s the same reason why, even if we have a single model, estimating the parameters (or, in the Bayesian case, forming the posterior and integrating over it) is less efficient than knowing the oracle parameter value (either the true value or the value minimizing the KL divergence). So this is not special to model selection, but happens in general when parameter/model choices are made based on data instead of being given by the oracle, even if the oracle parameter value/model is included in the parameter space/set of models.
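The single-model case can be checked with a few lines of simulation. This is a minimal sketch under assumptions chosen purely for illustration (a normal model with known sigma, squared-error loss, and the sample mean as the estimator); the names and numbers are hypothetical.

```python
import random


def simulate_estimation(n=10, mu=2.0, sigma=1.0, reps=20000, seed=1):
    """Average excess squared-error loss from estimating mu
    instead of being handed its oracle value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(sample) / n
        # Predicting a new y with mu_hat instead of mu costs (mu_hat - mu)^2
        # in extra expected squared error; the oracle value mu costs nothing extra.
        total += (mu_hat - mu) ** 2
    return total / reps


estimation_gap = simulate_estimation()
print(f"average excess loss with estimated mu: {estimation_gap:.4f}")
print(f"theoretical value sigma^2 / n:         {1.0 ** 2 / 10:.4f}")
```

The model is correct and contains the oracle value of mu, yet the average excess loss is strictly positive (about sigma^2 / n) because mu is estimated from finite data. It shrinks as n grows but is never exactly zero.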

Oh. I guess you’re saying the “unavoidable” inferiority is due to the SEs, which (I suppose) are zero for every parameter in the oracle model.

So if I’ve understood correctly, a frequentist oracle model is a model that, with a view to infinite resampling with the same N, makes perfect use of the available variables, without any uncertainty.

What if the DGM is really simple, the correct predictors have been measured, the sample is perfectly representative of the DGM (to the relevant number of decimals), and we are frequentists i.e. use point estimates only? If we guess correctly and happen to fit the true model, will it not have exactly the same average performance over infinite resamples as the oracle model, given that the SEs don’t matter when we’re only using point estimates?

Yes, sir. That’s why I was referring to a(n extremely) lucky scenario with a small number of true predictors of which every one has been measured, and where the sample at hand is so representative of the actual DGM that if we guess correctly and fit the true model, all point estimates land on their true (=oracle) values. The conjecture was that in this scenario, our model will be equal to the oracle model from a frequentist perspective because the point predictions will be identical and frequentists don’t care about uncertainty information. From a Bayesian perspective that same model will still underperform the oracle model, because its posterior will be more diffuse than the oracle’s set of point masses at the true parameter values.

EDIT: In fact the sample shouldn’t even have to be perfectly representative of the DGM. If we correctly guess the true model, then in terms of point predictions it should be just as good as the oracle model over infinite resamples. But in his previous post, Aki seems to be referring to estimation uncertainty as the reason why the true model won’t be as good.

Is the idea that the oracle model will have a separate fit to each hypothetical new dataset (e.g. can have estimation uncertainty), or is the idea that the oracle model doesn’t have to be fit because it invariably uses the true parameter values? If the latter, then afaik the oracle model cannot have any variance, only bias (due to lacking unmeasured covariates).

But if the oracle model does have to be fit to each sample and thus can have estimation uncertainty, then it follows that with small n, the oracle model will be different from the true model even when all true covariates have been measured, because the true model will have too much variance. On the other hand if the oracle model is just a set of perfectly precise parameter estimates, then whenever all true covariates have been measured, the oracle model is simply the DGM.

It seems to me that there’s a difference between “oracle DGM” (the best-predicting set of perfectly precise true parameter values that can be chosen using the available covariates) and “oracle model” (the best model formula that can be put together for samples of size n, using the available covariates), and I dunno which interpretation applies here. Arlot says the quality of a model selection procedure is measured by “excess loss”, (defined in Section 1.1 as something independent of particular samples), but his subsequent definition of oracle model is $S_{m^*}$, where $m^*$ is chosen to minimize $l(s,\hat{s}_m(D_n))$, and this looks specific to the sample at hand. (The formal definition of $D_n$ on p. 44 flies way over my head).

The oracle model was defined as the best model on average, but the inference for the parameters could still be made based on the data, and thus the parameters can have uncertainty. The issue is that the same finite data is used to fit the models and do the selection, and thus it is unlikely that the selection approach would always select the oracle model. The same inequality holds whether we use distributions and integrate over them or use point estimates.

Thanks Aki. I think I get it now, finally: it’s the best model, on average, that can be fit to a sample of this size with these covariates, coming from this DGM. It’s different from the true model IF either A) the sample is too small to estimate the true model properly, B) the sample is missing true predictors, or both.

If this is confirmed correct then I will proceed to declare the thread solved.

Yes! For A, it’s easy to make a simulation with e.g. 100 covariates, all of them having a non-zero beta of 1e-12, small n, noisy observations, and a wide prior, where it is likely that the oracle model would have zero covariates while the true model has 100 covariates.
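A scaled-down version of that simulation can be sketched as follows. To keep it short and free of dependencies it departs from the description above in ways that should be read as assumptions: one covariate instead of 100, least-squares point estimates instead of a Bayesian fit with a wide prior, and squared-error excess loss instead of elpd.

```python
import random


def simulate_tiny_beta(n=10, beta=1e-12, sigma=1.0, reps=5000, seed=1):
    """Average excess loss of the intercept-only model and of the model
    with the true structure, when the true beta is non-zero but tiny."""
    rng = random.Random(seed)
    total_null = 0.0
    total_true = 0.0
    for _ in range(reps):
        # Hypothetical DGM: x ~ N(0, 1), y = beta * x + N(0, sigma^2) noise.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        a = ybar - b * xbar
        # Excess squared-error loss on new data relative to beta * x
        # (closed form because x ~ N(0, 1)).
        total_null += ybar ** 2 + beta ** 2        # ignores the covariate
        total_true += a ** 2 + (b - beta) ** 2     # true structure, estimated slope
    return total_null / reps, total_true / reps


avg_null, avg_true_structure = simulate_tiny_beta()
print(f"intercept-only model:          {avg_null:.4f}")
print(f"model with the true covariate: {avg_true_structure:.4f}")
```

With these settings the intercept-only model has the lower average excess loss, so it would be the oracle model even though it is not the true model. The noise in the estimated slope costs far more than the bias from dropping a coefficient of 1e-12.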