I did a PhD in math and have formal training in neither stats nor ML. I studied ML first, circa 2014, using models like random forests to make predictions and cross-validation to evaluate and calibrate. With that approach I found it difficult, if not impossible, to incorporate things like measurement error, spatiotemporal structure, process knowledge, etc. All of these seemed important in the applications I cared about (e.g. monitoring/mapping environmental hazards or soil properties). That led me to the kinds of stats that we talk about on this forum, which have served me well since then.
Until recently my understanding of stats vs ML models was a trade-off: stats could, among other things, incorporate a lot of domain knowledge, but at the cost of requiring more time and statistical expertise to develop problem-specific models, and these models tended not to scale well to large problems without even more work. ML could more quickly/easily provide generic models and could scale well to large problems.
More recently, it seems that the AI/ML world has taken notice of the importance of domain knowledge. I am hearing more and more about "knowledge-guided" and "physics-informed" ML (see e.g. here and here). So my question is: what do you all make of this and other trends in ML? Are the trade-offs I described above still accurate (if they ever were), and how might they change in the future?
P.S.
I acknowledge that the stats vs. ML thing is a false dichotomy. But I'm fairly sure this question still makes sense... Also, these trade-offs are not pre-determined; they are contingent on methodological and software developments, and have been affected by things like approximate Gaussian processes, brms, PSIS-LOO, INLA, etc. I also acknowledge that at this point my knowledge is very biased toward stats, so I may not be representing the ML side fairly.
"What's up with this?" seems like an overly broad question, but the main point is probably the one you mention: Machine Learning is mostly black-box methods designed to generate "predictions", consisting of inscrutable models that are flexible enough to fit whatever patterns are in the data, precisely because they lack system/domain-derived constraints. Much is said about how complex these models are, but in my opinion they are quite simple; they are unintelligible by design, and the lack of domain-specific structure is what makes them "fit well".
My take is that physics-informed/knowledge-guided/explainable AI (ML) is an effort both to make these models useful for tasks beyond generating synthetic data that looks like real data, and to capitalize on the success (and on the hype) of AI/ML. I think the trade-offs of this approach are just a result of this top-down view; the converse would be to start with a simple domain-specific model and add parameters to make it more flexible so it fits more data more easily. The general goal would be to meet somewhere in the middle (except it's unlikely that the middle reached by this bottom-up route would be the same).
I've seen a couple of articles and a few posts on physics-informed neural networks, including the review you mention, but I haven't seen a clear explanation of what the actual goal is, since fitting even a relatively complex PDE directly is not necessarily difficult or intractable if it's a "reasonable" model. I should probably spend more than five minutes looking at these loss functions (i.e. log-likelihoods), but at a glance it seems like they just replace the likelihood of the parameters given the data, $P(\theta \mid D)$, by that of the parameters given the NN output times the likelihood of the NN given the data, $P(\theta \mid Y_{NN})\,P(Y_{NN} \mid D)$.
I'm sure you can come up with reasons why this is done, but I'd like to hear a good explanation of what this achieves, and what the broader goal is of introducing these kinds of models into physics in the first place.
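For concreteness, here is roughly the kind of loss I mean (a minimal sketch, assuming PyTorch, with a toy exponential-decay ODE and made-up constants, not any particular published PINN): the usual data misfit is augmented with a penalty on the ODE residual at collocation points, and the physical parameter is estimated jointly with the network weights.

```python
# Minimal physics-informed loss sketch (toy example, assuming PyTorch):
# the network u_theta(t) is fit to noisy data while an ODE residual
# penalty pushes it toward solutions of du/dt = -k*u.
import torch

torch.manual_seed(0)

# Toy "truth": exponential decay with k = 1.5, observed with noise
k_true = 1.5
t_obs = torch.rand(20, 1)
u_obs = torch.exp(-k_true * t_obs) + 0.05 * torch.randn(20, 1)

# Small fully connected network standing in for u(t)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
log_k = torch.zeros(1, requires_grad=True)  # unknown physical parameter

opt = torch.optim.Adam(list(net.parameters()) + [log_k], lr=1e-3)
t_col = torch.linspace(0, 1, 50).reshape(-1, 1)  # collocation points

for step in range(5000):
    opt.zero_grad()
    # Data term: misfit to the noisy observations
    data_loss = ((net(t_obs) - u_obs) ** 2).mean()
    # Physics term: penalize the ODE residual du/dt + k*u at collocation points
    t = t_col.clone().requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u, t, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    physics_loss = ((du_dt + torch.exp(log_k) * u) ** 2).mean()
    loss = data_loss + physics_loss
    loss.backward()
    opt.step()

print("estimated decay rate k:", torch.exp(log_k).item())
```

When the model is this simple you could of course just fit the ODE directly, which is exactly why I'm unsure what the NN buys you here.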
In biology the approach seems to be a bit different, where the structure of the network somehow reflects the system, so the weights say something about one component or another contributing more to the output. I'm more familiar with the goals in this field, where the pros are being able to extract some information from the system without knowing much about it, while the cons are not actually being able to say much about the system's mechanisms, since the model parameters don't have a biological counterpart (e.g. circadian rhythms are well studied and mathematical models can say something about the gene transcription rates and protein levels that modulate daily rhythms; it's probably easy to fit a neural network to this kind of data, but it will do nothing other than predict something we can measure directly).
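To make the circadian example concrete, here is a minimal sketch (assuming SciPy; the Goodwin-type feedback model is a standard textbook caricature, and the parameter values are illustrative, not calibrated to real data) of the kind of mechanistic model I mean, where every parameter has a biological reading:

```python
# Goodwin-type negative feedback loop: mRNA -> protein -> repressor -| mRNA.
# Each parameter is a rate a biologist could in principle measure or perturb,
# unlike the weights of a neural network fit to the same time series.
import numpy as np
from scipy.integrate import solve_ivp

def goodwin(t, state, v1, K, n, d1, v2, d2, v3, d3):
    x, y, z = state                         # mRNA, protein, nuclear repressor
    dx = v1 / (1 + (z / K) ** n) - d1 * x   # transcription with feedback repression
    dy = v2 * x - d2 * y                    # translation
    dz = v3 * y - d3 * z                    # nuclear import / activation
    return [dx, dy, dz]

# A Hill coefficient n > 8 is needed for sustained oscillations when the
# three degradation rates are equal; values here are purely illustrative.
params = (1.0, 1.0, 10, 0.1, 1.0, 0.1, 1.0, 0.1)
sol = solve_ivp(goodwin, (0, 300), [0.1, 0.1, 0.1], args=params,
                dense_output=True, max_step=0.5)

t_grid = np.linspace(0, 300, 600)
x, y, z = sol.sol(t_grid)
print("mRNA range over the run:", x.min(), "to", x.max())
```

A neural network fit to the same time series could reproduce the trajectory just as well, but its weights would say nothing about transcription or degradation rates.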
Anyway, this is probably much longer than most people will be interested in reading, but it's an interesting discussion that may also benefit from multiple perspectives, so I'll leave it here.
For what it's worth, there are techniques that mesh ML and traditional stats these days, so I don't know if there is such a hard trade-off "these days". (Edit: Which I see you already touch on, but oh well.)
In my opinion, it's not the use of prior information that matters so much, but rather the causal relationships amongst the variables. You can incorporate prior knowledge in random forests and decision trees too, by being choosy about how you set up the loss function, what predictors you include, and more. I think, sometimes, you also incorporate prior knowledge when you decide that cross-validation isn't necessary, using the "justification" that you're analyzing "all the data".
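For example, here is a minimal sketch (assuming scikit-learn >= 1.4, which as far as I know is when monotonic_cst was added to RandomForestRegressor, and an invented contamination-mapping setup) of prior knowledge entering through both an engineered predictor and a constraint:

```python
# Hypothetical example: domain knowledge enters a random forest through
# (a) a hand-built distance-to-source predictor and (b) a monotonicity
# constraint saying contamination never increases with that distance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Made-up covariates and response for illustration only
elevation = rng.uniform(0, 100, n)
dist_to_source = rng.uniform(0, 5, n)       # engineered from domain knowledge
y = 10 * np.exp(-dist_to_source) + 0.02 * elevation + rng.normal(0, 0.5, n)

X = np.column_stack([elevation, dist_to_source])

# monotonic_cst: 0 = unconstrained, -1 = prediction must not increase in
# that feature (here: the distance-to-source column).
rf = RandomForestRegressor(n_estimators=200, monotonic_cst=[0, -1],
                           random_state=0)
rf.fit(X, y)
print(rf.predict([[50.0, 0.5], [50.0, 3.0]]))  # nearer point should predict higher
```

The engineered distance predictor and the monotonicity constraint are both prior knowledge; they just enter through the fitting machinery rather than through a prior distribution.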
However, things like conditioning on a collider or mediator are not going to be solved just because you don't use ML techniques. I think it's just that traditional stats has a more principled justification from Math Stat and philosophy of science throughout, whereas the justification for things like neural networks and SVMs in ML can intuitively seem a bit... unjustified, if anything. But incidentally, and evidently, they work!
So IMO demarcating traditional stats and ML on the basis of the incorporation of (prior) knowledge is no longer current.
Edit: What really matters is whether you are doing predictive or causal modelling, which is more about what prior information you bring and what data you use. The particular technique you use "merely" has different side effects, justifications, guarantees, assumptions and uses.