What's up with knowledge-guided and physics-informed ML?

I did a PhD in math and have formal training in neither stats nor ML. I studied ML first, circa 2014, using models like random forest to make predictions and cross validation to evaluate and calibrate. In this approach I found it difficult if not impossible to incorporate things like measurement error, spatiotemporal structure, process knowledge, etc. All of these seemed important in the applications I cared about (e.g. monitoring/mapping environmental hazards or soil properties). That led me to the kinds of stats that we talk about on this forum, which have served me well since then.

Until recently my understanding of stats vs. ML models was a trade-off: stats could, among other things, incorporate a lot of domain knowledge, but at the cost of requiring more time and statistical expertise to develop problem-specific models, and those models tended not to scale well to large problems without even more work. ML could more quickly and easily provide generic models and could scale well to large problems.

More recently, it seems that AI/ML have taken notice of the importance of domain knowledge. I am hearing more and more about "knowledge-guided" and "physics-informed" ML (see e.g. here and here). So my question is: what do you all make of this and other trends in ML? Are the trade-offs I described above still accurate (if they ever were) and how might they change in the future?

P.S.

I acknowledge that the stats vs. ML thing is a false dichotomy, but I'm fairly sure this question still makes sense… Also, these trade-offs are not pre-determined; they are contingent on methodological and software developments and have been affected by things like approximate Gaussian processes, brms, PSIS-LOO, INLA, etc. I also acknowledge that at this point my knowledge is very biased toward stats, so I may not be representing the ML side fairly.

1 Like

"What's up with this?" seems like an overly broad question, but the main point is probably the one you mention: machine learning mostly consists of black-box methods designed to generate "predictions", built from inscrutable models that are flexible enough to fit whatever patterns are in the data, precisely because they lack system/domain-derived constraints. Much is said about how complex these models are, but in my opinion they are quite simple: they are unintelligible by design, and the lack of domain-specific structure is what makes them "fit well".

My take is that physics-informed/knowledge-guided/explainable AI (ML) is an effort both to make these models useful for tasks beyond generating synthetic data that looks like real data and to capitalize on the success (and the hype) of AI/ML. I think the trade-offs of this approach are just a result of this top-down view; the converse would be to start with a simple domain-specific model and add parameters to make it more flexible, so it fits more data more easily. The general goal would be meeting somewhere in the middle (except it's unlikely that middle would be the same in the bottom-up view).

I've seen a couple of articles and a few posts on physics-informed neural networks, including the review you mention, but I haven't seen a clear explanation of what the actual goal is, since fitting even a relatively complex PDE directly is not necessarily difficult or intractable if it's a "reasonable" model. I should probably spend more than five minutes looking at these loss functions (i.e. log-likelihoods), but at a glance it seems like they just replace the likelihood of the parameters given the data [P(\theta|D)] by that of the parameters given the NN output times the likelihood of the NN given the data [P(\theta|Y_{NN})P(Y_{NN}|D)].
I'm sure you can come up with reasons why this is done, but I'd like to hear a good explanation of what this achieves, and what the broader goal of introducing these kinds of models into physics is in the first place.
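For concreteness, this is roughly the kind of objective I mean, written as a toy sketch rather than anything from a specific paper: an ordinary data-misfit term plus a squared residual of the governing equation evaluated at "collocation" points. The ODE du/dt = -k u, the network size, and the equal weighting of the two terms are all made up for illustration.

```python
import torch

# Toy sketch of a "physics-informed" loss: fit u(t) to noisy observations
# while penalizing violation of du/dt = -k*u at collocation points.
# Everything here (ODE, network, weighting) is illustrative only.

torch.manual_seed(0)
k = 0.5

# Small fully connected network mapping t -> u(t)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

# Synthetic "observations" of the true solution u(t) = exp(-k t) plus noise
t_obs = torch.linspace(0, 4, 20).reshape(-1, 1)
u_obs = torch.exp(-k * t_obs) + 0.05 * torch.randn_like(t_obs)

# Collocation points where the ODE residual is evaluated (no data needed there)
t_col = torch.linspace(0, 4, 100).reshape(-1, 1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()

    # Data term: squared error, i.e. a Gaussian log-likelihood up to constants
    loss_data = torch.mean((net(t_obs) - u_obs) ** 2)

    # Physics term: squared residual of du/dt + k*u at the collocation points
    u_col = net(t_col)
    du_dt = torch.autograd.grad(u_col.sum(), t_col, create_graph=True)[0]
    loss_phys = torch.mean((du_dt + k * u_col) ** 2)

    # Relative weighting of the two terms is itself a modelling choice
    loss = loss_data + loss_phys
    loss.backward()
    opt.step()
```

Written this way, the physics term is just a penalty pushing the network toward functions that satisfy the equation, which is part of why I wonder what it buys over fitting the equation directly when the model is reasonable.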

In biology the approach seems to be a bit different: the structure of the network somehow reflects the system, so the weights say something about one component or another contributing more to the output. I'm more familiar with the goals in this field, where the pro is being able to extract some information from the system without knowing much about it, while the con is not actually being able to say much about the system's mechanisms, since the model parameters don't have a biological counterpart (e.g. circadian rhythms are well studied, and mathematical models can say something about the gene transcription rates and protein levels that modulate daily rhythms; it's probably easy to fit a neural network to this kind of data, but it will do nothing other than predict something we can measure directly).
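Something like the following is what I mean by the structure reflecting the system: connections are masked so each hidden unit corresponds to a known pathway and only receives the genes annotated to it. The gene/pathway layout and the mask here are invented purely for illustration, not taken from any particular method.

```python
import torch

# Toy sketch of a "knowledge-guided" network: each hidden unit is a pathway,
# and the mask keeps only gene->pathway connections that are known a priori.
# The annotation below is made up for illustration.

n_genes, n_pathways = 6, 2
# mask[i, j] = 1 if gene j is annotated to pathway i
mask = torch.tensor([
    [1, 1, 1, 0, 0, 0],   # "pathway A" uses genes 0-2
    [0, 0, 0, 1, 1, 1],   # "pathway B" uses genes 3-5
], dtype=torch.float32)

class PathwayLayer(torch.nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.mask = mask
        self.weight = torch.nn.Parameter(torch.randn(mask.shape) * 0.1)
        self.bias = torch.nn.Parameter(torch.zeros(mask.shape[0]))

    def forward(self, x):
        # zero out weights for gene-pathway pairs with no known link
        return torch.relu(x @ (self.weight * self.mask).T + self.bias)

model = torch.nn.Sequential(PathwayLayer(mask), torch.nn.Linear(n_pathways, 1))

# After fitting, (model[0].weight * mask) can be read edge-by-edge as
# "gene j's contribution to pathway i", which is about as far as the
# interpretation goes -- the weights still aren't transcription rates.
```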

Anyway, this is probably much longer than most people will be interested in reading, but it's an interesting discussion that may also benefit from multiple perspectives, so I'll leave it here.

1 Like

For what it's worth, there are techniques that mesh ML and traditional stats these days, so I don't know if there is such a hard trade-off 'these days'. (Edit: Which I see you already touch on, but oh well.)

Related to your random forest example, there's the application of decision trees to GLMMs and of random forests to traditional covariance-based SEM.
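The basic idea behind the tree/mixed-model hybrids can be sketched in a few lines: alternate between fitting a forest to the data with the current group effects subtracted and re-estimating a shrunken random intercept per group from the residuals. This is a toy illustration of that alternating scheme under made-up data, not the algorithm from any specific package or paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy sketch in the spirit of mixed-effects random forests:
# (1) fit a forest to y minus the current group effects,
# (2) re-estimate a random intercept per group from the residuals,
# with crude shrinkage toward zero. Data are simulated for illustration.

rng = np.random.default_rng(0)
n, n_groups = 500, 10
X = rng.normal(size=(n, 3))
group = rng.integers(0, n_groups, size=n)
true_b = rng.normal(scale=2.0, size=n_groups)          # true group intercepts
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + true_b[group] + rng.normal(scale=0.3, size=n)

b = np.zeros(n_groups)            # current estimate of the random intercepts
shrink = 5.0                      # crude shrinkage constant (a modelling choice)
for _ in range(10):
    forest = RandomForestRegressor(n_estimators=200, random_state=0)
    forest.fit(X, y - b[group])                       # fixed part given group effects
    resid = y - forest.predict(X)
    for g in range(n_groups):                         # group effects given fixed part
        r = resid[group == g]
        b[g] = r.sum() / (len(r) + shrink)            # shrunken group mean
```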

There's also work on trying to justify the joint use of ML and traditional stats, for which I'll shamelessly plug my own work from M3 2024. There's this great paper on incorporating ML in workflows. There's also just the practical use of decision trees for high-dimensional big data cases like multiverse analyses and Monte Carlo simulations.

In my opinion, it's not the use of prior information that matters so much, but rather the causal relationships amongst the variables. You can incorporate prior knowledge in random forests and decision trees too, by being choosy about how you set up the loss function, what predictors you include, and more. I think you can also sometimes incorporate prior knowledge when you decide that cross-validation isn't necessary, using the 'justification' that you're analyzing 'all the data'.
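As one small, concrete example of "being choosy": scikit-learn's gradient-boosted trees accept monotonicity constraints, which is a direct way to encode a piece of domain knowledge (e.g. "the response cannot decrease in this predictor") without writing down a full probability model. The data and feature roles below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Encode domain knowledge as a shape constraint: force the fitted function
# to be non-decreasing in feature 0, leave the others unconstrained.
# Simulated data, purely for illustration.

rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 3))
y = 2 * X[:, 0] - np.cos(3 * X[:, 1]) + rng.normal(scale=0.2, size=1000)

# monotonic_cst: +1 = increasing in that feature, 0 = unconstrained,
# -1 would mean decreasing
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0, 0]).fit(X, y)
```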

However, things like conditioning on a collider or mediator are not going to be solved just because you don't use ML techniques. I think it's just that traditional stats has a more principled justification from mathematical statistics and the philosophy of science throughout, whereas the justification for things like neural networks and SVMs in ML can intuitively seem a bit… unjustified, if anything. But incidentally, and evidently, they work!

So IMO demarcating traditional stats and ML on the basis of the incorporation of (prior) knowledge is no longer current.

Edit: What really matters is whether you choose predictive or causal modelling, which is more about what prior information you bring and what data you use. The particular technique you use 'merely' has different side-effects, justifications, guarantees, assumptions and uses.

3 Likes