Biological modeling with no data

This is more of a philosophical question that has puzzled me for a long time, and I would like to ask for your opinion and whether Stan and Bayesian inference could help.
In the molecular biology field it is very common to find published (and highly cited) papers presenting models that rely on no data whatsoever. This is also due to the fact that a lot of biological data is qualitative, and authors are reluctant to share their data.
Model parameters are obtained in one (or more) of the following ways:

  • reusing estimates from previous papers and plugging them into the current model, even if the formulas and causal connections are different (I suppose this accounts for some sort of “prior” in the Bayesian framework, but one that is never updated with real data)

  • making up parameter values that somehow make things work as expected (again, some sort of prior knowledge)

  • adjusting parameters by trying several values along a range and checking whether the model is “robust” to these changes, meaning that the outcomes do not vary too much (this is indeed very common, but I can’t find a rationale for it).

Then they usually go on to make predictions from these half-imaginary models and draw conclusions from them, and that’s all.

Do you think that this makes any sense in the Bayesian framework, or that I could harness the power of Stan or similar tools to improve the current (pathetic) situation?
As an example, you can check this review (about auxin transport in roots): http://dev.biologists.org/content/develop/140/11/2253.full.pdf?with-ds=yes. You will find many models like the one described; some of them are even Nature or Science publications.

thanks a lot

While I have little doubt that there are bad pure modelling papers, I think you are misunderstanding the purpose of those tools. Using a model with little or no data can in fact be very useful, and I am not certain the situation in this regard is “pathetic”. I would even say that biology might profit from wider use of formal models, even without data.

There are many sensible questions you can ask of a model without data, including, but not limited to:

  • Can this model explain the qualitative patterns we observe?
  • If the model can explain previous data, what further predictions does it make? How can we set up experiments that would most likely show disagreement with the model, if it is not correct?
  • If none of our models can explain the patterns, there have to be phenomena we are missing. Which of the model assumptions needs to be broken to explain the pattern? Can we check those assumptions experimentally?
  • If we have competing models, can they be distinguished by observing some of the behavior they produce? What would we need to measure to distinguish them?
  • Can we produce the observed patterns with a model that is simpler than the accepted mechanism?

There are famous examples of models doing exactly this:

  • The Lotka–Volterra predator-prey model explains why population sizes may fluctuate even in a stable environment (a minimal simulation sketch follows this list). That was a big deal, since a lot of people were trying to find the changes in environment that drove observed population fluctuations, but it turned out they were not necessary.
  • Or this model (although I don’t know the original source) that shows that segregated neighbourhoods can easily arise even when most people care very little about race.
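
To make the Lotka–Volterra example concrete, here is a minimal simulation sketch (Python with scipy); the parameter values and initial populations below are made up purely for illustration:

```python
# Minimal Lotka-Volterra sketch: parameters are illustrative, not estimated.
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, y, alpha, beta, delta, gamma):
    # d(prey)/dt = alpha*prey - beta*prey*pred
    # d(pred)/dt = delta*prey*pred - gamma*pred
    prey, pred = y
    return [alpha * prey - beta * prey * pred,
            delta * prey * pred - gamma * pred]

params = (1.0, 0.1, 0.05, 0.5)  # constant: the "environment" never changes
sol = solve_ivp(lotka_volterra, (0, 50), [10.0, 5.0], args=params,
                t_eval=np.linspace(0, 50, 500))

prey, pred = sol.y
# The populations keep cycling even though nothing external varies.
print(f"prey range: {prey.min():.1f} to {prey.max():.1f}")
print(f"predator range: {pred.min():.1f} to {pred.max():.1f}")
```

Nothing in the model changes over time, yet the two populations keep cycling; that is the qualitative point, and it needs no data to make.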

Note that the actual parameters of the model are not of interest; it is the qualitative behavior, and the understanding it generates, that matter. Here, measured quantities from elsewhere can be useful as ballpark estimates to make sure the parameters we test are at least roughly plausible. For the same reason, grid searches over parameters are interesting: if the behavior holds across a wide range, it doesn’t matter that we only have an order-of-magnitude estimate.
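
In the same spirit, here is a rough sketch of such a grid search, sweeping each Lotka–Volterra parameter over a several-fold range and checking whether sustained oscillations survive; the ranges and the oscillation criterion are ad hoc choices for illustration:

```python
# Grid search over several-fold parameter ranges; the ranges and the
# "more than 6 mean-crossings" oscillation criterion are illustrative choices.
import itertools
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, y, alpha, beta, delta, gamma):
    prey, pred = y
    return [alpha * prey - beta * prey * pred,
            delta * prey * pred - gamma * pred]

def oscillates(alpha, beta, delta, gamma):
    sol = solve_ivp(lotka_volterra, (0, 200), [10.0, 5.0],
                    args=(alpha, beta, delta, gamma),
                    t_eval=np.linspace(0, 200, 2000))
    prey = sol.y[0]
    # Count how often the prey trajectory crosses its own mean:
    # many crossings means sustained cycles, few means a flat or decaying line.
    crossings = np.sum(np.diff(np.sign(prey - prey.mean())) != 0)
    return crossings > 6

grid = itertools.product(np.geomspace(0.5, 2.0, 4),    # alpha
                         np.geomspace(0.05, 0.2, 4),   # beta
                         np.geomspace(0.025, 0.1, 4),  # delta
                         np.geomspace(0.25, 1.0, 4))   # gamma
results = [oscillates(*p) for p in grid]
print(f"oscillation in {sum(results)} of {len(results)} parameter combinations")
```

If the qualitative pattern holds across most of the grid, the precise parameter values matter little for the conclusion.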

It also does not make sense to make such models super precise, as that means more complexity, which hinders understanding. Since, except possibly for parts of physics, all models are strictly speaking false (e.g., the environment obviously influences population sizes, which Lotka-Volterra ignores), it is of interest to look for the “important” or “interesting” parts.

So as long as you are careful in interpreting them, I believe models of this kind have value.
Judging by the abstract, the paper you linked to seems sensible and useful to me (given that we care about auxin at all), and the authors seem to be the right amount of careful about their conclusions.

I think the Bayesian framework has limited use in qualitative modelling. But once you have at least some data, it can make sense to fit the models deemed of interest to those data, and then Bayes might be the way to go, especially if you don’t have a lot of data.


I was just reading this from Modeling and Medical Decision Making:

“Simulation is a computational method for exploring the implications of probabilistic modeling and variability in a wide variety of settings. It consists of generating artificial samples that are roughly consistent with specified probability distributions. The study of the samples is often much simpler than the study of the distributions originating those samples.”

“In realistic models random components are of high dimension, can be interconnected in complex ways, and may vary over time.”

“An efficient strategy is to set up an encompassing simulation-based framework for all computations that are required by the development, validation, and use of a decision model.”

Your answer has been very thorough, thank you very much.

I like your answer overall, but I think we should be a little more critical than you suggest of the bulk of the simulation literature. When we think of qualitative patterns, there is really a limited space of outcomes. In biology, aggregation and oscillation are two classical examples, and in a wide variety of systems different simple models have been used to suggest explanations. However, once you elaborate beyond very simple models, there are too many options for producing a given pattern mathematically. Most theoretical biologists are well aware of these options, and some papers read a little too much like post-hoc mathematical justifications rather than theoretical explanations that happen to produce the right patterns. I don’t know how serious the problem is across subdisciplines. In theoretical ecology it is bad enough that I think a different way of doing business is needed. If somebody wants to build a theoretical model, I’d prefer to see it built on a statistical model scaffold (data on basic known processes) rather than on pure theoretical approaches. In those contexts, Bayesian meta-analysis has a big role to play in setting up baseline models.


Hello,

Personally, I find that incorporating uncertainty into the ecological/global change literature would be a great opportunity.

First of all, well-established ecosystem models often miss very important and basic features of the real world. I think that Bayesian statistics may allow us to update parameters (measurement error models), or to sort out the principal processes. I think, but I might be wrong, that there are today enough data from all around the world, and sufficiently advanced computational techniques, to allow models to be validated statistically. Collin Prentice, in a very recent conference, argued approximately along those lines, though without evoking Bayesian statistics.

The second approach I would like to see more of is prior predictive checks. Why should we produce a straight line when we could have intervals? I have not played with any of the more complex ecosystem models out there, so I don’t know if it is doable. But I think that we could gain from allowing some freedom in the parameters by describing them with priors, and looking at the results probabilistically. It would certainly challenge our understanding to consider theoretical models as data-generating processes rather than completely free thought experiments. But of course, thought experiments are useful, and the first step is always to simply describe intuitions with equations.
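
For instance, a toy version of what such a prior predictive check could look like; the logistic growth model and the prior choices below are invented purely for illustration:

```python
# Toy prior predictive check: draw parameters from priors and look at the
# interval of trajectories they imply, instead of a single "best guess" line.
# The logistic growth model and all prior choices are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 20, 100)

def logistic_growth(t, r, K, n0=1.0):
    # Deterministic logistic growth curve with rate r and carrying capacity K.
    return K / (1 + (K / n0 - 1) * np.exp(-r * t))

# Priors encoding rough, order-of-magnitude knowledge about the parameters.
r_draws = rng.lognormal(mean=np.log(0.5), sigma=0.3, size=1000)    # growth rate
K_draws = rng.lognormal(mean=np.log(100.0), sigma=0.5, size=1000)  # carrying capacity

curves = np.array([logistic_growth(t, r, K) for r, K in zip(r_draws, K_draws)])

# Summarize with an interval rather than a single line.
lower, median, upper = np.percentile(curves, [5, 50, 95], axis=0)
print(f"population at t=20: {median[-1]:.0f} "
      f"(90% prior interval {lower[-1]:.0f} to {upper[-1]:.0f})")
```

The output is a band of plausible trajectories rather than one deterministic curve, which makes visible how little (or how much) the priors actually constrain the model.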

I am afraid that my answer is a little too centered on ecology, but I hope some elements are widely applicable.

Lucas


And I thought the question was about reverse engineering biological papers to obtain the unpublished data :-)


You are probably correct. I overreacted a bit to the tone of the original post and overplayed my arguments. Despite that, I find broad criticism of whole types of analyses mostly unhelpful and non-constructive. I strongly believe that criticism should primarily be aimed at a steel-man (as opposed to straw-man) version of the practice, or at particular cases.

As a personal experience, a group of biologists recently asked me to do a small simulation study for them. The problem was that a certain pathway was active only in some types of cells. Some people had tried to find regulatory mechanisms that shut down the pathway. The idea of my collaborators was that the cells where the mechanism is not active are larger while having a similar amount of the proteins participating in the mechanism. Maybe the inactivity is simply caused by different stoichiometry, where the diluted proteins just cannot cause any measurable activity? So I ran an oversimplified simulation, using measured kinetic constants for a related pathway and performing a grid search across the parameters we could not constrain this way. The result was that yes, stoichiometry could explain the observed pattern across a wide range of parameters. We further used the model to constrain the concentrations to be tried in an in-vitro system. And it worked: my colleagues found you can rescue the activity at roughly the middle of the range we chose based on the model. Now they are doing some more complex experiments to verify the finding in-vivo.

The simulation I did would check the boxes in the OP’s list of “pathetic” practices, yet I believe it was a really helpful tool in the discovery process.

I like this idea and haven’t thought about the problem this way before. I am not sure interpreting such models would be easier (a range of parameters is IMHO clearer than a distribution over parameters), but I would certainly be happy if someone elaborated on that and tried to make it work.
