Learning from small data?

I’m looking for examples, in any field, of papers in which the author learned something using a good model, Stan, and a small data set. I’m being deliberately vague about what “small” means, in the hope that people from different fields can point me to different examples. Can you point me in the right direction?



Very unclear what you are looking for, but maybe:


Sorry about that. Basically, I’m trying to find studies where the sample size is less than 100 but you can learn something useful by using the right methods. In the world I live in I hear a lot of “unless you have a huge sample, your study will be underpowered and there is nothing you can learn from a small sample.” My hope is to find examples in which that is not the case.

Thanks. Meta-analyses were the first thing I thought of, but I’m hoping to find studies with small samples that are not meta-analyses. For example, studies in fields where collecting an additional data point is really expensive.

Uhm… the point of the meta-analyses is to reduce the sample size needs of a future study. Any data point in a new study is expensive (a human is administered a drug). That’s why I pointed this out. Meta-analyses make the difference here: they let you reach good inferences by combining the existing data with the new data.
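To make the sample-size point concrete, here is a minimal sketch of how a prior derived from earlier studies can stand in for observations in a new one. It assumes a normal model for a treatment effect with a known observation standard deviation, and every number in it is hypothetical, chosen only for illustration. It is not taken from any particular meta-analysis.

```python
import math


def normal_posterior(prior_mean, prior_sd, ybar, sigma, n):
    """Conjugate normal update for a mean, with known observation sd `sigma`."""
    prior_prec = 1.0 / prior_sd**2   # precision contributed by the prior
    data_prec = n / sigma**2         # precision contributed by n new observations
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)


# Hypothetical numbers, for illustration only.
sigma = 10.0                      # assumed known sd of a single observation
prior_mean, prior_sd = 2.0, 2.0   # assumed summary of earlier studies
ybar, n = 4.0, 20                 # sample mean and size of the new, small study

# Nearly flat prior: the new study has to stand on its own.
flat_mean, flat_sd = normal_posterior(0.0, 1e6, ybar, sigma, n)

# Informative prior: earlier studies and the new data are combined.
informed_mean, informed_sd = normal_posterior(prior_mean, prior_sd, ybar, sigma, n)

# The prior carries as much information as this many observations.
prior_ess = sigma**2 / prior_sd**2

print(f"flat prior:     mean={flat_mean:.2f}, sd={flat_sd:.2f}")
print(f"informed prior: mean={informed_mean:.2f}, sd={informed_sd:.2f}")
print(f"prior effective sample size: {prior_ess:.0f}")
```

Under these assumptions the informative prior narrows the posterior as if extra patients had been enrolled. How much historical data should really count toward a new study is a modeling decision that this sketch does not address.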

Yes, I’m just hoping to find examples other than meta-analysis.


A lot of the time in medicine, collecting another data point is expensive or even impossible. For example, if we’re trying to diagnose Alzheimer’s in a patient, it would be really useful to get as many MRIs of the patient’s brain over time as we can. Unfortunately, MRIs are very expensive to run. In some cases, such as when a patient has a pacemaker, we can’t even do an MRI!

So in that case we either need to

  1. pool together information across patients and be explicit in how we’re modeling the pooling relationship
  2. incorporate biological knowledge about the system/data we’re modeling

Stan and Bayesian modeling are really good at doing these things. In my old StanCon notebook I used the first approach.
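As a rough illustration of the first approach (pooling across patients), here is a minimal sketch of partial pooling in a normal hierarchical model. It is not the model from the notebook. It assumes the within-patient sd `sigma` and the between-patient sd `tau` are known, whereas in Stan you would put priors on them and estimate them. The patient means and scan counts are made up.

```python
def partial_pool(group_means, group_sizes, sigma, tau):
    """Shrink each group's mean toward a shared population mean.

    sigma: assumed known within-group sd
    tau:   assumed known between-group sd
    """
    # Each observed group mean has marginal variance sigma^2 / n_j + tau^2.
    weights = [1.0 / (sigma**2 / n + tau**2) for n in group_sizes]
    mu = sum(w * y for w, y in zip(weights, group_means)) / sum(weights)

    estimates = []
    for y, n in zip(group_means, group_sizes):
        # Fraction of the estimate that comes from the group's own data.
        own_weight = (n / sigma**2) / (n / sigma**2 + 1.0 / tau**2)
        estimates.append(own_weight * y + (1.0 - own_weight) * mu)
    return mu, estimates


# Hypothetical data: three patients with 10, 5, and 2 measurements each.
patient_means = [1.0, 3.0, 8.0]
patient_sizes = [10, 5, 2]

mu, pooled = partial_pool(patient_means, patient_sizes, sigma=4.0, tau=2.0)

for raw, n, est in zip(patient_means, patient_sizes, pooled):
    print(f"n={n:2d}  raw mean={raw:.2f}  partially pooled={est:.2f}")
```

The patient with only two measurements is pulled furthest toward the population mean, which is the sense in which the other patients lend information to the one with the least data.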

Another example is in traumatic injury. A lot of the time, when a patient comes into the hospital after being injured in a car accident, we may not have time to take a blood sample because we have to act quickly, or we can’t take samples as often as we’d like, or we can’t run all the tests we want because they’re slow or expensive. I work with trauma data and had another recent StanCon notebook where I used the second approach (link here).


Thanks @arya, this is exactly the type of research that I was trying to find. In addition to the notebooks, is there any published paper that you can point me to?

Hm, not that I can think of off the top of my head, but you can cite Stan case studies.

I’ll also say that for clinical trials, data is expensive to get because running a clinical trial is expensive.

In https://www.ncbi.nlm.nih.gov/pubmed/28376897 and https://elifesciences.org/articles/35213 we built a complex model of malaria propagation through mice and mosquito populations that was fit to around 100 total observations and generated precise inferences about different vaccine efficacies.

Power is important for calibrating discovery claims, and more replicate observations mean higher power. In practice, however, additional data are not pure replications and instead bring with them systematic differences that must be modeled lest your inferences be biased. But then increasing the model complexity…reduces your power and you end up in this weird loop. Ultimately you have to accept the limitations of your experiment and hone your questions to those that can reasonably be answered with the available data.
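For the idealized case of pure replicates, the link between sample size and power can be sketched with a textbook calculation. This assumes a two-sided one-sample z-test with known variance and a hypothetical standardized effect size. It deliberately ignores the systematic differences discussed above, so treat it as a best case.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size: true mean divided by the (assumed known) observation sd
    n:           number of independent replicate observations
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * n**0.5
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical standardized effect of 0.5 at several sample sizes.
for n in (8, 16, 32, 64):
    print(f"n={n:3d}  power={z_test_power(0.5, n):.2f}")
```

Under these assumptions a moderately large effect can be detected reliably with a few dozen observations, while a small effect cannot. That is one way of deciding which questions a given small data set can reasonably answer.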