Quantifying a reduction in prior uncertainty over several experiments

llewmills · September 17, 2019, 1:34am

I am interested in how to quantify reductions in uncertainty about the size of an experimental effect over a series of studies which, for hypothetical reasons, preclude the merging of data. I would like to use the posterior parameter estimates from each study to inform the priors of the next study. The model is a simple Bayesian regression testing the difference between two groups at a single time point. The model equation is

\begin{align*}
y_i = \alpha + \beta_{B}x_{Bi} + \varepsilon_i
\end{align*}

where y_i is the score on the outcome for participant i, \alpha is the intercept (the score in Group A), \beta_{B} is the difference in score between Group B and Group A, x_{Bi} is a binary {0,1} indicator of whether participant i was in Group B, and \varepsilon_i is the error term.
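To make the roles of \alpha and \beta_B concrete, here is a minimal sketch that simulates one two-group study and recovers both parameters by ordinary least squares. It is not the Bayesian fit used in the thread, and every number in it (sample size, true values, residual sd) is an assumption chosen only for illustration.

```r
# Simulate one two-group study and recover alpha and beta_B by least squares.
# All numbers here are illustrative assumptions, not the study data.
set.seed(1)
n_per_group <- 50
alpha_true  <- 20   # hypothetical mean score in Group A
beta_true   <- 7    # hypothetical Group B minus Group A difference
sigma_true  <- 11   # hypothetical residual standard deviation

x_B <- rep(c(0, 1), each = n_per_group)   # 0 = Group A, 1 = Group B
y   <- alpha_true + beta_true * x_B + rnorm(2 * n_per_group, 0, sigma_true)

fit <- lm(y ~ x_B)
coef(fit)   # "(Intercept)" estimates alpha, "x_B" estimates beta_B

# With a binary predictor, the slope equals the difference in group means.
diff_in_means <- mean(y[x_B == 1]) - mean(y[x_B == 0])
```

The slope on `x_B` and `diff_in_means` agree, which is why \beta_B can be read directly as the Group B minus Group A difference.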

This model was repeated across three studies. For this example I’ll focus on the \beta_B parameter – the difference in score between Group A and Group B – but the same approach could be applied to the \alpha parameter as well.

Quantifying Incremental Reductions in Uncertainty

Study 1

The prior on \beta_{B} in the first study was straightforward. Using a strategy employed by Kruschke, I used the standard deviation of the outcome y_i to scale the prior distribution on \beta_B. The prior for \beta_B was a normal distribution centred on 0 (i.e. no difference between Group A and Group B), with a standard deviation five times the standard deviation of post-beverage withdrawal scores. From here on I’ll refer to the number by which the standard deviation is multiplied before it enters the prior as the multiplication factor. I chose 5 as the multiplication factor because I knew it would generate a very wide prior distribution. The standard deviation of scores in Study 1 was 12.5, so the standard deviation for the prior distribution on \beta_B was 5 x 12.5 = 62.5: a very wide, weakly regularising prior, reflecting the absence of any prior knowledge about the Group A/Group B difference.

Study 2

Thanks to the results of Study 1, in Study 2 I knew a little bit more about what Group A/Group B difference to expect. I used the Study 1 modal estimates for \beta_{B} (6.8) and for the standard deviation of \varepsilon (10.7) to set the prior distribution for \beta_{B} in Study 2. The centre of the prior distribution was easy: I simply centred it on 6.8. The tricky part, and the subject of this post, is what to multiply the estimated standard deviation from Study 1 by to specify the spread of the \beta_{B} prior distribution in Study 2. I created a function to generate the multiplication factors for the spread of priors on \beta_B across the three experiments…

sapply(0:2, function (i) 5*1/5^(i))
[1] 5.0 1.0 0.2

…where each new value in the vector is one fifth of the previous value. The function dictated that the second multiplication factor was 5 x 1/5^1 = 1. Therefore the standard deviation of the prior distribution on \beta_B in Study 2 would be 1 x the estimated standard deviation from Study 1 = 1 x 10.7 = 10.7. This prior is still quite wide, but is much less so than the prior on \beta_B in Study 1.

Study 3

In Study 3, I used the same approach as for Study 2: the normal prior distribution on \beta_B in Study 3 is centred on the modal estimate of \beta_B in Study 2 (7.3), and the spread is determined partly by the modal estimated standard deviation of y_i in Study 2 (14.2) and partly by the function for the multiplication factor. The multiplication factor in this, the third iteration of the model, was 5 x 1/5^2 = 0.2. Therefore the standard deviation of the prior distribution on \beta_B in Study 3 is 14.2 x 0.2 = 2.8. As you can see in the bottom panel of the figure, this results in a much more precise prior, reflecting the further reduction in uncertainty about the Group A/Group B difference in what is now the third iteration of the experiment.

Here is the figure with the prior and posterior distributions for the \beta_B parameter across all three studies. The green density function is the prior and the pink is the posterior. The prior distribution on \beta_B, almost invisible in Study 1, becomes narrower over time, reflecting a reduction in uncertainty about the Group A/Group B difference.
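The prior settings for the three studies can be reproduced from the multiplication-factor function and the standard deviations quoted in the post. This is a minimal sketch; the variable names are mine, and it simply applies the function as written (start at 5, divide by 5 each study).

```r
# Multiplication factors: start at 5, each one fifth of the previous.
mult_factor <- sapply(0:2, function(i) 5 * 1 / 5^i)   # 5, 1, 0.2

# Standard deviations quoted in the post: Study 1 uses the raw sd of the
# outcome; Studies 2 and 3 use the modal sd estimate from the preceding study.
scale_sd <- c(study1 = 12.5, study2 = 10.7, study3 = 14.2)

# Prior means for beta_B: 0 for Study 1, then the modal estimate of beta_B
# from the preceding study.
prior_mean <- c(study1 = 0, study2 = 6.8, study3 = 7.3)

prior_sd <- mult_factor * scale_sd
data.frame(prior_mean, mult_factor, prior_sd)
```

Laid out this way, it is easy to see that the narrowing is driven almost entirely by the chosen schedule of factors rather than by anything the earlier data say about the precision of \beta_B.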

My question relates to the function that dictates the reduction in the multiplication factor across studies, which in effect quantifies the reduction in uncertainty. The first function resulted in multiplication factors of 5, 1, 0.2. A function with a steeper reduction in uncertainty (each new iteration one tenth of the previous, starting at 5)…

sapply(0:2, function (i) 5*1/10^(i))
[1] 5.00 0.50 0.05

…results in the following set of estimates.

This function results in a much narrower parameter estimate for \beta_B by the time we get to Study 3, and the prior has ‘dragged’ the modal posterior estimate downwards (compared to the Study 3 panel in the first figure). So this seems too certain.

So how do I quantify a reduction in uncertainty mathematically, in an ethical way?

And are there any functions that are better or more sensible than the two I devised?

martinmodrak · September 19, 2019, 10:28am

Since no one else answered, I will give it a try.

Maybe I am missing something, but what would be problematic with making the prior sd of \beta_B be the sd of the posterior for \beta_B from the previous study? In a more general language, you are trying to approximate the actual posterior density (given via the samples from Stan) with an analytic density (normal in this case). This would IMHO correspond to normal Bayesian updating which has good theoretical claims to being the best you can do.
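A minimal sketch of that suggestion: take the posterior draws of \beta_B, match a normal distribution to their mean and sd, and use those two numbers as the next study's prior. The draws below are simulated stand-ins (the sd of 2.5 is an arbitrary assumption); in practice they would be extracted from the fitted Stan model.

```r
# Stand-in for posterior draws of beta_B. In practice these would come from
# the fitted model; here they are simulated purely for illustration.
set.seed(2)
beta_draws <- rnorm(4000, mean = 6.8, sd = 2.5)   # hypothetical draws

# "Fit" a normal distribution by matching the mean and sd of the draws.
next_prior_mean <- mean(beta_draws)
next_prior_sd   <- sd(beta_draws)

# Rough check of the approximation: compare empirical quantiles of the draws
# with the quantiles of the fitted normal.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
rbind(draws  = quantile(beta_draws, probs),
      normal = qnorm(probs, next_prior_mean, next_prior_sd))
```

If the two rows of quantiles disagree noticeably (for example with a skewed posterior), a normal prior is probably a poor summary and another family may be worth considering.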

Hope that helps!

llewmills · September 19, 2019, 10:32pm

Thank you for replying @martinmodrak. So you mean use the sd of the posterior parameter estimates for \beta_B rather than the estimate of the standard deviation of the outcome multiplied by the multiplication factor? Interesting. Is this a common approach? And are you aware of any studies where they have taken this approach?

llewmills · September 19, 2019, 11:10pm

And @martinmodrak does this method lead to more precise (i.e. narrower) parameter estimates with repeated studies past the second, or do you get a sharp drop in variance of the prior for parameters from study 1 to study 2, then reach asymptote, with additional studies after not yielding much tighter estimates? Does that make sense?

martinmodrak · September 20, 2019, 12:31pm

Yes :-) But the point is that you are “fitting” an analytic distribution to the samples from the posterior (for normal distribution, taking the empirical mean and sd of the samples just happens to be a good way to “fit” the distribution)

This approach is basically just the Bayes rule, i.e. when we have D as data and \theta as parameters with prior P(\theta) and likelihood P(D | \theta) Bayes theorem says:

P(\theta | D) \propto P(D | \theta) P(\theta)

For most models, the data can be split into independent observations, i.e.:

P(D | \theta) = P(D_1 | \theta) P(D_2 | \theta) ... P(D_N | \theta)

Let’s say we split the data at data point 1 < k < N
So we can substitute:

\begin{align*}
P(\theta | D) &\propto P(D | \theta) P(\theta) \\
&= \left[ P(D_{k+1} | \theta) ... P(D_N | \theta) \right] \left[ P(D_1 | \theta) ... P(D_k | \theta) \right] P(\theta) \\
&\propto \left[ P(D_{k+1} | \theta) ... P(D_N | \theta) \right] P(\theta | D_{1..k})
\end{align*}

So, here the posterior after seeing the first k data points (P(\theta | D_{1..k})) takes the role of the prior for further observations. In other words, if you plugged the exact posterior after seeing the first k data points in as the prior for the other N-k data points, your final posterior would be exactly equal to the posterior you would get if you used all the data in a single model. Since you only get samples from the posterior and then approximate it with a normal distribution, the process will introduce some error.
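The equivalence can be checked numerically in a case where the posterior is available exactly. The sketch below uses a deliberately simplified setting, a normal mean with known observation sd, which is not the regression model from the thread; the data, the split point and the function name `update_normal` are all assumptions for illustration.

```r
# Conjugate update for a normal mean with known observation sd `sigma`.
# Prior: theta ~ Normal(m0, s0). Data: y_i ~ Normal(theta, sigma).
update_normal <- function(m0, s0, y, sigma) {
  prior_prec <- 1 / s0^2
  data_prec  <- length(y) / sigma^2
  post_prec  <- prior_prec + data_prec
  post_mean  <- (prior_prec * m0 + data_prec * mean(y)) / post_prec
  c(mean = post_mean, sd = sqrt(1 / post_prec))
}

set.seed(3)
sigma <- 11                          # assumed known, for illustration only
y <- rnorm(60, mean = 7, sd = sigma) # simulated data
k <- 25                              # split point

batch <- update_normal(0, 62.5, y, sigma)        # all data at once
step1 <- update_normal(0, 62.5, y[1:k], sigma)   # first k points only
step2 <- update_normal(step1[["mean"]], step1[["sd"]], y[(k + 1):60], sigma)

all.equal(batch, step2)   # compares the batch and sequential posteriors
```

Here the intermediate posterior is exactly normal, so nothing is lost in passing it on. With posterior draws from a sampler, the normal approximation is where the error mentioned above comes in.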

Is this common? I think it is commonly mentioned in theoretical discussions about Bayesian statistics, but not frequently employed in practice as it is usually much easier to refit the model with all the data at once instead of updating the previous posterior.

As for whether the estimates keep getting narrower with each additional study: I think this would totally depend on the model and data. For some the full Bayesian update will give narrower estimates, for some it will give wider.

Hope that helps :-)

maxbiostat · September 20, 2019, 2:14pm

I think the OP was asking about the “fitting” of a Gaussian to the posterior samples. I don’t know how common it is, but if you were to show me a Gaussian-looking histogram/KDE of your posterior and the normal fit you got, I’m more than willing to buy your results.

I think Andrew Gelman had some sort of measure in mind that would quantify the reduction in uncertainty that had something to do with comparing standard deviations. @andrewgelman, wanna chime in?

andrewgelman · September 20, 2019, 2:54pm

My quick response to the above discussion is that I think it makes sense to fit all these data using a hierarchical model rather than think of the fitting and priors process sequentially. For computational reasons it could make sense to break up a big problem into parts–but assuming that computation is not the bottleneck here, I’d recommend a hierarchical model.

Also, the comparison of standard deviations that I’ve discussed is here: https://statmodeling.stat.columbia.edu/2019/08/10/for-each-parameter-or-other-qoi-compare-the-posterior-sd-to-the-prior-sd-if-the-posterior-sd-for-any-parameter-or-qoi-is-more-than-0-1-times-the-prior-sd-then-print-out-a-note-the-prior-dist/
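A rough sketch of that comparison, following the rule as worded in the link (posterior sd more than 0.1 times the prior sd). The function name and the example values are hypothetical, not results from this thread.

```r
# Flag parameters whose posterior sd is more than `threshold` times the
# prior sd, i.e. where the prior is contributing noticeably to the posterior.
flag_informative_prior <- function(prior_sd, posterior_sd, threshold = 0.1) {
  ratio <- posterior_sd / prior_sd
  data.frame(prior_sd, posterior_sd, ratio, flagged = ratio > threshold)
}

# Hypothetical values, purely for illustration.
res <- flag_informative_prior(prior_sd = c(62.5, 10.7), posterior_sd = c(3, 3))
res
```

With these made-up numbers the wide prior is not flagged and the narrower prior is, which is the kind of signal that could be tracked across a sequence of studies.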

llewmills · September 26, 2019, 8:56pm

@andrewgelman there are (to my mind at least) good reasons why I didn’t pool all the data and fit as a hierarchical model and instead used updating.

First, although the three studies tested the same manipulation (the effect on caffeine withdrawal, after receiving a cup of decaf, of being told one had been given caffeine vs being told one had been given decaf) using the same outcome measure, this all happened within different experiments with different designs (e.g. in experiment 1 the manipulation was one of two performed on participants, whereas in experiment 3 the caffeine information manipulation was the only factor), different samples, and different stated purposes (these were experiments on a placebo caffeine withdrawal-reduction effect, so there were cover stories given to participants to disguise the true purpose of the experiment). It just didn’t seem appropriate to pool the data.

Second, even if it was appropriate I kind of just wanted to do it as a theoretical exercise. Kruschke says several times in his book that ‘yesterday’s posterior is tomorrow’s prior’ and also recommends using past research to inform priors on current research. I guess part of what motivated this updating approach is the question ‘what if we were performing a replication or near-replication of an experiment run by someone else at another time in another lab, and they had lost their data, and all the information we had about the study to inform priors on our parameters was the sort of information you get in a paper, i.e. modal posterior parameter estimates of group mean, HPDI, standard deviation etc? Could we use that information to generate meaningful priors?’.

It seems to me that this is the strength of the Bayesian approach, that there is a formal way to integrate past research into the analyses of current data. I guess I was just looking for a formal way to quantify each iteration of knowledge increase via uncertainty decrease.

andrewgelman · September 26, 2019, 9:18pm

You write: “It just didn’t seem appropriate to pool the data.”

That’s right, you don’t completely pool, you partially pool. That’s what hierarchical modeling does.

You write, “this is the strength of the Bayesian approach, that there is a formal way to integrate past research into the analyses of current data.”

Yes, exactly. Hierarchical modeling allows you to do this, accounting for the differences between different problems being studied. See chapter 5 of BDA for further discussion of this point.
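As a rough illustration of what partial pooling does, here is a sketch of the normal hierarchical model with the between-study sd held fixed, working only from study-level estimates and standard errors. The first two estimates are the modal values quoted earlier in the thread; the third estimate, all the standard errors and the value of `tau` are made up. In a full hierarchical model the common mean and `tau` would get priors and be estimated along with everything else.

```r
est <- c(6.8, 7.3, 5.5)   # study-level estimates of beta_B (third is made up)
se  <- c(2.5, 2.0, 1.5)   # made-up standard errors
tau <- 2                  # assumed between-study sd, fixed for simplicity

# Precision-weighted estimate of the common mean mu, given tau.
w      <- 1 / (se^2 + tau^2)
mu_hat <- sum(w * est) / sum(w)

# Conditional posterior mean of each study's effect given mu_hat and tau:
# a precision-weighted average of the study's own estimate and mu_hat.
shrunk <- (est / se^2 + mu_hat / tau^2) / (1 / se^2 + 1 / tau^2)

round(cbind(est, shrunk), 2)
```

Each study keeps its own effect, but noisier studies are pulled further towards the common mean. A very large `tau` approaches no pooling and a `tau` near zero approaches complete pooling.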

llewmills · September 26, 2019, 9:24pm

Thank you @andrewgelman I didn’t realise that those were the implications of hierarchical modelling.

But what about my second point, if all you had to inform your priors was summary data (mode, HPDI, sd, sd of estimates) not actual data? Could you use that to generate meaningful priors?

Also what is BDA? Bayesian Data Analysis? And by whom?

andrewgelman · September 26, 2019, 9:28pm

We have such examples in chapter 5.

llewmills · September 26, 2019, 9:40pm

Thank you @andrewgelman I will get the book and have a gander.
