Population as a point estimate or distribution?

Hi Guys,

There is something very basic in Bayesian stats that I haven’t been able to settle, and I’d really love your help here.
When you work under a Bayesian framework, do you assume the parameters of the latent data generating model have some fixed point values we are trying to estimate?
Or, alternatively, do you see the latent data generating model as having distributions over its parameters?

It was always easier for me to think of trying to estimate latent point values, but I see that, for example, when doing power analysis in a Bayesian framework, Kruschke in his book (DBDA) actually generates a ‘true’ data generating model using parameter distributions rather than point estimates.

I could really use your help in trying to figure out what the mainstream way of thinking is here.

Thank you,
Nitzan

@betanalpha

Everyone, both Bayesian and frequentist, assumes there is a true parameter value. For instance, suppose we’re estimating the gravitational constant based on some experiments rolling balls down ramps. There’s only one gravitational constant. But we, as humans, aren’t sure what it is. We represent our uncertainty mathematically using probability theory.

The upshot is that probability in Bayesian stats can be viewed epistemically in the sense that it represents our uncertainty in the true value, not that the true value is in some sense probabilistic. For example, suppose I flip a coin. It’s going to land either heads or tails, but I don’t know which, so I express my uncertainty as 50% heads and 50% tails.

Because there’s only one true value, the frequentists are philosophically unwilling to use probability to account for uncertainty, because they view probabilities as long-term frequencies. Instead, they concentrate on the uncertainty of estimators (and in particular on tail values to calculate p-values) rather than the uncertainty of the values being estimated.

I like this quote from Laplace about what has come to be known as “Laplace’s demon” (Pierre-Simon Laplace. 1814. A Philosophical Essay on Probabilities. English translation of the 6th edition, Truscott, F.W. and Emory, F.L. 1951. Dover Publications. page 4.)

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.

And this one from John Stuart Mill (Mill, John Stuart. 1882. A System of Logic: Ratiocinative and Inductive. Eighth edition. Harper & Brothers, Publishers, New York. Part III, Chapter 18):

We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. . . . Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

8 Likes

I don’t think it matters. Like Spiegelhalter, I think probability is just a mathematical tool and does not have a realist ontology. (A “long run” – which is really a possible-worlds argument, I think – is different, and you do find frequentists arguing in realist terms.) It’s a philosophical question as I see it, because if the unknown value is no more than a distribution, then that distribution will have moments, so we can talk about it whichever way we like. But fundamentally I don’t think everything can be reduced to an eternal value like the gravitational constant, e.g. complex systems, unless we choose to frame it that way.

2 Likes

@nitzan_shahar, I think your confusion comes from a subtlety of what Kruschke is doing when he talks about power analysis. In frequentist power analysis it is typical to ask “what sample size would we need in order to reject the null with probability at least p given some true parameter value \hat{\theta}?” In principle, we can structure a Bayesian power analysis in an analogous way; for example, “what sample size would we need in order to construct a credible interval that excludes zero with probability at least p given some true parameter value \hat{\theta}?”

But Bayesian inference provides such a neat way of integrating over uncertainty that we can also ask a (perhaps more relevant) question: “what sample size would we need in order to construct a credible interval that excludes zero with probability at least p given what we know about the system?” To figure this out, we can integrate our power analysis over the full (informative) prior distribution for \theta. This does not amount to assuming that there is no true value; it just recognizes that we are uncertain about what the true value is.

As an aside, note that the Bayesian approach based on choosing \hat{\theta} feels a little bit awkward because we are effectively defining a region of practical equivalence (ROPE) via our choice of \hat{\theta}, but then we’re asking about the probability of obtaining a credible interval that excludes zero, not a credible interval that excludes the ROPE. With a Bayesian power analysis, we can just as well ask “what sample size would we need in order to exclude some ROPE around zero with probability at least p given what we know about the system?” Note that it’s entirely possible that no sample size will exclude a given ROPE with high prior probability if the ROPE covers too much of the prior probability mass. Kruschke would probably argue that this is worth knowing ahead of time before you go to the trouble/expense of collecting a bunch of data, and that this is well within the purview of a Bayesian power analysis.
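
To make that last question concrete, here is a minimal simulation sketch. It assumes a deliberately simple conjugate model (a normal mean with known noise standard deviation) rather than anything specific from Kruschke’s book, and the prior, ROPE, and sample sizes are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

prior_mu, prior_sd = 0.3, 0.2   # informative prior for theta (illustrative numbers)
noise_sd = 1.0                  # known observation noise (illustrative)
rope = (-0.05, 0.05)            # region of practical equivalence (illustrative)

def posterior(y):
    """Conjugate normal posterior for the mean, given data y."""
    n = len(y)
    post_var = 1.0 / (1.0 / prior_sd**2 + n / noise_sd**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + y.sum() / noise_sd**2)
    return post_mean, np.sqrt(post_var)

def prob_interval_clears_rope(n, n_sims=2000):
    """P(95% credible interval lies entirely outside the ROPE),
    integrating the 'true' theta over the prior."""
    hits = 0
    for _ in range(n_sims):
        theta = rng.normal(prior_mu, prior_sd)    # draw a 'true' value from the prior
        y = rng.normal(theta, noise_sd, size=n)   # simulate one study of size n
        m, s = posterior(y)
        lo, hi = m - 1.96 * s, m + 1.96 * s
        hits += (lo > rope[1]) or (hi < rope[0])  # interval clear of the ROPE
    return hits / n_sims

for n in (10, 50, 200):
    print(n, prob_interval_clears_rope(n))
```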

3 Likes

Many thanks to all of you for the great clarifications - indeed very helpful. There is still something I don’t get, and power (or precision) analysis seems to flush it out for me.

When Kruschke uses parameter distributions as a data generating model, I thought that the deeper argument here might be that the “true” data generating model always has some irreducible stochasticity. As if the world has some inherent uncertainty or noise, which is not just coming from our measurement. I thought that was what Kruschke was implying (yet he never says that directly - it’s my own interpretation)…

“what sample size would we need in order to construct a credible interval that excludes zero with probability at least p given what we know about the system?”

@jsocolar if I follow your reasoning, practically, Kruschke is generating data by sampling parameters from a distribution, which reflects “what we know about the world”. Then he runs N studies to estimate some precision (e.g., the % of studies generating an HDI smaller than some required value). Yet, these simulated studies also use priors, which integrate what they assume about the world. So in the end it looks to me like we are counting the uncertainty twice…

Now that I understand that most do assume a fixed latent point value, should we not consider any uncertainty in the generative model as reflecting inherent noise in the true parameters’ values?

Thanks again for the discussion.
Nitzan

Think about it this way. I am going to apply Bayesian analysis to a statistical model using some prior P_1. What sort of results should I expect to see, as a function of sample size, when I do that analysis? The answer depends on what the true value of the parameter is, of course. But I’m uncertain about the true value of the parameter. I can express my uncertainty about the parameter via a distribution P_2, which may or may not be the same distribution as P_1 (perhaps I am intentionally using a vaguer prior P_1 in analysis than my true beliefs P_2 about the system would imply). So I can integrate over P_2 to figure out, for a given sample size, what I should expect the universe of possible results of my analysis to look like. This is the question that Kruschke feels a Bayesian power analysis ought to be answering.
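
A minimal sketch of that P_1 versus P_2 separation, again assuming a toy conjugate-normal model with known noise standard deviation (none of this is Kruschke’s actual code, and the numbers are made up): the ‘true’ values are drawn from my honest beliefs P_2, while each simulated analysis uses the vaguer prior P_1.

```python
import numpy as np

rng = np.random.default_rng(2)

noise_sd = 1.0                 # known observation noise (illustrative)
p2_mu, p2_sd = 0.3, 0.1        # P_2: what I actually believe about theta
p1_mu, p1_sd = 0.0, 1.0        # P_1: the vaguer prior used inside the analysis

def prob_interval_excludes_zero(n, n_sims=2000):
    """Fraction of simulated studies whose 95% interval under P_1 excludes zero,
    with the 'true' theta drawn from P_2."""
    hits = 0
    for _ in range(n_sims):
        theta = rng.normal(p2_mu, p2_sd)           # truth drawn from P_2
        y = rng.normal(theta, noise_sd, size=n)    # one simulated study of size n
        post_var = 1.0 / (1.0 / p1_sd**2 + n / noise_sd**2)  # conjugate update under P_1
        post_mean = post_var * (p1_mu / p1_sd**2 + y.sum() / noise_sd**2)
        lo = post_mean - 1.96 * np.sqrt(post_var)
        hi = post_mean + 1.96 * np.sqrt(post_var)
        hits += (lo > 0) or (hi < 0)
    return hits / n_sims

for n in (10, 50, 200):
    print(n, prob_interval_excludes_zero(n))
```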

If the world has some inherent noise or, as you say, irreducible stochasticity of a form that can be captured by putting a distribution over the parameters, then a good Bayesian would simply fit a multi-level model that includes that extra layer of stochasticity. Your idea is also similar to the idea that “all models are false”, the difference being that “all models are false” doesn’t imagine that we can necessarily get a true (or closer-to-true) model by sticking an extra layer of parametric stochasticity atop an existing model.

2 Likes

I think this is a really deep and really important question, and one that is answered poorly in many popular references! My best attempt to clarify what’s going on is in Section 1 of Probabilistic Modeling and Statistical Inference, although keep in mind that this piece is long overdue for some minor edits.

The one unifying assumption that every statistical analysis makes is the existence of some true data generating process which, abstractly, is just a probability distribution over the observational space that quantifies the fundamental lack of predictability of a given measurement. If we were able to make perfect measurements with no noise then this distribution would be singular, concentrating entirely on the deterministic state of the world. Note that there is no parameter space here, just a single probability distribution over the space of possible measurement outcomes.

If we knew the true data generating process then we could quantify all of the possible outcomes of any process that consumes data – I refer to this as “calibration”. In particular, if we have a black box algorithm that takes a measurement as input and returns a single real number then we can (see the sketch after this list)

  1. Simulate possible measurements from the true data generating process \pi^{\dagger}(y), \tilde{y}_{s} \sim \pi^{\dagger}(y).
  2. Evaluate the black box algorithm f on those inputs, \tilde{o}_{s} = f(\tilde{y}_{s}).
  3. Communicate the distribution of outputs, for example with a histogram summary or moments.
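
Here is the sketch promised above, with an assumed stand-in for the true data generating process (a Poisson with a fixed rate) and an arbitrary black-box summary f; every choice here is illustrative rather than prescriptive.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_dgp(size):
    """Stand-in for the true data generating process pi^dagger(y):
    a Poisson with a fixed rate, chosen purely for illustration."""
    return rng.poisson(lam=4.2, size=size)

def f(y):
    """An arbitrary black box mapping a measurement to a single real number."""
    return np.median(y)

# 1. simulate measurements, 2. push them through f, 3. summarize the outputs
outputs = np.array([f(true_dgp(size=20)) for _ in range(5000)])
print("mean of f:", outputs.mean(), "  sd of f:", outputs.std())
```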

In any real analysis, however, we don’t know what the true data generating process is, so we can’t do this kind of calibration just yet. First we have to find the true data generating process.

Unfortunately the space of all possible data generating processes is mathematically nasty. Just really disgusting stuff like probability distributions that can’t be represented or can’t be evaluated in finite time or are just too complex to work with in practice. Consequently we have to somehow restrict our search to a subset of possible data generating processes.

Following Dennis Lindley I will refer to this subset as a “small world”. Small worlds are often chosen at least partially for their mathematical convenience. Most small worlds that you’ll encounter can be coordinated, that is, we can identify each data generating process with a sequence of numbers that allows us to quickly look them up like addresses in a city grid. These numerical labels are also known as parameters. Until we choose a small world there is no notion of parameters!

If the small world contains the true data generating process (see the first figure of Probabilistic Modeling and Statistical Inference) then this restriction doesn’t actually cost us anything. So long as we can exhaustively search the small world, we can find that true data generating process. It also means that there is some parameter configuration \theta^{\dagger} that identifies the true data generating process.

Given how complex the world, and hence any true data generating process, is, we are unlikely to be so lucky as to have a mathematically convenient small world that contains the true data generating process. In that case we have the situation shown in the second figure of Probabilistic Modeling and Statistical Inference. Here there is no parameter configuration that identifies the true data generating process. At best the elements of the small world approximate the relevant features of the true data generating process. This is the meaning of Box’s famous quote that “all models are wrong but some are useful”. For more discussion see Section 1.4.1 of Towards A Principled Bayesian Workflow.

So in the ideal circumstance where our model contains the true data generating process, there is a “true” parameter configuration and we can ask whether our inferences – a point estimator or a set estimator or a posterior distribution – are able to recover that true value under various circumstances.

In the more realistic case where our model does not contain the true data generating process, there is no “true” parameter configuration. The best we can ask is whether or not our inferences identify data generating processes in our model that well-approximate certain features of the true data generating process.

Either way we can calibrate procedures that take in data as inputs using the small world. Instead of looking at what happens with data simulated from a single data generating process, however, we have to look at simulations from all of the data generating processes in the small world and then summarize the corresponding distribution of outputs.
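
For concreteness, a sketch of that small-world calibration, continuing the toy example above: each replication first draws a parameter configuration (here a Poisson log-rate from an assumed distribution over the small world) and then simulates data from the corresponding data generating process before summarizing the outputs. All of the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(y):
    """The same arbitrary black-box summary as before."""
    return np.median(y)

outputs = []
for _ in range(5000):
    log_rate = rng.normal(1.0, 0.5)                  # parameter configuration drawn over the small world
    y = rng.poisson(lam=np.exp(log_rate), size=20)   # data from that member of the small world
    outputs.append(f(y))

print("10/50/90% quantiles of f:", np.quantile(outputs, [0.1, 0.5, 0.9]))
```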

Let’s take the gravity example that @Bob_Carpenter mentioned. If the world were perfectly described by Newtonian gravity from the Earth, then one of the parameters that determine the outcome of measurements sensitive to gravity would be the gravitational acceleration, and there would be some true value of that parameter corresponding to real life.

But that isn’t real life. Objects are affected not just by the gravity of the Earth but also by the gravity of the moon, not to mention all of the other planets. Sometimes these influences are so weak that they can be safely ignored, but sometimes they can’t (tides!). Even worse, we know that Newtonian physics is only an approximation to the more general theory of relativity, so even if we considered all of the planets a Newtonian model could not contain the true data generating process! That said, for measurements on Earth of relatively low-mass objects that aren’t going too fast, a small world based on a Newtonian model is probably good enough to be “useful”.

The “power” analysis shown in Kruschke is also done in the context of a small world. After a small world is chosen, one isolates a single data generating process as the “null hypothesis” and sequesters the rest as “alternative hypotheses”. The black box tries to decide whether the null hypothesis is inconsistent with a given observation. By simulating data from the assumed null hypothesis over and over again, we can see how often the black box decides on the null hypothesis and how often it makes the wrong decision (false positive rate). We can then repeat and ask the same for all of the alternative hypotheses (true positive rates).

Critically, this analysis assumes that the true data generating process is either the null hypothesis or one of those alternative hypotheses. If the small world doesn’t contain the true data generating process then the false and true positive rates won’t quantify what happens when the black box is evaluated on real observations. When the small world contains only bad approximations to the true data generating process these rates can be arbitrarily wrong, but if our small world contains decent approximations then the rates might do a decent job of quantifying what we would actually see.
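
As a toy illustration of those rates (not Kruschke’s actual procedure), one can take a small world of normal models with known noise, call theta = 0 the null hypothesis, and use a black box that rejects the null whenever the sample mean sits more than about two standard errors from zero; every number below is an assumption made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n, noise_sd, n_sims = 30, 1.0, 5000   # illustrative sample size, noise, and replications

def rejects_null(y):
    """Black box: reject theta = 0 when |ybar| exceeds ~2 standard errors."""
    return abs(y.mean()) > 1.96 * noise_sd / np.sqrt(len(y))

def rejection_rate(theta):
    """How often the black box rejects when data really come from this theta."""
    return np.mean([rejects_null(rng.normal(theta, noise_sd, size=n))
                    for _ in range(n_sims)])

print("false positive rate (theta = 0):", rejection_rate(0.0))
for theta in (0.2, 0.5, 1.0):
    print(f"true positive rate (theta = {theta}):", rejection_rate(theta))
```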

Sorry that was so long but you hit on some deep points. The references I linked to above try to encapsulate the insights of George Box, I.J. Good, Dennis Lindley, L.J. Savage and more but with as many figures as I could fit in.

4 Likes