# Hypothesis testing using posterior samples of estimated parameter for two groups

Hello. This is more of a conceptual question than a coding one. (Please advise me to move it in case this is not the appropriate category.)

I’m modeling recruitment curves using a Hierarchical Bayesian model. There is a key parameter in my recruitment curve, let’s call it P. I have two groups (A and B) of participants of respective size N_A and N_B.

After fitting the recruitment curves, I get posterior estimates for the parameter P: P_A of shape (1000, N_A) for group A and P_B of shape (1000, N_B) for group B, where 1000 is the number of post-warmup posterior samples I collect.

Now, I want to test the hypothesis that the mean of parameter P for group A is less than that of B.
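For concreteness, the quantity I care about can be computed directly from the posterior draws themselves. A minimal sketch, assuming hypothetical arrays in place of the real P_A and P_B (the group sizes 12 and 15 and the generating distributions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws shaped (n_draws, n_participants), standing in
# for the real P_A and P_B; group sizes 12 and 15 are made up.
P_A = rng.normal(45.0, 10.0, size=(1000, 12))
P_B = rng.normal(55.0, 10.0, size=(1000, 15))

# Average over participants within each draw to get per-draw group means,
# then compare the two groups draw by draw.
mean_A = P_A.mean(axis=1)                # shape (1000,)
mean_B = P_B.mean(axis=1)                # shape (1000,)
prob = float((mean_A < mean_B).mean())   # Pr(group mean of A < group mean of B)
print(prob)
```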

To test this, I could take the MAP estimates and run a frequentist test. The problem is that MAP estimates are not very reliable for participants in whom these parameters are not clearly observed, and I would also lose the valuable information carried by the posterior samples.

Alternatively, I have set up a second Bayesian model in which I model the means of these parameters:

mu_A, mu_B ~ Normal(50, 100) [My prior knowledge says the mean lies in this range. I'm putting the same prior on both mu_A and mu_B, essentially saying there is no difference between them (the null model).]

sigma ~ HalfCauchy(1)

P_A ~ Normal(mu_A, sigma)
P_B ~ Normal(mu_B, sigma)

and once this model is fitted, I will reject the null hypothesis if the posterior samples of mu_A and mu_B satisfy Pr(mu_A < mu_B) > 0.95.
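A minimal numerical sketch of this second-stage idea, with two simplifications I'm assuming for brevity: the observed data are per-participant point summaries (the names pA_hat/pB_hat and all values are made up), and sigma is plugged in as the sample SD rather than given a HalfCauchy prior, so the normal-normal posterior for each group mean is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-participant point summaries of P (names and values made up).
pA_hat = rng.normal(45.0, 8.0, size=12)
pB_hat = rng.normal(55.0, 8.0, size=15)

def posterior_mu(y, prior_mean=50.0, prior_sd=100.0):
    """Conjugate normal-normal posterior for a group mean mu, with sigma
    plugged in as the sample SD (a simplification of the HalfCauchy prior)."""
    sigma = y.std(ddof=1)
    n = len(y)
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + y.sum() / sigma**2)
    return post_mean, np.sqrt(post_var)

mA, sA = posterior_mu(pA_hat)
mB, sB = posterior_mu(pB_hat)

# Sample each group-mean posterior and estimate Pr(mu_A < mu_B).
mu_A = rng.normal(mA, sA, size=10_000)
mu_B = rng.normal(mB, sB, size=10_000)
prob = float((mu_A < mu_B).mean())
print(prob)
```

In a full MCMC fit the same comparison is just the fraction of joint posterior draws with mu_A < mu_B.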

Is this a valid technique? If so, could you please tell me if it’s been used in literature and where I can read about it more?

Another question, possibly out of scope: how do I estimate the false-positive rate of such a testing procedure (basically, how do I know I can trust it)? Also, how sensitive is the procedure to the number of posterior samples?
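On the false-positive question, one pragmatic answer is simulation: generate many datasets under the null (both groups share the same true mean), apply the decision rule to each, and count rejections. A sketch under a simplified conjugate normal-normal model with plug-in sigma (all group sizes, means, and SDs below are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

def reject(yA, yB, n_draws=2000, threshold=0.95):
    """Decision rule from the post under a simplified conjugate model:
    reject when Pr(mu_A < mu_B) > threshold (plug-in sigma, Normal(50, 100)
    prior on each group mean; all numbers are illustrative)."""
    def post(y):
        sigma, n = y.std(ddof=1), len(y)
        var = 1.0 / (1.0 / 100.0**2 + n / sigma**2)
        return var * (50.0 / 100.0**2 + y.sum() / sigma**2), np.sqrt(var)
    mA, sA = post(yA)
    mB, sB = post(yB)
    p = (rng.normal(mA, sA, n_draws) < rng.normal(mB, sB, n_draws)).mean()
    return p > threshold

# Simulate null datasets (same true mean in both groups) and count rejections:
# the rejection fraction estimates the false-positive rate of the procedure.
n_sims = 500
false_pos = sum(
    reject(rng.normal(50.0, 8.0, 12), rng.normal(50.0, 8.0, 15))
    for _ in range(n_sims)
) / n_sims
print(false_pos)
```

As for sensitivity to the number of posterior samples S: the Monte Carlo standard error of an estimated probability p is sqrt(p(1 - p)/S), roughly 0.007 for p = 0.95 with S = 1000, so extra samples matter most when the estimate sits near the 0.95 threshold.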

@mathlad the quantity you're describing (an approximate posterior probability) used as a simple summary of an estimate has been called the probability of direction, or pd (https://doi.org/10.3389/fpsyg.2019.02767). It looks like a p-value (and a non-statistical audience will try to interpret it as one), but it seems to me to be an estimation-oriented statistic rather than a hypothesis-testing one. There is a more formal Bayesian hypothesis-testing paradigm (Bayes factors), though personally I have little experience with it.
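For reference, the pd is just the larger of the two one-sided posterior probabilities, so it lives in [0.5, 1]. A minimal sketch with hypothetical draws of the difference mu_B - mu_A:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior draws of the difference mu_B - mu_A.
diff = rng.normal(1.0, 2.0, size=1000)

p_pos = (diff > 0).mean()
pd = float(max(p_pos, 1.0 - p_pos))  # probability of direction, in [0.5, 1]
print(pd)
```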

The idea of 'rejecting a hypothesis', though, sounds to me like you're treating this as a decision procedure. Unless you're interested in specific hypotheses, why not simply summarize the posterior quantities of interest (like the apparent difference between mu_A and mu_B)? What is gained by setting a threshold on it?