Prior on Treatment Effect in Hierarchical Model

Hello, I am trying to create a hierarchical model with two levels (individuals and schools) to do causal inference with data from an RCT where treatment was assigned to clusters. The model works well and it recreates the results of the original (frequentist) study, but I was wondering what prior I should put on the population-level treatment effect. This is the likelihood of the model:

  for (n in 1:N) {
    m[n] = alpha[g[n]] + x[n, ] * beta[, g[n]] + effect * w[n];
  }
  uno = rep_vector(1, N);
  d = sigma2_t * w + sigma2_c * (uno - w);
  y ~ normal(m, d);  // note: Stan's normal() takes the SD, not the variance

And I put the following prior on the parameter “effect”

  effect ~ normal(0, 100);

I was wondering if there are better choices for this prior, either more conservative or more theoretically sound, as this one seems very arbitrary.

Domain expertise and prior predictive checks. It often makes sense to have a prior for the treatment effect centered on zero to accommodate the possibility of positive and negative effects, but maybe you have strong domain expertise that says they can’t possibly be one sign or the other.

Assuming a centered-on-zero prior, the next prior feature to consider is the scale (the inverse being the prior’s “informativeness”); generally, the tighter the prior around zero the more confident you are that the treatment effect is truly zero. You might be tempted to make the prior super wide/“uninformed” then, but prior predictive checks should be used to confirm that the width you choose produces data consistent with the kind of data you’d expect in the domain of study. For example, in my field (psychology), we usually observe effect sizes <<1, so a naive attempt to use a wide/uninformed prior like normal(0,10) would yield prior predictive data that would be very inconsistent with the domain. Hence the preference for “weakly-informed” priors where they’re wide but not so wide as to generate ridiculous-for-the-domain data during prior predictive checks.
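The prior predictive logic above can be sketched quickly outside Stan. This is a minimal Python stand-in (not the poster's model): it draws treatment effects from two candidate priors and checks how often each implies a standardized effect larger than 1, which would be extreme for a field like psychology. The specific scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 10_000

# Candidate priors for the treatment effect (scales are assumptions)
wide = rng.normal(0, 10, n_draws)   # naive "uninformed" prior
weak = rng.normal(0, 1, n_draws)    # weakly informative prior

# Fraction of prior draws implying |effect| > 1, i.e. more than one SD
# on a standardized outcome -- implausible in many applied domains.
frac_wide = np.mean(np.abs(wide) > 1)
frac_weak = np.mean(np.abs(weak) > 1)
print(f"normal(0,10): {frac_wide:.0%} of draws imply |effect| > 1")
print(f"normal(0, 1): {frac_weak:.0%} of draws imply |effect| > 1")
```

With the wide prior, the vast majority of prior draws imply ridiculous-for-the-domain effects, which is exactly what a prior predictive check is meant to flag.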

In my case the treatment effect is very unlikely to be negative (but I guess one cannot exclude the possibility). My reasoning is that, in theory, one should put a “good chunk” of mass around 0 to be conservative, since the treatment is not effective unless otherwise proven by the data. I see what you mean regarding the scale, do you think you could relate it to the standard deviation of the outcome variable? For example, setting the variance to be the square of 3 times the standard deviation of the outcome?
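The scaling rule suggested above (prior SD set to a multiple of the outcome SD, so the prior variance is the square of 3 times sd(y)) can be written out in a couple of lines. The factor of 3 comes from the post; the toy outcome data here is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(50, 8, size=200)   # hypothetical outcome data

sd_y = y.std(ddof=1)
prior_sd = 3 * sd_y               # prior: effect ~ normal(0, 3 * sd(y))
prior_var = prior_sd ** 2         # the "square of 3 times the SD" from the post

print(f"sd(y) = {sd_y:.2f}, prior sd = {prior_sd:.2f}, prior var = {prior_var:.1f}")
```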

I read a lot about horseshoe priors for variable selection. I feel like it might be an interesting idea, but at the same time I feel like there should be something better, since causal evaluation is a more delicate problem than just variable selection.

Anyway, thanks for the suggestion, I hope others can enrich the conversation, I am very curious about the topic.

If the outcomes can be relied upon to have a generally stable SD from study to study and you have a reasonably low uncertainty in the estimation of that quantity from a single study, then yes, that seems reasonable (indeed, it’s roughly what I usually do).

This is for contexts where one has many treatment variables or many outcomes and one is looking to isolate those with an effect of treatment from those that don’t have an effect.

I am still not totally convinced; I find it interesting that there has not been more careful work on this in causal inference.

For example, in my case I would like a prior with a good amount of mass around 0, very little to the left of it, and some to the right.

Is a Normal really the usual way to go?

So far as I’ve seen, yes. Part of the idea is that if there were zero effect amid noisy measurement, the resulting distribution of study means would be symmetric around zero.

Remember that Bayesian inference makes more explicit some aspects of the socio-empirical process that is the scientific endeavour. So prior setting will always be contingent on the degree of consensus around discernible consistencies in the published literature for the specific phenomena of study. So you have to consider the mindset of the expert-but-skeptical-peer. That’s primarily why folks tend to use symmetric-centered-on-zero priors for effect-of-treatment parameters such as you’re modelling. Probably also at play is the minor ease-of-implementation of the normal to this end.

But if you feel a different shape would be more domain-consistent and that you can convince others of this view, you are free to devise whatever alternate shape you want. For example, a shifted log-normal (shifted so that peak credibility is still at zero), or even a mixture of a positive exponential and a negative exponential with different rates, etc.
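One way to sketch the exponential-mixture idea above: combine a positive exponential and a mirrored negative exponential with different rates, so the mode stays at zero, most mass sits to the right, and only a thin tail extends left. The mixture weight and rates here are illustrative assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

p_pos = 0.8                      # assumed prior probability the effect is positive
rate_pos, rate_neg = 2.0, 10.0   # heavy tail to the right, thin tail to the left

# Draw the sign from the mixture weight, then the magnitude from the
# corresponding exponential branch.
sign = rng.random(n) < p_pos
draws = np.where(sign,
                 rng.exponential(1 / rate_pos, n),    # positive branch
                 -rng.exponential(1 / rate_neg, n))   # negative branch

print(f"P(effect > 0)    ~= {np.mean(draws > 0):.2f}")
print(f"P(effect < -0.1) ~= {np.mean(draws < -0.1):.3f}")
```

Roughly 80% of the prior mass lands on positive effects while meaningfully negative effects stay possible but rare, which matches the "good mass around 0, little to the left, some to the right" shape asked for earlier in the thread.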