I’m not one of the authors so I can’t speak for them (possibly @Avi_Feller can chime in). My thoughts on your questions:

> rho 0 or 1 are modelled, but why not a rho in between? What does it actually mean when we say Y1 and Y0 are correlated? How would we estimate/derive a good prior for rho?

I’m not sure I understand what you mean by rho 0/1 or a rho in between; they use rho to indicate the correlation between the errors of the two potential outcomes. They point to the reference Causal Inference: A Missing Data Perspective, p. 22, and there’s more on this type of correlated outcomes in Chapter 1 Fundamental Problem of Causal Inference | Statistical Tools for Causal Inference. In the potential outcomes framework, person i has a (continuous) outcome y_t under treatment t, with t either 0 (not treated) or 1 (treated). A positive correlation between these two states means that low/high values in the untreated state suggest low/high values in the treated state. For example, say we want to know whether wealth increases after randomly assigning people a course in financial management: people with high wealth before the course will probably have high wealth after it.
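To make the correlated-potential-outcomes idea concrete, here’s a small simulation sketch (all numbers are made up for illustration): draw (y0, y1) jointly normal with correlation rho, so units with high untreated outcomes also tend to have high treated outcomes.

```python
import numpy as np

# Hypothetical sketch: joint draws of potential outcomes (y0, y1) with
# correlation rho. A positive rho means high values untreated go with
# high values treated (e.g., wealth before/after a finance course).
rng = np.random.default_rng(42)

rho = 0.8                     # assumed correlation between potential outcomes
mu = np.array([1.0, 1.5])     # means of y0 and y1; the effect here is 0.5
s0, s1 = 1.0, 1.0             # marginal standard deviations
cov = np.array([[s0**2,       rho * s0 * s1],
                [rho * s0 * s1, s1**2]])

y = rng.multivariate_normal(mu, cov, size=100_000)
y0, y1 = y[:, 0], y[:, 1]

print(np.corrcoef(y0, y1)[0, 1])   # close to rho = 0.8
print((y1 - y0).mean())            # close to the effect 0.5
```

Note that rho never affects the mean of y1 − y0, only its spread, which is why it is unidentifiable from the observed data alone: we never see both outcomes for the same unit.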

As for a prior on rho, I think an uninformative prior centered at 0 makes sense, unless you have prior information about the correlation.

> tau_fs has a smaller variance despite normal_rng noise, why?

`tau_fs` is a sample average, and its sd will continue to decrease with increasing sample size. We can derive an estimate of this quantity by using the fact that the variance of the difference of two independent normal variates is \sigma_1^2 + \sigma_0^2, so the sd of the average is \sigma_{E(y_1 - y_0)} = \sqrt{\frac{\sigma_1^2 + \sigma_0^2}{N}}. In fact, we also get more certain about the super-population \tau with increasing sample size. So back to your original question: why would it be smaller? I think the intuition is that in the model block we only have information on the outcomes we observe, but in the generated quantities block we condition on the estimated effect and then draw the potential outcomes. This conditioning reduces the variance of the finite-sample effect.
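The standard-error formula above can be checked by simulation (sample sizes and sds below are arbitrary): simulate many finite samples of independent y1 and y0, take the average difference in each, and compare the spread of those averages to \sqrt{(\sigma_1^2 + \sigma_0^2)/N}.

```python
import numpy as np

# Sketch checking the formula: for independent y1 ~ N(mu1, sigma1^2) and
# y0 ~ N(mu0, sigma0^2), the sd of the sample average of (y1 - y0)
# should be sqrt((sigma1^2 + sigma0^2) / N).
rng = np.random.default_rng(0)

N, sigma1, sigma0 = 500, 2.0, 1.5
reps = 5_000                           # number of replicated finite samples

y1 = rng.normal(1.0, sigma1, size=(reps, N))
y0 = rng.normal(0.0, sigma0, size=(reps, N))
tau_fs = (y1 - y0).mean(axis=1)        # one finite-sample average per replication

theory = np.sqrt((sigma1**2 + sigma0**2) / N)
print(tau_fs.std(), theory)            # the two should agree closely
```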

> for the counterfactual y0 and y1, should we set a lower bound of 0? As earnings cannot be negative.

Probably.
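One simple option, outside the model itself, is to impute counterfactuals from a normal truncated below at zero. The sketch below is purely illustrative (the mu and sigma stand in for a posterior mean and sd, and the rejection-sampling helper is mine, not the paper’s); in a Stan model you could instead put a lower bound on the quantity or model log earnings.

```python
import numpy as np

# Illustrative sketch (not from the paper): imputing counterfactual earnings
# from a normal truncated below at 0, so draws can never be negative.
rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0   # hypothetical posterior mean and sd for one unit

def draw_truncated(mu, sigma, size, rng):
    """Rejection-sample N(mu, sigma^2) restricted to [0, inf)."""
    out = np.empty(0)
    while out.size < size:
        draws = rng.normal(mu, sigma, size=size)
        out = np.concatenate([out, draws[draws >= 0.0]])
    return out[:size]

y_cf = draw_truncated(mu, sigma, 10_000, rng)
print(y_cf.min())   # >= 0 by construction
```

Truncation pulls the mean of the imputed draws above mu, so it is not a free lunch; whether that distortion is acceptable depends on how much posterior mass sits below zero.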

> in section 3.2.2, I don’t get the interaction part for the treatment, why do we need it?

They say, “Instead of imposing restrictions that the effects of X_i are the same for both potential outcomes, we define two different vectors of the slope coefficients \beta_c and \beta_t for the control and treated units respectively. The difference in the two vectors, \beta_t − \beta_c, can be obtained by including an interaction term between X and W in the model.”
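A quick way to see the quoted point is with a least-squares simulation (coefficient values are made up): generate data where the slope on X is \beta_c for controls and \beta_t for treated units, then fit a regression with an X×W interaction; the interaction coefficient recovers \beta_t − \beta_c.

```python
import numpy as np

# Sketch of the quoted claim: with outcome model
#   y = alpha + tau*W + beta_c*X + (beta_t - beta_c)*(X*W) + noise,
# the X:W interaction coefficient equals beta_t - beta_c.
rng = np.random.default_rng(3)

n = 10_000
X = rng.normal(size=n)
W = rng.integers(0, 2, size=n)                 # treatment indicator
beta_c, beta_t, tau = 1.0, 2.5, 0.7
y = tau * W + np.where(W == 1, beta_t, beta_c) * X + rng.normal(scale=0.1, size=n)

design = np.column_stack([np.ones(n), W, X, X * W])   # intercept, W, X, X:W
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef[3])                                 # close to beta_t - beta_c = 1.5
```

So the interaction isn’t extra structure for its own sake: dropping it forces \beta_t = \beta_c, i.e. it assumes the covariates affect both potential outcomes identically.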