I hope this is the right place to ask this: I am fairly new at trying to understand my models through the eyes of causal Directed Acyclic Graphs (DAGs). But I have been trying to wrap my head around them before choosing which variables to condition on when running my models, and to inform me about potentially relevant variables before collecting my data.
My confusion is perhaps too naive, but here it goes. I have read that when choosing which variables to condition on, those that are ancestors of both X (target effect) and Y (response variable) should be conditioned on in order to get an unbiased estimate. Thus, if X → Y and X ← Z → Y, one should condition on Z to get an unbiased estimate of the effect of X on Y. My question regards the case in which Y is the descendant of more than one parent. In a simple scenario in which the only arrows present are those of W, X, and Z directed to Y, but where I am specifically interested in the effect of X on Y, shouldn’t I also condition on W and Z? Wouldn’t I get a biased estimate if I do not include W and Z in my model, given that they, like X, are affecting Y? That is, shouldn’t one add all the variables that cause Y? And if not, why?
I would appreciate any insight anyone could provide me.
I’m always keen to learn more about causal reasoning, so here goes:
If your system is X → Y and X ← Z → Y, then Z is a confounder of X → Y, so any model for X → Y must also include Z; otherwise the estimate of X → Y will be biased. Even if the direct causal effect X → Y were 0, the backdoor path X ← Z → Y would remain and induce an association between X and Y.
If your system also includes W → Y, then including W in your model is not necessary to prevent bias in X → Y. Because there is no arrow between W and X, W opens no backdoor path from X to Y: if X → Y were 0, you would expect the estimated effect of X to be 0 whether or not W is in the model. However, including W may still be worthwhile, because it will improve the precision of the X → Y estimate. If W → Y is non-zero, then excluding W adds noise to the estimate of X → Y, but the estimate will not be biased.
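As a rough check of the reasoning above, here is a minimal simulation sketch. It assumes a linear model with Gaussian noise; the coefficients (0.8, 1.5, 2.0), the true effect of 1.0, and the sample size are made up purely for illustration and are not taken from any real data.

```python
import numpy as np

def ols_coefs(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

rng = np.random.default_rng(0)
n = 100_000          # illustrative sample size
true_effect = 1.0    # assumed causal effect X -> Y

z = rng.normal(size=n)                  # confounder: Z -> X and Z -> Y
w = rng.normal(size=n)                  # cause of Y only: W -> Y
x = 0.8 * z + rng.normal(size=n)
y = true_effect * x + 1.5 * z + 2.0 * w + rng.normal(size=n)

b_naive = ols_coefs(y, x)[1]            # Y ~ X: backdoor path through Z left open
b_adj_z = ols_coefs(y, x, z)[1]         # Y ~ X + Z: backdoor path closed
b_adj_zw = ols_coefs(y, x, z, w)[1]     # Y ~ X + Z + W: W added for precision only

print(f"Y ~ X         : {b_naive:.3f}")
print(f"Y ~ X + Z     : {b_adj_z:.3f}")
print(f"Y ~ X + Z + W : {b_adj_zw:.3f}")
```

Under these assumptions, the coefficient on X should land well away from 1.0 when Z is omitted and close to 1.0 once Z is included, with or without W.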
+1, although one important “but” is missing: collider bias, which is caused by “controlling” for a collider, i.e., a third variable that is influenced by both X and Y. A collider does not need to be conditioned on. If you condition on it anyway, the estimated effect of X on Y will be biased.
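The collider point can be illustrated with a small simulation sketch. Again this assumes a linear Gaussian model with made-up coefficients; the variable name `c` for the collider is hypothetical.

```python
import numpy as np

def ols_coefs(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

rng = np.random.default_rng(1)
n = 100_000          # illustrative sample size
true_effect = 1.0    # assumed causal effect X -> Y

x = rng.normal(size=n)
y = true_effect * x + rng.normal(size=n)
c = x + y + rng.normal(size=n)          # collider: X -> C <- Y

b_plain = ols_coefs(y, x)[1]            # Y ~ X: collider left alone
b_collider = ols_coefs(y, x, c)[1]      # Y ~ X + C: conditioning on the collider

print(f"Y ~ X     : {b_plain:.3f}")
print(f"Y ~ X + C : {b_collider:.3f}")
```

In this setup the plain regression should recover a value near 1.0, while adding the collider pulls the coefficient on X far from the true effect.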
Ok, I see. So in this DAG, controlling for Z gives me an unbiased estimate, while controlling for those variables that are a cause of Y but not of X improves the precision of my estimate of the effect of X on Y. I imagine that leaving them out would show up in the model as a larger standard error for that estimate.
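To make the precision point concrete, here is a sketch that compares the standard error of the X coefficient with and without W. It assumes a linear Gaussian model in which W affects Y but not X; the coefficients and sample size are illustrative, and `ols_with_se` is a hypothetical helper written for this example using the classical homoskedastic OLS formula.

```python
import numpy as np

def ols_with_se(y, *regressors):
    """Return (coefficients, standard errors) from OLS, intercept first."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    dof = len(y) - design.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # classical OLS covariance
    return coefs, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 2_000            # illustrative sample size
true_effect = 1.0    # assumed causal effect X -> Y

x = rng.normal(size=n)
w = rng.normal(size=n)                  # W -> Y only, independent of X
y = true_effect * x + 2.0 * w + rng.normal(size=n)

coefs_without, se_without = ols_with_se(y, x)      # Y ~ X
coefs_with, se_with = ols_with_se(y, x, w)         # Y ~ X + W

print(f"Y ~ X     : b = {coefs_without[1]:.3f}, SE = {se_without[1]:.4f}")
print(f"Y ~ X + W : b = {coefs_with[1]:.3f}, SE = {se_with[1]:.4f}")
```

Both fits should give a coefficient on X near 1.0, but the standard error should be noticeably smaller when W is included, since W absorbs variance in Y that would otherwise sit in the residual.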
I highly recommend using dagitty (and there are other packages available) to tell you which variables to adjust for given a causal DAG: http://www.dagitty.net/