Causal inference question: is this Stan model consistent with this DAG?

I’m analyzing data from an experiment where researchers planted seeds at three separate sites. At each site, they selected plots in four habitat types. In each plot, they recorded three environmental variables: soil pH (pH), soil moisture (SM) and soil temperature (ST). In each habitat they planted 6 seeds.

The study aims to analyse the effect of habitat, pH, soil moisture and soil temperature on the probability of germination.

I drew the following DAG, which assumes that “habitat” has both direct and indirect effects on germination. I added “Site” as a varying intercept to account for the heterogeneity in germination success that is not explained by pH, SM, ST or habitat.

g <- dagitty("dag{
      Habitat -> pH -> G
      Habitat -> ST -> G
      Habitat -> SM -> G
      Habitat -> G
      Site -> G
}")

impliedConditionalIndependencies(g) gives:

Hbtt _||_ Site
SM _||_ ST | Hbtt
SM _||_ Site
SM _||_ pH | Hbtt
ST _||_ Site
ST _||_ pH | Hbtt
Site _||_ pH


I wrote the following model:

Number of seedlings that germinated ~ beta_binomial(6, mu * K, (1-mu) * K);

logit(mu) = α + α_site[site] + α_habitat[habitat] +
β1[habitat]*pH + β2[habitat]*Soil moisture + β3[habitat]*Soil temperature

α, α_site ~ Normal(0, 0.1)

α_habitat ~ Normal(0, σ_habitat)

σ_habitat ~ Exponential(3)

β1 ~ Normal(μ_pH, σ_pH)
β2 ~ Normal(μ_SM, σ_SM)
β3 ~ Normal(μ_ST, σ_ST)

μ_pH, μ_SM, μ_ST ~ Normal(0, 0.1)
σ_pH, σ_SM, σ_ST ~ Exponential(3)


  1. Is this model consistent with the DAG I drew?

  2. Can I interpret “α_habitat” estimates as the effect of “habitat” after already accounting for the effects of pH, SM and ST?

  3. One of my colleagues asked if we could somehow estimate the pure effect of the “habitat” variable on germination. I don’t think we can decouple the effects of the environmental variables from “habitat” with this experimental design, but I would like to hear your opinion.


Small thing: you define priors for a_habitat twice. Probably a typo for the second one?

Plot is not part of your DAG. Without it your question is not answerable. After adding it, you can use ggdag or dagitty (dagitty is also available as an R package) to answer the question :)
See this for an example of how to use ggdag_adjustment_set()


Yes, that was a typo. Fixed.

Sorry, but I don’t follow your question. Are you not able to see the DAG in the post? Or do you want to see the code I used to generate the DAG? If that is the case:

dag <- dagify(
  G ~ pH + ST + SM + Habitat + Site,
  pH ~ Habitat,
  ST ~ Habitat,
  SM ~ Habitat
)
ggdag_parents(dag, "G") + theme_dag_blank()

I think @scholz meant something like this,

    g <- dagitty("dag{
      Habitat -> pH -> G
      Habitat -> ST -> G
      Habitat -> SM -> G
      Habitat -> G
    }")

    impliedConditionalIndependencies(g) gives:

    SM _||_ ST | Hbtt
    SM _||_ pH | Hbtt
    ST _||_ pH | Hbtt

e.g., SM is conditionally independent from ST given Habitat, and so on. This is but an example. For more info see here:


@torkar is right. However, you are using plot as a predictor in your model while it is not present in your DAG. So depending on the causal relations that you assume plot to have with the rest of your variables, the conditional independencies/adjustment sets will differ.


I’m sorry. “plot” shouldn’t be part of the equation, it was a mistake. That’s why I was so lost when I read your reply.

I rewrote the original question and used dagitty to draw the DAG and extracted the implied conditional independencies. I also added an explanation as to why “Site” has to be part of the model.

Hopefully my questions are clearer now.


I think you’re right here - what is habitat if not a particular collection of biotic and abiotic components? I think what this gives you are the effects of the ‘unmeasured’ components of habitat - vegetation type, for example - conditional on the environmental variables you have access to in your data (which could be considered to form part of a habitat definition themselves).

I think where I’d urge a bit more caution here is to consider what unmeasured variables are at play here, and to include those in your DAG as well - going off your DAG, soil variables could all be influenced by the climate or the parent material (good to think about the Jenny state factor equation!). Depending on how these fit into your DAG or DAGs, you’ll get a much better sense of whether you can get an unbiased or less biased estimate of habitat or not.


Actually, they collected a few more environmental variables. I didn’t include them all in the example, because I thought it would be too scary.

Concerning unmeasured covariates, soil temperature and soil moisture are likely influenced by temperature and precipitation, which they did not record. The pairwise correlation between “soil_moist” and “soil_temp” across all habitat plots is -0.42 which I thought was low enough to include them both in the model. The soil type is the same across all habitats.

Here’s the “multiverse of madness”-like DAG.

g <- dagitty("dag{
      Habitat -> pH -> Germination
      Habitat -> Soil_temp -> Germination
      Habitat -> Soil_moist -> Germination
      Habitat -> Phosphorus -> Germination
      Habitat -> Potassium -> Germination
      Habitat -> G
      Site -> G
      Precip -> Soil_temp
      Precip -> Soil_moist
      Temp -> Soil_temp
      Temp -> Soil_moist
}")

impliedConditionalIndependencies(g) gives:

G _||_ Grmn | Phsp, Ptss, Sl_m, Sl_t, pH
G _||_ Grmn | Hbtt
G _||_ Phsp | Hbtt
G _||_ Ptss | Hbtt
G _||_ Prcp
G _||_ Sl_m | Hbtt
G _||_ Sl_t | Hbtt
G _||_ Tmpr
G _||_ pH | Hbtt
Grmn _||_ Hbtt | Phsp, Ptss, Sl_m, Sl_t, pH
Grmn _||_ Prcp | Hbtt, Sl_m, Sl_t
Grmn _||_ Prcp | Phsp, Ptss, Sl_m, Sl_t, pH
Grmn _||_ Site
Grmn _||_ Tmpr | Hbtt, Sl_m, Sl_t
Grmn _||_ Tmpr | Phsp, Ptss, Sl_m, Sl_t, pH
Hbtt _||_ Prcp
Hbtt _||_ Site
Hbtt _||_ Tmpr
Phsp _||_ Ptss | Hbtt
Phsp _||_ Prcp
Phsp _||_ Site
Phsp _||_ Sl_m | Hbtt
Phsp _||_ Sl_t | Hbtt
Phsp _||_ Tmpr
Phsp _||_ pH | Hbtt
Ptss _||_ Prcp
Ptss _||_ Site
Ptss _||_ Sl_m | Hbtt
Ptss _||_ Sl_t | Hbtt
Ptss _||_ Tmpr
Ptss _||_ pH | Hbtt
Prcp _||_ Site
Prcp _||_ Tmpr
Prcp _||_ pH
Site _||_ Sl_m
Site _||_ Sl_t
Site _||_ Tmpr
Site _||_ pH
Sl_m _||_ Sl_t | Hbtt, Prcp, Tmpr
Sl_m _||_ pH | Hbtt
Sl_t _||_ pH | Hbtt
Tmpr _||_ pH



That dag looks nice. However, I assume that G should have been Germination as well?

So now you can define exposure and outcome (one DAG per causal query) and either ask for an adjustment set or test whether your model is an adjustment set (an adjustment set is a set of variables to condition on to get an unbiased estimate of the causal effect of the exposure on the outcome).

You can use paths() to get a list of causal paths and whether they are open or closed. adjustmentSets() gives you the adjustment sets for your DAG, and isAdjustmentSet() lets you test whether your model conditions on an adjustment set for the given graph. Note that an adjustment set is only concerned with the bias of your estimate, not with its precision, so there can be a benefit to conditioning on additional variables that are not in the minimal adjustment set (see A crash course in good and bad controls).


Here’s the complete DAG, and the corresponding code, with both outcome and exposure identified. “Precip” and “Temp” are unobserved and I know they influence both soil temperature and soil moisture.

g <- dagitty("dag{
      Germination [outcome]
      Precip [unobserved]
      Temp [unobserved]
      pH [exposure]
      Soil_temp [exposure]
      Soil_moist [exposure]
      P [exposure]
      K [exposure]
      Site [exposure]
      Habitat -> pH -> Germination
      Habitat -> Soil_temp -> Germination
      Habitat -> Soil_moist -> Germination
      Habitat -> P -> Germination
      Habitat -> K -> Germination
      Habitat -> Germination
      Site -> Germination
      Precip -> Soil_temp
      Precip -> Soil_moist
      Temp -> Soil_temp
      Temp -> Soil_moist
}")

Regression model:

Number of seedlings that germinated ~ beta_binomial(6, mu * K, (1-mu) * K);

logit(mu) = α + α_site[site] + α_habitat[habitat] +
β1[habitat]*pH + β2[habitat]*Soil moisture + β3[habitat]*Soil temperature + β4*K + β5*P


  1. Is it correct to keep both soil moisture and soil temperature (correlation = -0.42) in the model?
  2. I stratified the covariates by “habitat”. Should I also keep α_habitat[habitat] in the model?

These are more general comments:

In causal inference one typically uses one regression model to estimate one causal effect for one exposure.

This is so because the inclusion or exclusion of covariates in a model depends on their relationship to the exposure. A covariate could be a confounder for one exposure variable and a mediator for another; a mediator would not be included if one is interested in the total causal effect. (Things are different for direct effects.)

One way forward would be to specify only one exposure and one outcome and ask dagitty/ggdag for the minimal adjustment sets.

If there is at least one such set and you include all its variables in the model, the causal effect is identified, provided the DAG is correct (Let’s ignore here that one could discuss nonlinear effects).

It is also useful to include covariates that predict the outcome and are unrelated to the exposure as this reduces the error variance.
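This variance-reduction point can be checked with a quick simulation (a minimal Python sketch; the linear Gaussian setup and all coefficients are made up for illustration). Here w predicts the outcome but is unrelated to the exposure z, and adjusting for it tightens the estimate of the effect of z:

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    # ordinary least-squares coefficients
    return np.linalg.lstsq(X, y, rcond=None)[0]

n, reps = 200, 500
b_without, b_with = [], []
for _ in range(reps):
    z = rng.normal(size=n)                      # exposure
    w = rng.normal(size=n)                      # predicts y, unrelated to z
    y = 0.5 * z + 2.0 * w + rng.normal(size=n)  # true effect of z is 0.5
    ones = np.ones(n)
    b_without.append(ols(np.column_stack([ones, z]), y)[1])
    b_with.append(ols(np.column_stack([ones, z, w]), y)[1])

# both estimators are unbiased, but adjusting for w shrinks the spread
print(np.std(b_without), np.std(b_with))
```

Across the repeated simulations both models center on the true effect of 0.5, but the model that includes w shows a markedly smaller spread of the estimates.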

If you are interested in causal effects from several variables, you would need to define the minimal adjustment set for each separately.

Hope this helps


About checking DAGs:

If you want to check whether your DAG is correct by testing the implied conditional independencies, you will again need one regression model per implied independence.

The big challenge is that one needs to show that a coefficient is zero or practically equivalent to zero, which requires large sample sizes.
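As an illustration of what such a check looks like (a toy Python simulation; the data-generating numbers are arbitrary), one can simulate data in which the implied independence SM _||_ pH | Habitat holds, and verify that the pH coefficient is near zero once habitat is conditioned on:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# toy data with Habitat -> pH and Habitat -> SM but no pH -- SM edge,
# so the DAG implies SM _||_ pH | Habitat
habitat = rng.integers(0, 4, size=n)
ph = habitat + rng.normal(size=n)
sm = 2.0 * habitat + rng.normal(size=n)

# regression of SM on pH plus habitat dummies: the pH coefficient
# should be close to zero if the implied independence holds
dummies = (habitat[:, None] == np.arange(4)).astype(float)
coef_ph = np.linalg.lstsq(np.column_stack([dummies, ph]), sm, rcond=None)[0][-1]

# marginal regression of SM on pH alone: strongly confounded by habitat
naive = np.linalg.lstsq(np.column_stack([np.ones(n), ph]), sm, rcond=None)[0][-1]

print(coef_ph, naive)
```

With a small sample the conditional coefficient is noisy, which is exactly the large-sample-size problem mentioned above: showing that it is practically equivalent to zero requires much more data than showing it is nonzero.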


Just one small note on this:

It is also useful to include covariates that predict the outcome and are unrelated to the exposure as this reduces the error variance.

This is not necessarily true as discussed in the A crash course in good and bad controls paper I mentioned earlier.
I wrote a blog post about this here where I simulate from some of the DAGs from the paper.

I just had a quick look at your blog (which I like).
But ;-)
Using the variables from your blog post, my point was that one should include z2, which you show reduces the variance of the coefficient for the causal effect.
I did not write that z3 should be included, which is where you find increased variance. z3 predicts the exposure, but I wrote one should include variables that predict the outcome and are unrelated to the exposure.

Is this a misunderstanding, or did I overlook where your blog post shows that including a variable that predicts the outcome and is unrelated to the exposure increases variance of the estimated causal effect?

Is this a misunderstanding,

This is totally me not reading your post carefully enough m(

I find this comment quite interesting. To be concrete, think about a simple model: treatment Z, some post-treatment variable X, and outcome Y. Assume the treatment is randomized so that there are no confounders.

For many causal inference people, a regression model Y~Z appears sufficient: take the regression coefficient and that is it.

On the other hand, it is natural to model this whole process in Stan, which amounts to a regression model Y~X+Z. Now the regression coefficient is not quite the same as the treatment effect, but I could certainly do some imputation to compute the treatment effect, possibly with extra modeling of Z~X. If I need to analyze a different treatment in the future, I may have a different imputation (the generated quantities block), while the model block remains the same.

As Andrew Gelman would say, there is always a tension between the fit-as-much-as-you-can-from-your-data attitude, and the go-for-robustness-using-a-reduced-form approach.


My statement had a few implicit assumptions about the research setting and goals that typically hold when people start a discussion with a DAG, but I see that making these assumptions explicit would have been better. Here is a complete statement:

In causal inference of total effects from observational data one typically uses one regression model to estimate one causal (total) effect for one exposure.

Regarding your example: it is my understanding that the RCT literature, where the estimand typically is the total effect of a treatment, generally discourages including post-treatment variables because these could be mediators.

For instance, if the DAG (or data generating process) is Z \rightarrow X \rightarrow Y, I would estimate the total effect of Z as Y \sim Z because including X in the model would lead to underestimating the effect of Z.*

If, on the other hand, the DAG (or data generating process) is Z \rightarrow Y \leftarrow X, I would estimate the total effect as Y \sim Z + X, because including X reduces the variance of the (regression) coefficient for Z.

Of course, things are different if one wants to estimate direct effects:

If the DAG (or data generating process) is either Z \rightarrow X \rightarrow Y or Z \rightarrow Y \leftarrow X, I would estimate the direct effect of Z as Y \sim Z + X.

*If I understand correctly, one point you are making is that one could still model Y \sim Z + X and use the coefficients for Z and X to calculate the total effect of Z. I agree that this is possible, but I assume that the usefulness of such an estimator for the total effect depends on things like the accuracy and precision with which X is measured and the sd of the priors for the model coefficients. I suspect that the variance of such an estimator will typically be larger compared to just using the coefficient for Z in the model Y \sim Z, but I admit that I haven’t given this issue much thought.
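The mediator case is easy to verify by simulation (a Python sketch with arbitrary coefficients, assuming the DAG Z \rightarrow X \rightarrow Y plus a direct path Z \rightarrow Y):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# DAG: Z -> X -> Y and Z -> Y, so X is a mediator
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.3 * z + 0.7 * x + rng.normal(size=n)
# true direct effect of Z: 0.3; true total effect: 0.3 + 0.7 * 0.8 = 0.86

ones = np.ones(n)
total = np.linalg.lstsq(np.column_stack([ones, z]), y, rcond=None)[0][1]
direct = np.linalg.lstsq(np.column_stack([ones, z, x]), y, rcond=None)[0][1]
print(total, direct)  # approximately 0.86 and 0.30
```

Y \sim Z recovers the total effect, while adding the mediator X leaves only the direct effect, matching the rules above.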


Thanks @Guido_Biele
Yes, I am talking about total effect of the treatment.

Sometimes the fit-as-much-as-you-can-from-your-data approach can still be useful. Consider a DAG in which Z is post-treatment and independent of X; then including this Z in the model helps fit the data and also reduces the variance of the treatment effect estimate. This type of analysis is more complicated in non-linear models, but the point is that we may always start with a fit-as-much-as-you-can-from-your-data model and impute missing potential outcomes to answer causal questions.

Which approach is more efficient for treatment effect estimate? I do not know.

I think we agree @yuling :-)

This is what I tried to express, albeit not directly, here:

And more clearly here:


On the code, I wouldn’t use the beta-binomial that way. You should define mu as a local variable and set it to the inverse logit of the linear predictor, i.e.,

real mu;
mu = inv_logit(alpha + X * beta);
seeds ~ beta_binomial(6, mu * K, (1 - mu) * K);

Also, I don’t think there is any reason here to use the beta-binomial. The binomial is simpler and will get you what you need. I don’t think there is a reason to complicate things when you already have a lot going on in the DAG:

seeds ~ binomial(6, mu);

If you want some intuitions on the beta-binomial, you can read my blog post here: What To Do (And Not to Do) with Modeling Proportions/Fractional Outcomes | Robert Kubinec
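For a quick intuition on what the beta-binomial buys you, here is a small numpy sketch (the shape parameters a and b are arbitrary): drawing the germination probability from a Beta distribution before each binomial draw inflates the variance of the counts relative to a plain binomial with the same mean.

```python
import numpy as np

rng = np.random.default_rng(7)
trials, a, b, reps = 6, 2.0, 3.0, 200_000

# beta-binomial: the germination probability varies from plot to plot
p_varying = rng.beta(a, b, size=reps)
beta_binom = rng.binomial(trials, p_varying)

# plain binomial with the same mean probability a / (a + b)
plain_binom = rng.binomial(trials, a / (a + b), size=reps)

# same mean, but the beta-binomial counts are overdispersed
print(beta_binom.mean(), plain_binom.mean())
print(beta_binom.var(), plain_binom.var())
```

If the observed germination counts show more spread than a binomial allows, the beta-binomial can absorb it; otherwise the plain binomial is the simpler choice, as argued above.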