Multivariate model with subsetting, getting posterior predictions in brms

Plant height y is 0 if the plant died, and y > 0 if it survived. Plants are exposed to fire, drought, and herbivory.

real_data <- real_data %>% mutate(survived = as.integer(y > 0))

model_formula <- bf(survived ~ Fire + Drought + Herbivory, family = bernoulli()) +
  bf(log(y) | subset(survived) ~ Fire + Drought + Herbivory,  family = gaussian()) +
  set_rescor(FALSE)

fit_subsetting <- brm(
  formula = model_formula,
  data = real_data
)

How can I get posterior predictions for y, plant height, where y is set to 0 if survived == 0 ? I tried:

pp_log_y = posterior_predict(fit_subsetting, resp = "logy")
pp_y = exp(pp_log_y)
pp_survived = posterior_predict(fit_subsetting, resp = "survived")

conditioned_pp_y <- matrix(0, nrow = nrow(pp_survived), ncol = ncol(pp_survived))
conditioned_pp_y[,real_data$survived == 1] <- pp_y 

But this assumes survival status from the real data not the posterior simulations. Ultimately I’d like to do ppc like:

ppc_hist(real_data$y, conditioned_pp_y[1:8, ]) +
    ggtitle("Posterior Predictive Check") +
    xlab("Values of y") +
    theme_minimal()

Thank you so much !!

I would think that you would simulate the proportion that survived and then for that proportion simulate the heights and the rest would be zero. This would then provide a simulation of overall distribution of plant heights.

Say you plant 100 seeds at time zero, and all seeds germinate. Let’s suppose you sent out some undergrad student to measure the plants at time T and they took a ruler with markings to every cm and so all you have is integer data to the nearest cm of plants that are alive. Those alive get a non-zero height in cm and those dead get a 0 cm height. I might imagine your generated quantities in a Stan model to look something like:

generated quantities {
  vector[N] survived_preds;
  vector[N] y_preds;
 for (n in 1:N) { 
    survived_preds[n] = bernoulli_rng(alpha_s + beta1_s*Fire[n] + beta2_s*Drought[n] + beta3_s*Herbivory[n]);
    if (survived_preds[n]==1) {
      y_preds[n] = poisson_rng( alpha_h + beta1_h*Fire[n] + beta2_h*Drought[n] + beta3_h*Herbivory[n]  );
    }
    else if (survived_preds[n]==0) {
      y_preds[n] = 0.0;
    }
  }
}

In the generated quantities code above, to simulate the plant heights from your data generation process you simulate dead or alive and then if alive you simulate the height and if dead the height is just zero. (I did not test this code to make sure that else if will work this way in GQ block, but I think this would be the concept).

I’m not sure how you would do this in brms, but it seems to me that you could simply extract the MCMC samples and then use R’s rng functions to simulate predictions with similar code as shown above for the GQ block in a Stan model. For example, see below, where I simulate parameters that would be your MCMC samples extracted from the brms fit. Then plotting quantiles with some heavily modified code from Michael Betancourt’s excellent histograms (hopefully without botching the binning, but you should check for yourself).

M <- 1000  #number of MCMC samples from brms run
N <- 100  #number of observations in data

#data
Fire <- rbinom(N, 1, 0.1)
Drought <- rbinom(N, 1, 0.3)
Herbivory <- rbinom(N, 1, 0.4)

#MCMC samples from posterior probability distributions for model parameters
alpha_s <- rnorm(M, 1.75, 0.1)
beta1_s <- rnorm(M, -1, 0.1)
beta2_s <- rnorm(M, -0.5, 0.1)
beta3_s <- rnorm(M, -0.25, 0.1)
alpha_h <- rnorm(M, 30, 0.1)
beta1_h <- rnorm(M, -10, 0.1)
beta2_h <- rnorm(M, -5, 0.1)
beta3_h <- rnorm(M, -3, 0.1)

#simulate survivor status
p <- matrix(NA, nrow=M, ncol=N)
survived_preds <- matrix(NA, nrow=M, ncol=N)
for (n in 1:N){
  for (m in 1:M) {
  p[m,n] <- plogis(alpha_s[m] + beta1_s[m]*Fire[n] + beta2_s[m]*Drought[n] + beta3_s[m]*Herbivory[n])
  survived_preds[m,n] <- rbinom(1, 1, p[m,n])
  }
}

#simulate predictions of plant heights
lambda <- matrix(NA, nrow=M, ncol=N)
y_preds <- matrix(NA, nrow=M, ncol=N)
for (n in 1:N){
  for (m in 1:M) {
    if (survived_preds[m,n]==1) {
      lambda[m,n] <- alpha_h[m] + beta1_h[m]*Fire[n] + beta2_h[m]*Drought[n] + beta3_h[m]*Herbivory[n]
      y_preds[m,n] <- rpois(1, lambda=lambda[m,n])
    } else {
      y_preds[m,n] <- 0
    }
  }
}

#color settings
library(scales)
c_light_m1 <- c("#edf8e9")
c_light_highlight_m1 <- c("#c7e9c0")
c_mid_m1 <- c("#a1d99b")
c_mid_highlight_m1 <- c("#74c476")
c_dark_m1 <- c("#31a354")
c_dark_highlight_m1 <- c("#006d2c")

#plot quantiles of y_preds for each integer
B <- max(y_preds) + 1
idx <- rep(1:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
x <- (x - 1)

pred_counts <- sapply(1:1000, function(n) 
                      hist(y_preds[n,], breaks=(0:B) - 0.5, plot=FALSE)$counts)

probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred<- sapply(1:B, function(b) quantile(pred_counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n]))

plot(1, type="n", main="",
     xlim=c(-0.5, B + 0.5), xlab="height(cm)",
     ylim=c(0, 30), ylab="", cex.lab=2 , cex.axis=1.9)
title("Posterior predictions for plant height", line = 0.5, font.main=1, cex.main=2.2)
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
        col = alpha(c_light_m1, 0.7), border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
        col = alpha(c_light_highlight_m1, 0.7), border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
        col = alpha(c_mid_m1, 0.7), border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
        col = alpha(c_mid_highlight_m1, 0.7), border = NA)
lines(x, pad_cred[5,], col=alpha(c_dark_m1, 0.7), lwd=2)


Then you would just overlay the histogram of your observations onto this plot to get your posterior retrodictive check.

2 Likes

Is this some sort of super controlled experiment? …if not I would worry about some sort of survivorship bias in estimating the effect of those stressors on plant height. It’s difficult for me to conceptualize the narratively generative story for this data.

1 Like

Yes, this is from a controlled experiment, where fire, drought, and herbivory are randomly assigned to plants.

However, even in random experiments we need to be careful about interpreting model parameters after conditioning on the post-treatment survival M. See Chapter 26 on Principal Stratification in Peng Ding’s book). The distributions of potential heights Y(\text{Fire}, \text{Drought}, \text{Herbivory}) are different for those that survived even with all these harmful treatments M(\text{Fire}, \text{Drought}, \text{Herbivory}) = 1 versus those that survived in the much easier control environment M(\text{Fire}, \text{Drought}, \text{Herbivory}) = 0.

So if we have a two-part model, the part that conditions on survival cannot be casually interpreted causally. However, I believe we can use the full model to predict potential heights (including those that end up zero due to death).

1 Like

Wow, thank you for modifying @betanalpha’s excellent histogram code ! I was just looking at those histograms in one of his many awesome case studies.

@paul.buerkner is there a way to simulate posterior predictive draws from a two-part model like this directly in brms ?

Thanks, all !!!

2 Likes

My above simulation code should work for your brms models. All you need to do is replace the simulated model parameters with your extracted parameters MCMC samples from brms, switch out the rpois() for rnorm() (being sure to wrap your formula for mu for y_preds in exp() and adding sigma in to the rng call), decide on some good bins for the plotting code, and add in a line for your observations of y.