How to compute expected value of the posterior predictive distribution (epred)

Rather than cluttering up other threads, I thought it might be helpful to start a topic for this discussion specifically.

In trying to understand what expectations of predictions represent, I’ve been trying to compute them in generated quantities, with little success.

I am currently working on a beta binomial model. I used brms first to generate the Stan code then modified for my use case

In looking through github, I found this:

posterior_epred_beta_binomial <- function(prep) {
  # beta part included in mu
  trials <- data2draws(prep$data$trials, dim_mu(prep))
  prep$dpars$mu * trials
}

versus posterior_predict_beta_binomial:

posterior_predict_beta_binomial <- function(i, prep, ntrys = 5, ...) {
  rdiscrete(
    n = prep$ndraws, dist = "beta_binomial",
    size = prep$data$trials[i],
    mu = get_dpar(prep, "mu", i = i),
    phi = get_dpar(prep, "phi", i = i),
    lb = prep$data$lb[i], ub = prep$data$ub[i],
    ntrys = ntrys
  )
}

So I attempted to add that to the generated quantities block:

functions {
  /* compute correlated group-level effects
  * Args:
    *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns:
    *   matrix of scaled group-level effects
  */
    matrix scale_r_cor(matrix z, vector SD, matrix L) {
      // r is stored in another dimension order than z
      return transpose(diag_pre_multiply(SD, L) * z);
    }
}

data {
  
  int<lower=1> N; // number of observations
  array[N] int Y; // response 
  array[N] int nTrials; // number of trials
  int<lower=1> K; // number of coefficients
  matrix[N, K] X; // design matrix
  // data for group-level effects subjects
  int<lower=1> nSubj; // number of subjects
  int<lower=1> mSubj; // number of coefficients per level
  array[N] int<lower=1> Subj; // subject index
  // subject-level noise-level values
  vector[N] NL_0;
  vector[N] NL_1;
  vector[N] NL_2;
  vector[N] NL_3;
  int<lower=1> nCor; // number of subject-level correlations
  
}

parameters {
  
  vector[K] b; // regression coefficients
  real<lower=0> phi; // precision parameter
  vector<lower=0>[mSubj] subj_sd; // subject-level standard deviations
  matrix[mSubj, nSubj] subj_z; // standardized subject-level effects
  cholesky_factor_corr[mSubj] L; // cholesky factor of correlation matrix
  
}

transformed parameters {
  
  matrix[nSubj, mSubj] r_subj; // actual subject-level noise effects
  vector[nSubj] r_0;
  vector[nSubj] r_1;
  vector[nSubj] r_2;
  vector[nSubj] r_3;
  // compute actual subject-level effects
  r_subj = scale_r_cor(subj_z, subj_sd, L);
  r_0 = r_subj[ : , 1];
  r_1 = r_subj[ : , 2];
  r_2 = r_subj[ : , 3];
  r_3 = r_subj[ : , 4];
  
  // initialize linear predictor term
  vector[N] mu = rep_vector(0.0, N);
  mu += X * b;
  
  // add more terms to the linear predictor
  for (n in 1 : N) {
    mu[n] += r_0[Subj[n]] * NL_0[n] + r_1[Subj[n]] * NL_1[n]
    + r_2[Subj[n]] * NL_2[n] + r_3[Subj[n]] * NL_3[n];
  }
  
   mu = inv_logit(mu);
   
}

model {
  
  real lprior = 0; // prior contributions to the log posterior
  // priors 
  lprior += normal_lupdf(b | 0, 1);
  lprior += gamma_lupdf(phi | 0.01, 0.01);
  lprior += normal_lupdf(subj_sd | 0, 1);
  lprior += lkj_corr_cholesky_lupdf(L | 1);
  target += lprior;
  target += std_normal_lupdf(to_vector(subj_z));
  
  // likelihood 
  
  for (n in 1 : N) {
    target += beta_binomial_lupmf(Y[n] | nTrials[n], mu[n] * phi, (1 - mu[n]) * phi);
  }
  
}

generated quantities {
  
  vector[N] ypred ;
  vector[N] epred ;
  vector[N] log_lik ;
  
  // compute subject-level correlations
  corr_matrix[mSubj] Cor = multiply_lower_tri_self_transpose(L);
  vector<lower=-1, upper=1>[nCor] cor;
  // extract upper diagonal of correlation matrix
  for (k in 1 : mSubj) {
    for (j in 1 : (k - 1)) {
      cor[choose(k - 1, 2) + j] = Cor[j, k];
    }
  }
  
  
      
      for (i in 1:N){
      epred[i] = mu[i] * nTrials[i] ;
      
      }
  
  for (i in 1:N){
    ypred[i] = beta_binomial_rng(nTrials[i], mu[i] * phi, (1 - mu[i]) * phi);
  }
  
  for (i in 1:N){
    log_lik[i] = beta_binomial_lpmf(Y[i] | nTrials[i], mu[i] * phi, (1 - mu[i]) * phi);
  }
  
}


The samples of ‘epred’ are nothing like what I get by using posterior_epred from brms.

brms is giving me epreds in the 2000-3000 range, my generated quantities are in the 0 to 2 range.

I’m clearly missing something, but I’m not sure what.

Can anyone assist with how I can calculate comparable epreds?

If I take draws of mu,

draws <- sample(c(1:4000), size = 250)
mu <- as_draws_matrix(subset_draws(Fit$draws()
                                       , variable = "mu"
                                       , draw = draws))

combine them with my input data,

Pred_Frame <- cbind(Data,
                    as_tibble(t(mu))) %>%
  pivot_longer(`1`:last_col(), names_to = ".draw", values_to = "mu") %>% 
  mutate(Prob = as.numeric(mu)
         , .draw = as.numeric(.draw))

then take the mean of mu for each group and each sample,

Epred_Frame <- Pred_Frame %>%
  group_by(Group, Condition,  .draw) %>%
  summarize(epred = mean(mu))

I seem to get pretty close.

Is that the correct interpretation/implementation?

I’d recommend this: 27 Posterior Predictive Sampling | Stan User’s Guide

The short answer is that to compute some expectation relative to the posterior predictive distribution, p(\tilde{y} \mid y), where \tilde{y} is “test” data and y is “training” data, goes as follows

\mathbb{E}[f(\tilde{y}) \mid y] \approx \frac{1}{M} \sum_{m=1}^M f(\tilde{y}^{(m)}),

where \tilde{y}^{(m)} \sim p(\tilde{y} \mid y). It’s important to include sampling uncertainty as well as estimation uncertainty for sampling \tilde{y}^{(m)}. For instance, if you have a beta-binomial model and have posterior draws \theta^{(m)} \sim p(\theta \mid y) for model parameters \theta, then you want to simulate \tilde{y} \sim \textrm{beta-binomial}(N, \theta) rather than just using the expectation N \cdot \theta.

Thanks!

The manual gives this example:

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  int<lower=0> y_tilde = poisson_rng(lambda);
}

The expectations are computed as:

generated quantities {
  real f_val = f(y_tilde, theta);
  // ...
}

Am I correct that I would use

vector[N] epred;

 epred  += beta_binomial(y_tilde, mu * phi, (1 -  mu) * phi);

Is that what brms does for posterior_epred()?

That seems to conflict with the documentation saying it doesn’t include the variance/dispersion parameters.

I tried this with a student-t model and all epred samples were nan:

  vector[N] ypred;
  vector[N] log_lik;
  vector[N] epred;
  
    for(i in 1:N){
	  ypred[i] = student_t_rng(nu, mu[i], sigma[i] ) ;
	   }
  
  epred += student_t_lpdf(ypred | nu_unscaled, mu_unscaled, sigma_unscaled);

 for(i in 1:N){
	  log_lik[i] = student_t_lpdf(Y[i] | nu, mu[i], sigma[i]); ;
	   }

Still trying to dig further, I fit a student model with brms and called add_epred_draws() from tidybayes.

The .epred values are different for every subject in the same conditions, so there is still individual variability represented by .epred.

When I fit the same model with cmdstanr and then take draws of mu, there are very different from the values of .epred.

I also tried

for(i in 1:N){
                epred[i] = student_t_lpdf(ypred[i] | nu, mu[i], sigma[i]); ;
        }

Those values are nowhere close to what I get from add_epred_draws, either.

This is what brms has for epred from the student family:

posterior_epred_student <- function(prep) {
  if (!is.null(prep$ac$lagsar)) {
    prep$dpars$mu <- posterior_epred_lagsar(prep)
  }
  prep$dpars$mu
}

I think @andrewheiss summarizes it nicely here:

(here’s the full post, btw: Visualizing the differences between Bayesian posterior predictions, linear predictions, and the expectation of posterior predictions | Andrew Heiss)

This was extremely helpful! Thank you!

What I’d like to do is recreate posterior_epred manually, which you show nicely for the gaussian case.

What would “the average of each draw from posterior_predict” look like?

Posterior_predict is easy to do in generated quantities using *_rng functions. But what would I average over to get epred?

For a Student-t example, if I have

generated quantities {

vector[N] ypred;

for(i in 1:N){
                ypred[i] = student_t_rng(nu, mu[i], sigma[i] ) ;
        }

}

That will give me posterior predictions (4000 for each observation in the data (i). What do I average over?

posterior_epred from a brms fit will give 4000 epred’s for each observation.

See this notebook I made when working through Bayes Rules! chapter 9, where I make their example model in rstanarm, brms, and raw Stan: bayesf22 Notebook - 9: Simple normal regression

In the Stan code, I use *_rng to create posterior predict, like you said:

generated quantities {
  vector[n] Y_rep;
  
  for (i in 1:n) {
    Y_rep[i] = normal_rng(mu[i], sigma);
  }
}

If you go down to the “Posterior Prediction” section (bayesf22 Notebook - 9: Simple normal regression) and click on the Stan tab, you can see how to get epreds from those

Thank you very much!!!

Would the equivalent to brms epred be the mean of predictions for each row/observation?

I’m still confused by why there are multiple epred draws per row from brms

In digging around through available functions, I found rstudent_t(n, df, mu = 0, sigma = 1) in brms

So I tried this:

Epred_Frame <- left_join(Data %>%
                      mutate(row = row_number())
              , fit %>%
                      spread_draws(mu[row], nu
                                   , ndraws = 100) %>%
                      mutate(epred = rstudent_t(n(), df = nu, mu = mu, sigma = 1)) %>%
                      ungroup() %>%
                      select(row, epred, .draw)
              , by = "row")

Which gives what looks like a reasonable approximation. Perhaps it’s the fixing of sigma to 1 that is the difference between .epred and '.predictioninbrms`?

Epred is the expectation of the data given the covariates. This expectation is taken draw-wise over the posterior.

Is it correct that posterior_epred is like using posterior_pred and fixing sigma to 1 for the rng?

From the brms code:

rstudent_t <- function(n, df, mu = 0, sigma = 1) {
  if (isTRUE(any(sigma < 0))) {
    stop2("sigma must be non-negative.")
  }
  mu + sigma * rt(n, df = df)
}

So it looks like if sigma were set to 1, the result is based on mu (and it’s uncertainty). Is that what happens when posterior_epred() is called?

No, that’s not correct. I’m unsure of what is giving you that impression. posterior_pred gives a draw from the posterior predictive distribution. posterior_epred gives the posterior distribution for the expectation of the response.

I’m trying to calculate it myself, but can’t find anything that comes close.

@andrewheiss 's write up has this for epred:

epred_manual <- model_normal |> 
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |> 
  mutate(mu = b_Intercept + 
           (b_flipper_length_mm * 
              penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

# This is posterior_epred()
epred_manual |> 
  summarize(epred = mean(y_new))

Calling mean averages over the draws, so there is a only single estimate/draw rather than draws of ndraws() per row.

What does brms do to get multiple epreds per observation?

This post:

Suggests posterior_epred() would be mu for a Student model.

posterior_epred is just the (conditional) expected value of y, so just mu in this case. sigma doesn’t come into play until you want posterior_predict.

posterior_epred doesn’t average over draws. It’s computed and stored for each draw in brms. So in the student t case you mentioned it would contain all posterior draws of mu (ndraws x nobs if you extract it as a matrix).