How to compute expected value of the posterior predictive distribution (epred)

Rather than cluttering up other threads, I thought it might be helpful to start a topic for this discussion specifically.

In trying to understand what expectations of predictions represent, I’ve been trying to compute them in generated quantities, with little success.

I am currently working on a beta-binomial model. I used brms first to generate the Stan code, then modified it for my use case.

In looking through the brms source on GitHub, I found this:

posterior_epred_beta_binomial <- function(prep) {
  # beta part included in mu
  trials <- data2draws(prep$data$trials, dim_mu(prep))
  prep$dpars$mu * trials
}

versus posterior_predict_beta_binomial:

posterior_predict_beta_binomial <- function(i, prep, ntrys = 5, ...) {
  rdiscrete(
    n = prep$ndraws, dist = "beta_binomial",
    size = prep$data$trials[i],
    mu = get_dpar(prep, "mu", i = i),
    phi = get_dpar(prep, "phi", i = i),
    lb = prep$data$lb[i], ub = prep$data$ub[i],
    ntrys = ntrys
  )
}
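
If I'm reading these right, posterior_epred returns the per-draw mean, trials * mu (phi drops out of the mean entirely), while posterior_predict simulates actual outcomes. In rough R terms (just a sketch, not brms's actual code: mu, phi, and trials here stand for hypothetical vectors of posterior draws and trial counts, and rbbinom from the extraDistr package stands in for brms's internal beta-binomial RNG):

library(extraDistr)  # for rbbinom(n, size, alpha, beta)

# posterior_epred-style values: the per-draw expected count
epred_draws <- mu * trials

# posterior_predict-style values: one simulated outcome per posterior draw
ypred_draws <- rbbinom(length(mu), size = trials,
                       alpha = mu * phi,
                       beta  = (1 - mu) * phi)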

So I attempted to add that to the generated quantities block:

functions {
  /* compute correlated group-level effects
   * Args:
   *   z: matrix of unscaled group-level effects
   *   SD: vector of standard deviation parameters
   *   L: cholesky factor correlation matrix
   * Returns:
   *   matrix of scaled group-level effects
   */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
}

data {
  
  int<lower=1> N; // number of observations
  array[N] int Y; // response 
  array[N] int nTrials; // number of trials
  int<lower=1> K; // number of coefficients
  matrix[N, K] X; // design matrix
  // data for group-level effects subjects
  int<lower=1> nSubj; // number of subjects
  int<lower=1> mSubj; // number of coefficients per level
  array[N] int<lower=1> Subj; // subject index
  // subject-level noise-level values
  vector[N] NL_0;
  vector[N] NL_1;
  vector[N] NL_2;
  vector[N] NL_3;
  int<lower=1> nCor; // number of subject-level correlations
  
}

parameters {
  
  vector[K] b; // regression coefficients
  real<lower=0> phi; // precision parameter
  vector<lower=0>[mSubj] subj_sd; // subject-level standard deviations
  matrix[mSubj, nSubj] subj_z; // standardized subject-level effects
  cholesky_factor_corr[mSubj] L; // cholesky factor of correlation matrix
  
}

transformed parameters {
  
  matrix[nSubj, mSubj] r_subj; // actual subject-level noise effects
  vector[nSubj] r_0;
  vector[nSubj] r_1;
  vector[nSubj] r_2;
  vector[nSubj] r_3;
  // compute actual subject-level effects
  r_subj = scale_r_cor(subj_z, subj_sd, L);
  r_0 = r_subj[ : , 1];
  r_1 = r_subj[ : , 2];
  r_2 = r_subj[ : , 3];
  r_3 = r_subj[ : , 4];
  
  // initialize linear predictor term
  vector[N] mu = rep_vector(0.0, N);
  mu += X * b;
  
  // add more terms to the linear predictor
  for (n in 1 : N) {
    mu[n] += r_0[Subj[n]] * NL_0[n] + r_1[Subj[n]] * NL_1[n]
             + r_2[Subj[n]] * NL_2[n] + r_3[Subj[n]] * NL_3[n];
  }

  mu = inv_logit(mu);
   
}

model {
  
  real lprior = 0; // prior contributions to the log posterior
  // priors 
  lprior += normal_lupdf(b | 0, 1);
  lprior += gamma_lupdf(phi | 0.01, 0.01);
  lprior += normal_lupdf(subj_sd | 0, 1);
  lprior += lkj_corr_cholesky_lupdf(L | 1);
  target += lprior;
  target += std_normal_lupdf(to_vector(subj_z));
  
  // likelihood 
  
  for (n in 1 : N) {
    target += beta_binomial_lupmf(Y[n] | nTrials[n], mu[n] * phi, (1 - mu[n]) * phi);
  }
  
}

generated quantities {
  
  vector[N] ypred;
  vector[N] epred;
  vector[N] log_lik;
  
  // compute subject-level correlations
  corr_matrix[mSubj] Cor = multiply_lower_tri_self_transpose(L);
  vector<lower=-1, upper=1>[nCor] cor;
  // extract upper diagonal of correlation matrix
  for (k in 1 : mSubj) {
    for (j in 1 : (k - 1)) {
      cor[choose(k - 1, 2) + j] = Cor[j, k];
    }
  }

  for (i in 1:N) {
    epred[i] = mu[i] * nTrials[i];
  }
  
  for (i in 1:N){
    ypred[i] = beta_binomial_rng(nTrials[i], mu[i] * phi, (1 - mu[i]) * phi);
  }
  
  for (i in 1:N){
    log_lik[i] = beta_binomial_lpmf(Y[i] | nTrials[i], mu[i] * phi, (1 - mu[i]) * phi);
  }
  
}


The samples of ‘epred’ are nothing like what I get by using posterior_epred from brms.

brms is giving me epreds in the 2000-3000 range, while my generated quantities are in the 0-2 range.

I’m clearly missing something, but I’m not sure what.

Can anyone assist with how I can calculate comparable epreds?

If I take draws of mu,

draws <- sample(1:4000, size = 250)
mu <- as_draws_matrix(subset_draws(Fit$draws()
                                   , variable = "mu"
                                   , draw = draws))

combine them with my input data,

Pred_Frame <- cbind(Data,
                    as_tibble(t(mu))) %>%
  pivot_longer(`1`:last_col(), names_to = ".draw", values_to = "mu") %>% 
  mutate(Prob = as.numeric(mu)
         , .draw = as.numeric(.draw))

then take the mean of mu for each group and each draw,

Epred_Frame <- Pred_Frame %>%
  group_by(Group, Condition,  .draw) %>%
  summarize(epred = mean(mu))

I seem to get pretty close.

Is that the correct interpretation/implementation?

I’d recommend this: 27 Posterior Predictive Sampling | Stan User’s Guide

The short answer is that computing an expectation with respect to the posterior predictive distribution $p(\tilde{y} \mid y)$, where $\tilde{y}$ is “test” data and $y$ is “training” data, goes as follows:

$$\mathbb{E}[f(\tilde{y}) \mid y] \approx \frac{1}{M} \sum_{m=1}^M f(\tilde{y}^{(m)}),$$

where $\tilde{y}^{(m)} \sim p(\tilde{y} \mid y)$. It’s important to include sampling uncertainty as well as estimation uncertainty when sampling $\tilde{y}^{(m)}$. For instance, if you have a beta-binomial model and posterior draws $\theta^{(m)} \sim p(\theta \mid y)$ for model parameters $\theta$, then you want to simulate $\tilde{y} \sim \textrm{beta-binomial}(N, \theta)$ rather than just using the expectation $N \cdot \theta$.
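
In R, that simulation step for a single observation might look like the following sketch (mu_draws and phi_draws are hypothetical vectors of M posterior draws, trials is the trial count, and rbbinom comes from the extraDistr package):

library(extraDistr)

M <- length(mu_draws)

# One y_tilde per posterior draw: this carries both estimation
# uncertainty (mu, phi) and sampling uncertainty (rbbinom)
y_tilde <- rbbinom(M, size = trials,
                   alpha = mu_draws * phi_draws,
                   beta  = (1 - mu_draws) * phi_draws)

# Monte Carlo estimate of E[f(y_tilde) | y], here with f the identity
mean(y_tilde)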

Thanks!

The manual gives this example:

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  int<lower=0> y_tilde = poisson_rng(lambda);
}

The expectations are computed as:

generated quantities {
  real f_val = f(y_tilde, theta);
  // ...
}
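
If I follow, averaging the y_tilde draws in R then estimates E[y_tilde | y], which for this Poisson model should land near the posterior mean of lambda. Something like this sketch, assuming a cmdstanr fit called fit:

library(posterior)

y_tilde <- as.numeric(as_draws_matrix(fit$draws("y_tilde")))
lambda <- as.numeric(as_draws_matrix(fit$draws("lambda")))

mean(y_tilde)  # Monte Carlo estimate of E[y_tilde | y]
mean(lambda)   # should be close, since E[y_tilde | lambda] = lambda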

Am I correct that I would use

vector[N] epred;

epred += beta_binomial(y_tilde, mu * phi, (1 - mu) * phi);

Is that what brms does for posterior_epred()?

That seems to conflict with the documentation, which says epred doesn’t include the variance/dispersion parameters.

I tried this with a student-t model and all epred samples were NaN:

  vector[N] ypred;
  vector[N] log_lik;
  vector[N] epred;

  for (i in 1:N) {
    ypred[i] = student_t_rng(nu, mu[i], sigma[i]);
  }

  epred += student_t_lpdf(ypred | nu_unscaled, mu_unscaled, sigma_unscaled);

  for (i in 1:N) {
    log_lik[i] = student_t_lpdf(Y[i] | nu, mu[i], sigma[i]);
  }

Still trying to dig further, I fit a student model with brms and called add_epred_draws() from tidybayes.

The .epred values are different for every subject in the same conditions, so there is still individual variability represented by .epred.

When I fit the same model with cmdstanr and then take draws of mu, they are very different from the values of .epred.

I also tried

for (i in 1:N) {
  epred[i] = student_t_lpdf(ypred[i] | nu, mu[i], sigma[i]);
}

Those values are nowhere close to what I get from add_epred_draws, either.

This is what brms has for epred from the student family:

posterior_epred_student <- function(prep) {
  if (!is.null(prep$ac$lagsar)) {
    prep$dpars$mu <- posterior_epred_lagsar(prep)
  }
  prep$dpars$mu
}
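
So for the student family the epred draws should literally be the draws of mu. If that's right, a quick check against a brms fit would be something like this (a sketch, assuming a fitted student model called fit with an identity link):

library(brms)

epred <- posterior_epred(fit)                    # ndraws x nobs
mu <- posterior_linpred(fit, transform = TRUE)   # inverse-link of the linear predictor

all.equal(epred, mu)  # should be TRUE for student with identity link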

I think @andrewheiss summarizes it nicely here:


(here’s the full post, btw: Visualizing the differences between Bayesian posterior predictions, linear predictions, and the expectation of posterior predictions | Andrew Heiss)


This was extremely helpful! Thank you!

What I’d like to do is recreate posterior_epred manually, which you show nicely for the gaussian case.

What would “the average of each draw from posterior_predict” look like?

Posterior_predict is easy to do in generated quantities using *_rng functions. But what would I average over to get epred?

For a Student-t example, if I have

generated quantities {
  vector[N] ypred;

  for (i in 1:N) {
    ypred[i] = student_t_rng(nu, mu[i], sigma[i]);
  }
}

That will give me posterior predictions (4000 for each observation i in the data). What do I average over?

posterior_epred from a brms fit will give 4000 epreds for each observation.

See this notebook I made when working through Bayes Rules! chapter 9, where I make their example model in rstanarm, brms, and raw Stan: bayesf22 Notebook - 9: Simple normal regression

In the Stan code, I use *_rng to generate posterior predictions, like you said:

generated quantities {
  vector[n] Y_rep;
  
  for (i in 1:n) {
    Y_rep[i] = normal_rng(mu[i], sigma);
  }
}

If you go down to the “Posterior Prediction” section (bayesf22 Notebook - 9: Simple normal regression) and click on the Stan tab, you can see how to get epreds from those
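
The gist is that for this normal model the epred draws are just the per-draw linear predictor, so no _rng call is involved. A minimal sketch of that post-processing (beta0, beta1, and x here are hypothetical vectors of posterior draws and predictor values, not objects from the notebook):

# One row per posterior draw, one column per observation:
# for a normal model, E[y | beta, x] = beta0 + beta1 * x
epred_draws <- sapply(x, function(xi) beta0 + beta1 * xi)

# Averaging over draws (rows) collapses to one value per observation
colMeans(epred_draws)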


Thank you very much!!!

Would the equivalent to brms epred be the mean of predictions for each row/observation?

I’m still confused by why there are multiple epred draws per row from brms


While digging through the available functions, I found rstudent_t(n, df, mu = 0, sigma = 1) in brms.

So I tried this:

Epred_Frame <- left_join(Data %>%
                           mutate(row = row_number())
                         , fit %>%
                           spread_draws(mu[row], nu
                                        , ndraws = 100) %>%
                           mutate(epred = rstudent_t(n(), df = nu, mu = mu, sigma = 1)) %>%
                           ungroup() %>%
                           select(row, epred, .draw)
                         , by = "row")

Which gives what looks like a reasonable approximation. Perhaps it’s the fixing of sigma to 1 that is the difference between .epred and .prediction in brms?

Epred is the expectation of the data given the covariates. This expectation is taken draw-wise over the posterior.

Is it correct that posterior_epred is like using posterior_pred and fixing sigma to 1 for the rng?

From the brms code:

rstudent_t <- function(n, df, mu = 0, sigma = 1) {
  if (isTRUE(any(sigma < 0))) {
    stop2("sigma must be non-negative.")
  }
  mu + sigma * rt(n, df = df)
}

So it looks like if sigma were set to 1, the result is based on mu (and its uncertainty). Is that what happens when posterior_epred() is called?

No, that’s not correct. I’m unsure what is giving you that impression. posterior_predict gives a draw from the posterior predictive distribution. posterior_epred gives the posterior distribution of the expectation of the response.
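
A tiny numerical illustration with made-up numbers: even with sigma fixed at 1, rstudent_t returns mu plus unit-scale t noise, not the expectation mu itself.

set.seed(1)
mu <- 2
draws <- mu + 1 * rt(4000, df = 5)  # equivalent to rstudent_t(4000, df = 5, mu = mu, sigma = 1)

mean(draws)  # close to mu on average...
sd(draws)    # ...but each draw still carries noise (sd ~ sqrt(5/3) ~ 1.29)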

I’m trying to calculate it myself, but can’t find anything that comes close.

@andrewheiss's write-up has this for epred:

epred_manual <- model_normal |> 
  spread_draws(b_Intercept, b_flipper_length_mm, sigma) |> 
  mutate(mu = b_Intercept + 
           (b_flipper_length_mm * 
              penguins_avg_flipper$flipper_length_mm),  # This is posterior_linpred()
         y_new = rnorm(n(), mean = mu, sd = sigma))  # This is posterior_predict()

# This is posterior_epred()
epred_manual |> 
  summarize(epred = mean(y_new))

Calling mean averages over the draws, so there is only a single estimate rather than ndraws() draws per row.

What does brms do to get multiple epreds per observation?

This post suggests posterior_epred() would be mu for a Student model:

posterior_epred is just the (conditional) expected value of y, so just mu in this case. sigma doesn’t come into play until you want posterior_predict.


posterior_epred doesn’t average over draws. It’s computed and stored for each draw in brms. So in the student t case you mentioned it would contain all posterior draws of mu (ndraws x nobs if you extract it as a matrix).
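
Concretely, if you fit the same student model with brms and with cmdstanr (keeping mu in transformed parameters, as above), you can line the two up like this (a sketch; fit_brms and fit_stan are hypothetical fit objects):

library(brms)
library(posterior)

epred <- posterior_epred(fit_brms)              # ndraws x nobs matrix
mu <- as_draws_matrix(fit_stan$draws("mu"))     # also ndraws x nobs

# Separate runs won't match draw for draw, but the per-observation
# summaries should agree up to MCMC error
head(colMeans(epred))
head(colMeans(mu))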
