Integrated LOO with a PGLMM

Hi everyone. I’m trying to write a phylogenetic generalized linear mixed model for use with microbial strain data. I’m trying to use LOO to do a model comparison against a model with the phylogenetic component removed, but I’m running into some issues.

Using cmdstanr’s loo() method with log_lik[i] = normal_id_glm_lpdf(...) was producing many bad Pareto k diagnostic values, particularly with some of my real-world datasets which are sometimes >50% “bad”. I saw this tweet and I’ve been trying the integration-based method but I have a couple questions and a problem:

  • Question: Is it impactful to disregard the covariance of the elements of phylo_effect terms in the integrand function? It seems like it would be pretty tricky to pack the covariance matrix into the array[] real x_r argument. Or should I just pass in the appropriate element of z_1 and use std_normal_lpdf()?
  • Problem: The code below has some sort of bug when calling normal_id_glm_lpdf() in the integrand that I can’t figure out. The error message shows up on every chain, saying it detects infinite values in the independent variable matrix but I never see any.
Chain 4 Exception: Exception: normal_id_glm_lpdf: Matrix of independent variables is inf, but must be finite! (in 'C:/Users/aghazi/AppData/Local/Temp/RtmpIpsjz3/model-347421127513.stan', line 19, column 4 to line 23, column 53) (in 'C:/Users/aghazi/AppData/Local/Temp/RtmpIpsjz3/model-347421127513.stan', line 83, column 4 to line 92, column 41)

I started out with a model from following along with the brms PGLMM vignette but I’ve made some simplifications to make it more readable. The code blocks below give the model and some R code to generate synthetic data.

Thanks for any help on this!


The model:

functions {
  real integrand(real phylo_effect, real notused,
                 array[] real theta,
                 array[] real yXi, array[] int Kc) {

    real sigma_phylo = theta[1];
    real sigma_resid = theta[2];
    real Intercept = theta[3];
    int ntheta = num_elements(theta);
    vector[ntheta-3] b = to_vector(theta[4:ntheta]);

    real yi = yXi[1];
    row_vector[Kc[1]] Xci = to_row_vector(yXi[2:(Kc[1]+1)]);

    // V These never show any infinite values
    // print(theta);
    // print(to_matrix(to_vector(yXi)));

    return exp(normal_lpdf(phylo_effect | 0, sigma_phylo) + // is this right given the covariance of phylogenetic effects?
               normal_id_glm_lpdf(yi | to_matrix(Xci), // there are supposedly inf values here
                                       phylo_effect + Intercept,
                                       b,
                                       sigma_resid));
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  matrix[N, N] Lcov;  // cholesky factor of known correlation matrix
  real<lower=0> resid_scale;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma_resid;  // residual noise
  real<lower=0> sigma_phylo;  // phylogenetic noise
  vector[N] z_1;
}
transformed parameters {
  vector[N] phylo_effect;
  phylo_effect = (sigma_phylo * (Lcov * z_1));
}
model {
  // likelihood including constants
  vector[N] mu = Intercept + phylo_effect;

  target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma_resid);

  // priors including constants
  // target += gamma_lpdf(sigma_phylo / sigma_resid | 1, 2);
  // ^ this can help avoid divergences

  target += student_t_lpdf(sigma_resid | 3, 0, resid_scale)
    - 1 * student_t_lccdf(0 | 3, 0, resid_scale);

  target += student_t_lpdf(sigma_phylo | 3, 0, resid_scale)
    - 1 * student_t_lccdf(0 | 3, 0, resid_scale);

  target += std_normal_lpdf(z_1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  array[N] real yrep;
  vector[N] log_lik;
  for (i in 1:N){
    // V This log_lik value gives bad pareto k diagnostics
    // log_lik[i] = normal_id_glm_lpdf(Y[i] | to_matrix(Xc[i]), Intercept + phylo_effect[i], b, sigma_resid);

    yrep[i] = normal_rng(Intercept + phylo_effect[i] + Xc[i]*b, sigma_resid);

    log_lik[i] = log(integrate_1d(integrand,
                                  negative_infinity(),
                                  positive_infinity(),
                                  append_array({sigma_phylo},
                                               append_array({sigma_resid},
                                                            append_array({Intercept},
                                                                         to_array_1d(b)))),
                                  append_array({Y[i]},
                                               to_array_1d(Xc[i])),
                                  {Kc}));
  }
}

And the R script:

library(ape)
library(cmdstanr)

n = 100
sigma_phylo = 1
sigma_resid = 1

tr = rtree(n)

vcv = vcv.phylo(tr)
s = sqrt(diag(vcv))
cor_mat = diag(1/s) %*% vcv %*% diag(1/s)
rownames(cor_mat) = rownames(vcv)
Lcov = chol(cor_mat)


dat = data.frame(sample_id = colnames(vcv),
                 x1 = rnorm(n),
                 x2 = rnorm(n))

# intercept = -2, beta[1] = 1, beta[2] = -.5
dat$outcome = sigma_phylo * MASS::mvrnorm(1, mu = rep(0, n), Sigma = cor_mat) +
              sigma_resid * rnorm(n, mean = dat$x1 - .5*dat$x2 - 2, sd = 1)

pglmm_model = cmdstanr::cmdstan_model(stan_file = "C:/Users/aghazi/strain_stats/src/stan_files/cont_pglmm_int_loo.stan",
                                      quiet = TRUE)

data_list = list(N = n,
                 Y = dat$outcome,
                 K = 3,
                 X = model.matrix(outcome ~ x1 + x2, data = dat),
                 Lcov = Lcov,
                 resid_scale = sd(dat$outcome))


pglmm_fit = pglmm_model$sample(data = data_list,
                               chains = 4,
                               parallel_chains = 4,
                               iter_warmup = 500,
                               iter_sampling = 500,
                               refresh = 50)

This was an interesting one! The error isn’t too informative in this case, because the issue is actually caused by the intercept parameter, more specifically the integrated outcome phylo_effect. This quantity is taking on exceedingly large values (possibly because of the bounds of integration?) which is causing an overflow in one of the normal_id_glm calculations.

Part of the normal_id_glm likelihood calculates:

y_scaled = x * b;
y_scaled = (y - y_scaled - alpha) / sigma;
sum(y_scaled * y_scaled)

Which in terms of your input variables is:

y_scaled = Xci * b;
y_scaled = (yi - y_scaled - (phylo_effect + Intercept)) / sigma_resid;
sum(y_scaled * y_scaled)

If we print() the input variables and look at the iteration where the error occurs:

Chain 1 yi: -2.1641 
Chain 1 Xci[[0.454296,0.195315]] 
Chain 1 phylo_effect: 1.79769e+308 
Chain 1 Intercept: -2.52091 
Chain 1 b: [0.66128,-0.723408] 
Chain 1 sigma_resid: 1.18568 
Chain 1 Exception: Exception: normal_id_glm_lpdf: Matrix of independent variables is inf, but must be finite! (in 'C:/Users/he184239/AppData/Local/Temp/RtmpApmzgP/model-1f246bde7728.stan', line 26, column 4 to line 30, column 53) (in 'C:/Users/he184239/AppData/Local/Temp/RtmpApmzgP/model-1f246bde7728.stan', line 90, column 4 to line 95, column 41) 

You can already see that phylo_effect is very large (1.79769e+308), substituting these values into the definition above you can see where the Inf comes from:

> y_scaled = t(c(0.454296,0.195315)) %*% (c(0.66128,-0.723408))
> y_scaled
          [,1]
[1,] 0.1591244

> y_scaled = (-2.1641 - y_scaled - (1.79769e+308 + -2.52091)) / 1.18568 
> y_scaled
               [,1]
[1,] -1.516168e+308

> sum(y_scaled * y_scaled)
[1] Inf

I believe you need to adjust your integrating bounds so that phylo_effect doesn’t get so large/small, but I’m not familiar enough with the integrate_1d function to say for sure

1 Like

You should have priors on b and Intercept.

I think your use of integrate_1d is not valid, but before thinking that more carefully, I suggest considering writing the model as multivariate normal.

If the priors on b and Intercept would be normal, then you could write the model as multivariate normal, and integrate mu, b and z_1 analytically out, and use multi_normal_cholesky. Then you could do the leave-one-out cross-validation as described in Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models. That approach takes properly into account the correlation in the covariance matrix.The additional benefit of the multivariate approach is that the remaining posterior would be just 2-dimensional (sigma_resid and sigma_phylo). The multivariate version can get slow if your N is big.

Ah, okay. I had incorrectly assumed integrate_1d avoided this sort of problem auto-magically somehow. Setting the integration limits to e.g. +/- 100 solves the infinity issue, leaving aside the validity of the integrand.

That seems like a really promising approach, but I’m kind of lost at implementing the “integrate out mu, b, and z_1” part properly. I got as far as setting up the normal priors on these “latent” parameters but I don’t understand what the steps should be to get to the multi_normal_cholesky() arguments I need.

I will also eventually have to handle bernoulli outcomes as well though, which if I’m understanding the paper correctly wouldn’t be applicable with this method. I guess ultimately I could use manual k-fold CV.

resid_scale = sd(dat$outcome)

latent_prior_mean = c(mean(dat$outcome), # prior mean on intercept
                      rep(0, 2),         # prior mean on covariate effects
                      rep(0, n))         # prior mean on normal deviates underlying phylo_effect

# Normal(0,1) priors on latent parameters except an empirical estimate on the
# spread of the intercept:
latent_prior_cov = diag(length(latent_prior_mean))
diag(latent_prior_cov)[1] = resid_scale
data{
  ...
  vector[1+K+N] latent_prior_mean;
  matrix[1+K+N, 1+K+N] latent_prior_cov;
}
parameters {
  real<lower=0> sigma_resid;  // residual noise
  real<lower=0> sigma_phylo;  // phylogenetic noise
}
model {
  // Not sure how to get these analytically
  matrix[1+K+N, 1+K+N] latent_post_cov = ???;
  vector[1+K+N] latent_post_mean = ???;

  vector[N] phylo_effect = ???; // Pull these out from latent_post_mean?
  vector[K] b = ???;
  real Intercept = ???;

  cholesky_factor_cov[N] L = ???; // derived from combining Lcov & latent_post_cov somehow?

  // likelihood
  vector[N] mu = Intercept + Xc * b + phylo_effect;
  Y ~ multi_normal_cholesky(mu, L);

  // priors
  target += student_t_lpdf(sigma_resid | 3, 0, resid_scale)
    - 1 * student_t_lccdf(0 | 3, 0, resid_scale);

  target += student_t_lpdf(sigma_phylo | 3, 0, resid_scale)
    - 1 * student_t_lccdf(0 | 3, 0, resid_scale);
}
generated quantities {
  vector[N] log_lik = multi_normal_cholesky_lpdf(Y[i], mu[i], L);
}

Ah, that is quite crucial part of information for any advice I can provide.

Did those help? (I mean adding the priors to your previous code should help, too)

You would not need to present these explicitly at all, and you would just have a joint covariance matrix for these. But there is smaller benefit if you want to combine it with non-normal observation model, so you can also continue with your previous model code.

You can still use the conditional distributions obtained from the phylo covariance matrix for to get the correct normal parameters to be used in the integrate_1d. In your previous code you used marginal distribution, but you need the phylo_effect[i] conditional on all other phylo_effects, which you get from the common multivariate normal equations (sorry, it’s too late for me to copy them here)

Ah, that is quite crucial part of information for any advice I can provide.

Sorry, I was trying to keep my first post from being overwhelming with detail. If there’s some super performant/efficient model I can apply in the continuous case and then fall back to a more general but slower model for bernoulli outcomes that’s perfectly fine.

To provide a bit more scientific context: this model is being used to examine the effect of within-species variation of gut microbes on host health outcomes. We’d like to be able to detect examples where strain-level phylogenetic variation is informative of the outcome e.g. E coli picks up an enzyme in one branch that makes produce some toxin that drives inflammation (which we’ve measured) in the host. Usually the trees are derived from a tool from my lab called StrainPhlan. My test case dataset has data from over 1000 patient samples who have been profiled for strain-level variation in ~100 species by examining variation in marker genes. Each species yields a tree that retains between 200 and 1000 samples (samples are not always able to be placed in the phylogeny). We can reasonably assume that species act independently, so running the model one bug at a time is sufficient. Speed is a bit of a concern, but I want to make sure what I’m doing is correct first. I think my biologist colleagues will end up requesting even more complicated models (particularly longitudinal sampling and/or repeated samples from the same individual) but I’ll cross that bridge when I come to it…

Did those help? (I mean adding the priors to your previous code should help, too)

The priors on b and Intercept helped a small amount just by adding a bit of regularization, but they don’t change the results substantially. From a sampling efficiency point of view the biggest issue is cases where sigma_phylo gets large and sigma_resid approaches 0 – this causes a lot of divergences and sometimes maximum tree depth warnings. I think that situation is sort of acting like a multivariate funnel but it’s difficult to think about. I’ve tried putting in a (sigma_phylo / sigma_resid) ~ Gamma(1,2) term which helps alleviate that behavior somewhat, in addition to adding some needed regularization to the phylogenetic effect we’re interested in in the first place.

In your previous code you used marginal distribution, but you need the phylo_effect[i] conditional on all other phylo_effects, which you get from the common multivariate normal equations.

Okay, that’s exactly what I was wondering in my first post. Is it essential that I run the integrals in Stan via integrate_1d? I was thinking it might be easier to keep track of the bookkeeping / vectorize if I fit the model with Stan but then compute the log_lik terms separately after the fact in R.

No problem. I appreciate not trying to overwhelm. It’s a challenge to see which details can make certain fast approaches infeasible.

I think it’s good to first consider the more accurate model with smaller data, and only if that doesn’t scale to full data, then consider less accurate models, and then these models can be compared with that smaller data size to assess whether the simpler model might be sufficient

Cool

It seems you have great colleagues :D

You can do the computation outside of Stan. The conditional of the normal is described also in Wikipedia Multivariate normal distribution - Wikipedia, or see https://www2.imm.dtu.dk/pubdb/pubs/3274-full.html. R’s integrate() should be fine. You could also test sampling based approach, where for each Stan posterior draw, you draw additional draws from that joint normal, and compute Monte Carlo estimate of the integral. You can use Pareto-khat diagnostic then to check whether that Monte Carlo estimate is reliable

That I do.

Thanks for the help on this so far. I think I’ve got it working for small simulated datasets at least (though I’d really appreciate it if someone could double-check the validity of my vec_integrand() function). I’ve been pursuing the integration approach over the marginalization approach because A) the latent phylogenetic effect parameters are of scientific interest / help visualize results and B) I think it should be easier to generalize to non-normal outcomes / more complicated model structures eventually. I was running into some underflow problems on larger datasets so I had to use a LogSumExp trick to compute the integral (LogIntExp? it made me feel like a sunglasses emoji). I’ve attached my solution in case it’s helpful to someone else. I still sometimes get infrequent less than Good Pareto k values (~1%) but they’re much better than before and I think longer chains with higher adapt_delta might be enough to fix them.

It’s possible to speed up the integrate() call substantially and avoid over/underflow in some circumstances (when the posterior mean is far from the value that maximizes the log-likelihood or when the log-likelihood function is extremely sharp) by setting the offset term and integration limits in a more intelligent way (or even just doing a quadratic approximation of the integrand – not sure when that would be violated), but what I have here should be generally applicable.

edit: One lingering question that I have. Are these integrated log_lik values comparable to those generated on a simpler model using the usual loo() function? That is, can I do loo_compare(integrated_pglmm_loo_result, simple_model$loo()) or do I need to get integrated log-likelihood values in the simple model too?

library(tidyverse)
library(ape)
library(cmdstanr)
library(loo)
# library(furrr); plan(multisession, workers = 4)
# library(progressr); handlers(global = TRUE)


# simulate data -----------------------------------------------------------

n = 100
sigma_phylo_true = 1
sigma_resid_true = 1

tr = rtree(n)

vcv = ape::vcv.phylo(tr)
s = diag(vcv) |> sqrt()
cor_mat = diag(1/s) %*% vcv %*% diag(1/s)
Lcov = cor_mat |> chol() |> t()

real_phylo_effects = sigma_phylo_true * MASS::mvrnorm(1, mu = rep(0, n), Sigma = cor_mat)

x = rnorm(n) # random linear covariate

outcome = real_phylo_effects + rnorm(n, mean = .6*x, sd = sigma_resid_true)

model_input = data.frame(sample_id = colnames(vcv),
                         outcome = outcome,
                         x = x,
                         real_phylo_effects)

data_list = list(N           = nrow(model_input),
                 Y           = model_input$outcome,
                 K           = 1 + 1,
                 X           = model.matrix(outcome ~ x, data = model_input),
                 Lcov        = Lcov,
                 int_mean    = mean(model_input$outcome),
                 resid_scale = sd(model_input$outcome))


# fit the model -----------------------------------------------------------

pglmm_model = cmdstanr::cmdstan_model(stan_file = "src/stan_files/2022-05-08.stan")
pglmm_fit = pglmm_model$sample(data = data_list, parallel_chains = 4,
                               iter_sampling = 2000, adapt_delta = .98)


# extract results ---------------------------------------------------------

Xc = matrix(nrow = data_list$N,
            ncol = data_list$K - 1)

mx = colMeans(data_list$X)[-1]

for (i in 2:data_list$K) {
  Xc[,i-1] = data_list$X[,i] - mx[i-1]
}

draw_df = pglmm_fit$draws(format = "data.frame") |>
  as_tibble() |>
  select(-matches("std_phylo")) |>
  tidyr::nest(phylo_effects = matches("phylo_effect"),
              beta = matches("^b\\[")) |>
  mutate(phylo_effects = map(phylo_effects,
                             unlist),
         beta = map(beta,
                    ~matrix(unlist(.x), ncol = 1)))

fit_summary = pglmm_fit$summary() |> as_tibble()

# These are decent starting places for offset terms
effect_means = fit_summary |>
  filter(grepl("^phylo_effect", variable)) |>
  pull(mean)

# compute conditional log-likelihood integrals ----------------------------

vec_integrand = function(phylo_effect_vec,
                         mu_bar_j, sigma_bar_j,      # phylo term components
                         sigma_resid, yj, lm_term, # LM term components
                         offset_term,
                         log = FALSE) {
  # This function for the integrand is vectorized because that's what
  # integrate() needs. The log argument keeps the result on the log scale, which
  # can be helpful for finding an offset for numerical precision.

  phylo_term = dnorm(phylo_effect_vec,
                     mean = mu_bar_j,
                     sd = sqrt(sigma_bar_j),
                     log = TRUE)

  model_mean = c(lm_term) + phylo_effect_vec

  fit_term = dnorm(x = yj,
                   mean = model_mean,
                   sd = sigma_resid,
                   log = TRUE)

  res = phylo_term + fit_term - offset_term

  if (!log) res = exp(res)

  return(res)
}

log_lik_i_j = function(j, lm_mean,
                       i_df,
                       effect_means, cov_mat) {
  # i = posterior iteration
  # j = index over phylo_effects

  p = length(effect_means)
  ord = c(j, (1:p)[-j])

  # https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
  sigma12 = cov_mat[ord,ord][1,-1, drop = FALSE]
  sigma21 = t(sigma12)
  sigma22_inv = solve(cov_mat[-j,-j])
  sigma_bar_j = cov_mat[j,j] - c(sigma12 %*% sigma22_inv %*% sigma21)

  mu_bar_j = c(sigma12 %*% sigma22_inv %*% (i_df$phylo_effects[[1]][-j]))
  #^ a = the other phylo effects from iteration i, mu2 = 0

  offset_j = effect_means[j] # Use the posterior mean of phylo_effect[j] as the offset term

  int_res = integrate(vec_integrand,
                      lower            = -Inf,
                      upper            = Inf,
                      mu_bar_j         = mu_bar_j,
                      sigma_bar_j      = sigma_bar_j,
                      sigma_resid      = i_df$sigma_resid,
                      yj               = data_list$Y[j],
                      lm_term          = lm_mean,
                      offset_term      = offset_j)

  ll_ij = log(int_res$value) + offset_j

  ll_ij
}

log_lik_terms_i = function(i_df,
                           effect_means) {

  p = length(effect_means)
  cov_mat = i_df$sigma_phylo^2 * cor_mat

  lm_means = Xc %*% matrix(i_df$beta[[1]], ncol = 1) + i_df$Intercept

  # Map over all p leaves
  map2_dbl(1:p, c(lm_means),
           log_lik_i_j,
           i_df = i_df,
           effect_means = effect_means,
           cov_mat = cov_mat)
}


get_ll = function(draw_df, max_i, effect_means) {

  # Turn on the progress/furrr library lines above to parallelize & show a progress bar
  # Change map() to furrr::future_map() to parallelize
  # Turn on the two comments below for the progress bar

  # p = progressor(along = 1:max_i)
  map(1:max_i,
      function(.x) {# p()
                    log_lik_terms_i(i_df = draw_df[.x,],
                                    effect_means = effect_means)
      })
}

ll_list = get_ll(draw_df,
                 max_i = nrow(draw_df),
                 effect_means = effect_means) # ; beepr::beep(9)

ll_mat = ll_list |>
  unlist() |>
  matrix(nrow = length(ll_list),
         byrow = TRUE)

colnames(ll_mat) = paste("log_lik[", 1:ncol(ll_mat), "]", sep = "")

library(loo)
int_loo = loo(x = ll_mat, r_eff = relative_eff(exp(ll_mat),
                                               chain_id = draw_df$`.chain`))

plot(int_loo)
data {
  int<lower=1> N;             // total number of observations
  vector[N] Y;                // response variable
  int<lower=1> K;             // number of covariates
  matrix[N, K] X;             // population-level design matrix (covariates only)
  matrix[N, N] Lcov;          // cholesky factor of known correlation matrix
  real int_mean;
  real<lower=0> resid_scale;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;           // centered version of X without an intercept
  vector[Kc] means_X;         // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;               // covariate effects
  real Intercept;             // temporary intercept for centered predictors
  real<lower=0> sigma_resid;  // residual noise
  real<lower=0> sigma_phylo;  // phylogenetic noise
  vector[N] std_phylo_effects;
}
transformed parameters {
  vector[N] phylo_effect;     // scaled phylogenetic effects
  phylo_effect = (sigma_phylo * (Lcov * std_phylo_effects));
}
model {
  // likelihood
  vector[N] mu = Intercept + phylo_effect;
  target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma_resid);

  // priors
  target += normal_lpdf(Intercept | int_mean, resid_scale);
  target += normal_lpdf(b | 0, 1);

  target += student_t_lpdf(sigma_resid | 3, 0, resid_scale)
             - 1 * student_t_lccdf(0 | 3, 0, resid_scale);

  target += student_t_lpdf(sigma_phylo | 3, 0, resid_scale)
             - 1 * student_t_lccdf(0 | 3, 0, resid_scale);

  target += std_normal_lpdf(std_phylo_effects);
}

Cool! I didn’t have time to check your code

Does that roaches example help to see this?

Ah yes, I see the relevant part in there. Good to know.

1 Like