LOO in Meta-analysis with mixed aggregate and individual data

I’ve been experimenting with the multinma package for multi-level network meta-regression (ML-NMR), and I’ve run into something that’s left me unsure how to interpret LOO-CV results in this setting.

In the multinma package it looks like each individual patient from IPD contributes one observation to the likelihood, while each arm from an aggregate (AgD) study contributes a single observation. This means that if we use LOO on the combined IPD+AgD network, the number of “data points” driving the estimate is the sum of IPD patients and aggregate arms.

This raises a few questions I’m not sure how to answer:

  1. Are we really getting a mix of leave-one-patient-out and leave-one-study-out error estimates? I tried to explore this a bit below by comparing an aggregate vs. an individual-level model on the aggregate-data studies only (without covariate adjustment this just means expanding the binomial counts into 0/1 rows), and I do indeed get very different estimates from loo.
  2. Could this mixed LOO drive different model selection than having everything be leave-one-study-out? For example, if LOO is dominated by the many IPD observations, could that cause model selection to favor a model that predicts well within the IPD studies but poorly in the AgD studies?
  3. I notice that LOO has all bad Pareto k estimates when evaluated on the aggregate data only. I assume that means we’d have many or all bad Pareto k estimates if we tried to estimate everything as leave-one-study-out? When expanding the aggregate data to individual patients I still get 2 bad and 1 very bad Pareto k, but all others are now “good”.

Minimal example

library(multinma)
library(loo)
library(dplyr)

# Load psoriasis data (includes both IPD and AgD)
pso_ipd <- filter(plaque_psoriasis_ipd,
                  studyc %in% c("UNCOVER-1", "UNCOVER-2", "UNCOVER-3"))

pso_agd <- filter(plaque_psoriasis_agd,
                  studyc == "FIXTURE")

pso_ipd <- pso_ipd %>% 
  mutate(# Variable transformations
    bsa = bsa / 100,
    prevsys = as.numeric(prevsys),
    psa = as.numeric(psa),
    weight = weight / 10,
    durnpso = durnpso / 10,
    # Treatment classes
    trtclass = case_when(trtn == 1 ~ "Placebo",
                         trtn %in% c(2, 3, 5, 6) ~ "IL blocker",
                         trtn == 4 ~ "TNFa blocker"),
    # Check complete cases for covariates of interest
    complete = complete.cases(durnpso, prevsys, bsa, weight, psa)
  ) %>% 
  dplyr::filter(complete)

pso_agd <- pso_agd %>% 
  mutate(
    # Variable transformations
    bsa_mean = bsa_mean / 100,
    bsa_sd = bsa_sd / 100,
    prevsys = prevsys / 100,
    psa = psa / 100,
    weight_mean = weight_mean / 10,
    weight_sd = weight_sd / 10,
    durnpso_mean = durnpso_mean / 10,
    durnpso_sd = durnpso_sd / 10,
    # Treatment classes
    trtclass = case_when(trtn == 1 ~ "Placebo",
                         trtn %in% c(2, 3, 5, 6) ~ "IL blocker",
                         trtn == 4 ~ "TNFa blocker")
  )


expand_pasi75_to_ipd <- function(agd) {
  do.call(rbind, lapply(seq_len(nrow(agd)), function(i) {
    row <- agd[i, ]
    
    # responders: pasi75_r rows of 1
    responders <- data.frame(
      studyc = row$studyc,
      trtc   = row$trtc,
      pasi75 = rep(1, row$pasi75_r)
    )
    
    # non-responders: (pasi75_n - pasi75_r) rows of 0
    nonresponders <- data.frame(
      studyc = row$studyc,
      trtc   = row$trtc,
      pasi75 = rep(0, row$pasi75_n - row$pasi75_r)
    )
    
    rbind(responders, nonresponders)
  }))
}

ipd_like <- expand_pasi75_to_ipd(pso_agd)

pso_agd_as_ipd_net <- set_ipd(ipd_like,
                              study = studyc,
                              trt = trtc,
                              r = pasi75)


pso_net <- combine_network(
  set_ipd(pso_ipd, 
          study = studyc, 
          trt = trtc, 
          r = pasi75,
          trt_class = trtclass),
  set_agd_arm(pso_agd, 
              study = studyc, 
              trt = trtc, 
              r = pasi75_r, 
              n = pasi75_n,
              trt_class = trtclass)
)

pso_net <- add_integration(pso_net,
                           durnpso = distr(qgamma, mean = durnpso_mean, sd = durnpso_sd),
                           prevsys = distr(qbern, prob = prevsys),
                           bsa = distr(qlogitnorm, mean = bsa_mean, sd = bsa_sd),
                           weight = distr(qgamma, mean = weight_mean, sd = weight_sd),
                           psa = distr(qbern, prob = psa),
                           n_int = 64
)

pso_net_agd <- set_agd_arm(pso_agd, 
                           study = studyc, 
                           trt = trtc, 
                           r = pasi75_r, 
                           n = pasi75_n,
                           trt_class = trtclass)


pso_fit_FE <- nma(pso_net, 
                  trt_effects = "fixed",
                  link = "probit", 
                  likelihood = "bernoulli2",
                  class_interactions = "common",
                  prior_intercept = normal(scale = 10),
                  prior_trt = normal(scale = 10),
                  prior_reg = normal(scale = 10),
                  init_r = 0.1,
                  QR = TRUE)


pso_fit_FE_agd <- nma(pso_net_agd, 
                      trt_effects = "fixed",
                      link = "probit", 
                      likelihood = "binomial2",
                      prior_intercept = normal(scale = 10),
                      prior_trt = normal(scale = 10),
                      prior_reg = normal(scale = 10),
                      init_r = 0.1,
                      QR = TRUE)

pso_fit_FE_agd_as_ipd <- nma(pso_agd_as_ipd_net, 
                             trt_effects = "fixed",
                             link = "probit", 
                             likelihood = "bernoulli2",
                             prior_intercept = normal(scale = 10),
                             prior_trt = normal(scale = 10),
                             prior_reg = normal(scale = 10),
                             init_r = 0.1,
                             QR = TRUE)

# Compute LOO
loo_combined <- loo(pso_fit_FE)
loo_agd_only <- loo(pso_fit_FE_agd)
loo_agd_as_ipd <- loo(pso_fit_FE_agd_as_ipd)
print(loo_combined) # ll matrix is 4000 x 3858 consistent with one obs per IPD patient + one per agd arm
print(loo_agd_only) # ll matrix is 4000 x 4 consistent with one obs per study arm
print(loo_agd_as_ipd) # ll matrix is 4000 x 1297

nrow(pso_ipd) + nrow(pso_agd) # 3858, matches the number of columns in loo_combined
nrow(pso_agd)                 # 4, matches the number of columns in loo_agd_only


I’m not familiar with the multinma package and I don’t know how binomial2 and bernoulli2 differ from binomial and bernoulli, but if they are not different, then in the Bernoulli case loo() approximates leaving out a single 0/1 observation, and in the binomial case loo() approximates leaving out many Bernoulli events aggregated together. These will give different results and can lead to different conclusions. LOO-CV with Bernoulli observations is usually weaker at detecting differences between models than with binomial observations.
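
To make the grouping difference concrete, here is a small plain-R sketch (nothing multinma-specific, just made-up numbers): the binomial log-density equals the sum of the corresponding Bernoulli log-densities plus a constant that does not depend on the model, but for loo() the grouping determines what gets left out at once.

# Plain-R sketch (not multinma-specific): the same 20 events, two groupings.
p <- 0.3; n <- 20; r <- 7
sum_bern  <- sum(dbinom(c(rep(1, r), rep(0, n - r)), size = 1, prob = p, log = TRUE))
one_binom <- dbinom(r, size = n, prob = p, log = TRUE)
one_binom - sum_bern   # = lchoose(n, r), a constant not depending on the model
# For loo() the grouping matters: a binomial column is left out as a block of n
# events at once, whereas each Bernoulli column is left out one 0/1 at a time.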

Can you clarify what the difference is between “one observation” and “single observation”?

I think so, but can you clarify the data structure a bit more?

Leaving out more observations changes the posterior more, especially if there are group-specific parameters and all the observations for that group are removed. It’s easiest to use K-fold-CV in such a case. If there is only one group-specific parameter, then in theory 1D quadrature-integrated PSIS-LOO would be possible, but that would require some coding (see the Roaches cross-validation case study for an example).

Sorry, what I mean is that a study with 200 patients contributes 200 observations (columns) to the log-likelihood matrix in loo, whereas an aggregate study with 200 patients enters the matrix only once.

We have studies with individual patient data, so for each patient we see their covariate values and whether or not they had the event. This part of the model is estimated with a Bernoulli likelihood. We have a second set of studies where we only have aggregate data, so the event counts are modelled as binomial, and the linear predictor is marginalized over a set of integration points that represent the assumed joint distribution of covariates.
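
For concreteness, here is a toy sketch of that second part (the numbers and coefficients are made up, not taken from the multinma fit): the individual-level probability is averaged over the integration points, and the arm’s event count then enters as a single binomial term.

# Toy illustration of one AgD arm's likelihood contribution (assumed values;
# multinma handles all of this internally):
set.seed(1)
x_int <- rnorm(64)                        # 64 integration points for one covariate
alpha <- -1; gamma <- 0.8; beta <- 0.5    # assumed study intercept, trt effect, coefficient
p_bar <- mean(pnorm(alpha + gamma + beta * x_int))   # marginal (arm-level) event probability
dbinom(120, size = 200, prob = p_bar, log = TRUE)    # one log-likelihood entry for the arm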

So when we get loo from a multinma object, my impression is that we are getting an approximation of leaving a single 0/1 observation out for the studies with individual participant data, but an approximation of leaving an entire arm of an aggregate study out (all of its aggregated events at once) for the data that are modelled as binomial.

If it helps, attaching the original pub for the method here: https://pmc.ncbi.nlm.nih.gov/articles/PMC7362893/

That was helpful. To me the network part looks just like a regular hierarchical model, and for a large number of studies I think a tabular presentation is more space efficient (see, e.g., the Study vs. Treatment plot in Section 13 of my brms demo).

The description of how to combine individual and aggregate data was very clear in that paper.

If there were many studies (arms?), I would go for leave-one-study-out (as I did in that demo, where all studies had only aggregated data). The joint log score for a whole study is also better at discriminating between different models.
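
If it helps, here is a rough sketch of a PSIS-based leave-one-study-out approximation that sums the pointwise log-likelihood columns within each study before running PSIS. It assumes the underlying stanfit can be accessed as pso_fit_FE$stanfit with a "log_lik" generated quantity whose columns are ordered as the IPD rows followed by the AgD arms, and that the default 4 chains were used; these are assumptions about multinma internals, so check them against your own fit.

# Leave-one-study-out (LOGO) approximation: sum pointwise log-lik within study.
ll <- loo::extract_log_lik(pso_fit_FE$stanfit, parameter_name = "log_lik")
study_id <- c(as.character(pso_ipd$studyc), as.character(pso_agd$studyc))  # one label per column
ll_study <- sapply(unique(study_id),
                   function(s) rowSums(ll[, study_id == s, drop = FALSE])) # draws x studies
chain_id <- rep(1:4, each = nrow(ll_study) / 4)                            # assumes 4 chains
loo_logo <- loo::loo(ll_study,
                     r_eff = loo::relative_eff(exp(ll_study), chain_id = chain_id))
print(loo_logo)  # expect high Pareto k here; K-fold-CV by study is the robust fallback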

If there are not many studies (you seem to have 4), I would separately do predictive checking and calibration checking for the IPD studies. For a small number of aggregated studies there is not much information for predictive and calibration checking. And although the joint log score for studies is more discriminative than pointwise individual log scores, with only 4 studies the uncertainty in the cross-validation comparison will be high.

By computing “de-aggregated” individual likelihoods for the aggregated studies, you do not gain more information, but if you have a small number of studies, you may be able to put the elpds for the individual and aggregated studies on the same scale for comparison.
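
As a rough way to see the scale difference, the arm-level (binomial) and de-aggregated (Bernoulli) elpds should differ mostly by the sum of the log binomial coefficients, a constant that does not depend on the model (using the pasi75_r / pasi75_n columns from your example above):

# Model-independent constant separating the binomial and Bernoulli log scores
sum(with(pso_agd, lchoose(pasi75_n, pasi75_r)))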

All of the above assumes the cross-validation computation just works. If you get high Pareto k values, the easiest choice is to use K-fold-CV instead.
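
For the K-fold route, loo has a helper for building folds that keep each study intact; the per-fold refits and held-out log-likelihood calculations would still need to be coded by hand for multinma, since I am not aware of a built-in kfold() method for its fitted objects (an assumption worth checking).

# Study-balanced folds (with 4 studies and K = 4 this is exactly
# leave-one-study-out); refitting per fold is left to the user.
study_labels <- c(as.character(pso_ipd$studyc), as.character(pso_agd$studyc))
folds <- loo::kfold_split_grouped(K = 4, x = study_labels)
table(folds, study_labels)   # each study should land in exactly one fold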

I’m also interested in seeing the loo() outputs, which you did not include in your first post.

This is helpful, thanks. This dataset is just a small tutorial set where everything runs quickly; in practice we often have maybe 1 dataset with IPD and then 30-40 aggregate studies (in larger networks). I see from this post that re-aggregating the IPD studies is an option, but it may come with consequences regarding high Pareto k values. Since I think it would be a heavy lift (for me) to hack together exact LOO, do you think the issue of mixing LOO for IPD and LOSO for aggregate studies is likely to lead to decision differences in practice, or is it more of an academic/interesting problem?

I have pasted these below.

print(loo_combined) # ll matrix is 4000 x 3858 consistent with one obs per IPD patient + one per agd arm
#> 
#> Computed from 4000 by 3858 log-likelihood matrix.
#> 
#>          Estimate   SE
#> elpd_loo  -1604.8 37.4
#> p_loo         7.7  1.2
#> looic      3209.6 74.8
#> ------
#> MCSE of elpd_loo is NA.
#> MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.5]).
#> 
#> Pareto k diagnostic values:
#>                          Count Pct.    Min. ESS
#> (-Inf, 0.7]   (good)     3856  99.9%   282     
#>    (0.7, 1]   (bad)         2   0.1%   <NA>    
#>    (1, Inf)   (very bad)    0   0.0%   <NA>    
#> See help('pareto-k-diagnostic') for details.
print(loo_agd_only) # ll matrix is 4000 x 4 consistent with one obs per study arm
#> 
#> Computed from 4000 by 4 log-likelihood matrix.
#> 
#>          Estimate  SE
#> elpd_loo    -16.4 0.8
#> p_loo         3.6 0.1
#> looic        32.7 1.6
#> ------
#> MCSE of elpd_loo is NA.
#> MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 0.6]).
#> 
#> Pareto k diagnostic values:
#>                          Count Pct.    Min. ESS
#> (-Inf, 0.7]   (good)     0      0.0%   <NA>    
#>    (0.7, 1]   (bad)      3     75.0%   <NA>    
#>    (1, Inf)   (very bad) 1     25.0%   <NA>    
#> See help('pareto-k-diagnostic') for details.
print(loo_agd_as_ipd) # ll matrix is 4000 x 1297
#> 
#> Computed from 4000 by 1297 log-likelihood matrix.
#> 
#>          Estimate   SE
#> elpd_loo   -670.5 17.6
#> p_loo         3.9  0.2
#> looic      1341.0 35.2
#> ------
#> MCSE of elpd_loo is 0.0.
#> MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.2]).
#> 
#> All Pareto k estimates are good (k < 0.7).
#> See help('pareto-k-diagnostic') for details.

With 1 IPD study vs. 30-40 aggregate studies, those 30-40 dominate the comparison anyway, so it is unlikely to matter. And if the comparison were so close to the edge that this would affect the decision, that would show up in the uncertainty estimate of the elpd difference anyway, and other non-modelled aspects would likely have a bigger impact.

In the first one (the combined fit), if those bad k values are near 0.7, they might be bad just by chance, and you can try what happens simply by increasing the number of posterior draws.
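
For example (assuming nma() forwards sampler arguments such as chains and iter to rstan::sampling, which the multinma documentation suggests, but worth double-checking):

# Refit with more posterior draws to see whether the k-hat values near 0.7
# were just Monte Carlo noise (iter = 4000 with 4 chains roughly doubles
# the post-warmup draws compared with the rstan default).
pso_fit_FE_more <- nma(pso_net,
                       trt_effects = "fixed",
                       link = "probit",
                       likelihood = "bernoulli2",
                       class_interactions = "common",
                       prior_intercept = normal(scale = 10),
                       prior_trt = normal(scale = 10),
                       prior_reg = normal(scale = 10),
                       init_r = 0.1, QR = TRUE,
                       chains = 4, iter = 4000)
loo(pso_fit_FE_more)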

In the second one (the AgD-only fit), 4-fold-CV would be a good choice.
